JP2010020530A

JP2010020530A - Document classification providing device, document classification providing method and program

Info

Publication number: JP2010020530A
Application number: JP2008180200A
Authority: JP
Inventors: Yasukazu Mizushima; 靖和水嶋
Original assignee: Asahi Kasei Corp
Current assignee: Asahi Kasei Corp
Priority date: 2008-07-10
Filing date: 2008-07-10
Publication date: 2010-01-28

Abstract

PROBLEM TO BE SOLVED: To provide a document classification assigning device, a document classification method, and a program that reduce omissions in assigning classification categories to documents and allow highly reliable classification categories to be assigned.

SOLUTION: A feature extraction unit 12 extracts the IPC (International Patent Classification) codes from each patent publication input from a document input unit 11. A document feature vector generation unit 13 refers to a main feature dictionary 14 to pick out the lead IPC from the IPCs obtained by the feature extraction unit 12, weights the lead IPC, and then generates a document vector whose elements are the IPCs. An inter-document distance calculation unit 15 calculates the distances between patent publications from the document vectors. A two-dimensional coordinate mapping processing unit 16 places a coordinate point for each patent publication on two-dimensional coordinates so that the distance relationships between the patent publications are reflected. An unclassified subset selection unit 17 sets, on the two-dimensional coordinates, a circular area containing a classified patent publication that has the lead IPC, and selects the set of coordinate points inside the circular area that correspond to unclassified patent publications.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a document classification assigning device, a document classification assigning method, and a program that assign predetermined classification categories to input documents. More specifically, it relates to a device, method, and program that, based on documents to which a predetermined classification category has already been assigned, assign the category to documents that have none and present a search formula for retrieving the documents to which that category should be assigned. In particular, it relates to a device, method, and program that use a two-dimensional map to select patent publications related to a predetermined technical classification, assign the classification category to them, and present a search formula for retrieving those patent publications.

Methods for automatically classifying input documents by some index have been proposed, but none of them achieves fully automatic classification: even after automatic classification, a person still had to check the content and make the final judgment. One technique for automatically classifying and searching documents is described in Patent Document 1. It narrows the search target by the items that make up a document, which enables purpose-specific searches and improves search accuracy.

Documents are generally classified on the basis of their content. Patent Document 2 describes a technique that makes the similarity relationships between patent publications visually recognizable by expressing content similarity as distance. The technique displays a plurality of documents on two-dimensional coordinates according to their content: it obtains the inter-document distance for every pair of documents and places the documents on the two-dimensional coordinates so as to reproduce the relative magnitudes of those distances. Documents with similar content are placed close together and the whole set can be seen at once, so the document set can be surveyed at a glance. This also makes assigning classification categories easier, because documents with similar content lie near one another.
Patent Documents: JP-A-11-296550; JP 2006-190235 A

The words used in documents serve as an index for judging whether their contents are similar, but different words often express the same content, and the same word often expresses different content. Patent Document 1 decides mechanically, from the presence or absence of specified keywords, whether documents express the same content. A document with similar content is therefore missing from the search results if it lacks the specified keywords (search omission), and a document that contains the keywords may still differ in content (search noise). Moreover, document content cannot be judged from the presence or absence of words alone, so the results need to be re-evaluated, yet Patent Document 1 says nothing about how to evaluate them.

In Patent Document 2, the document set is placed on two-dimensional coordinates according to inter-document similarity, so the boundary between similar and dissimilar content is unclear. Even when document A and document B lie close together, document B and document C lie close together, and each of those pairs has similar content, documents A and C are often not similar to each other. FIG. 13 shows an example of character discrimination in which this problem arises. In the figure, character 111 represents the katakana "ヤ" (ya) and character 113 represents the katakana "カ" (ka); character 112 represents either of them or some other symbol. Characters 111 and 112 are highly similar in that both can be read as "ヤ", so they are placed close together on the two-dimensional coordinates. Characters 112 and 113 are highly similar in that both can be read as "カ", so they too are placed close together. Yet characters 111 and 113 are "ヤ" and "カ", which are entirely different. FIG. 13 uses katakana as the example, but the same problem arises for the similarity of document content. For example, given documents A, B, and C, it is unclear whether a boundary based on content similarity should be drawn between A and B or between B and C. Furthermore, the very existence of such a relationship among documents A, B, and C cannot be read from the two-dimensional layout.

The present invention addresses these unsolved problems of the prior art. Its object is to provide a document classification assigning device, a document classification method, and a program that reduce omissions in assigning classification categories to documents and allow highly reliable classification categories to be assigned.

To solve the above problems, the document classification assigning device according to claim 1 of the present invention assigns a predetermined classification category to unclassified documents (documents to which no classification category has been assigned) on the basis of classified documents (documents to which the predetermined classification category has been assigned). The device comprises: a document input unit that inputs a plurality of documents including the classified documents and the unclassified documents; a feature extraction unit that extracts features from each document input from the document input unit; a main feature dictionary storage unit that stores a rule for extracting a main feature from among the features; a document feature vector generation unit that refers to the rule stored in the main feature dictionary storage unit to extract the main feature from the features obtained by the feature extraction unit, applies a weight greater than 1 to the value of the main feature, and generates a feature vector whose elements are the features after weighting; an inter-document distance calculation unit that calculates the distances between documents using the feature vectors generated by the document feature vector generation unit; a two-dimensional coordinate mapping processing unit that places a coordinate point for each document on two-dimensional coordinates so that the distance relationships obtained by the inter-document distance calculation unit are reflected; and an unclassified subset selection unit that selects, from the two-dimensional coordinates, the set of coordinate points corresponding to unclassified documents that have the main feature and to which the predetermined classification category should be assigned.

According to the invention of claim 1, the inter-document distances are calculated after the main feature of each document has been weighted, and the coordinate point of each document is placed on two-dimensional coordinates so that those distance relationships are reflected. The documents that are candidates for classification can therefore be laid out so that they separate from the other documents. Selecting the unclassified documents that lie close to classified documents as the documents to be classified yields few omissions and a highly reliable assignment of classification categories.

The document classification assigning device according to claim 2 is the device of claim 1, further comprising: a filtering feature extraction unit that extracts, from the classified documents input from the document input unit, filtering features used to judge document similarity; and a classification removal unit that, for each predetermined classification category, judges the similarity between the shape of the frequency distribution of the filtering features over all the classified documents and the shape of the frequency distribution of the filtering features in each individual classified document, and removes the classification category assigned to a given classified document when (i) that document's frequency distribution is judged not to be similar in shape to the distribution over all the classified documents and (ii) the most frequent filtering feature in the distribution over all the classified documents differs from the most frequent filtering feature in that document's distribution. The documents obtained from the classification removal unit are input to the two-dimensional coordinate mapping processing unit so that their coordinate points are placed on the two-dimensional coordinates.

According to the invention of claim 2, the similarity of each classified document's filtering-feature frequency distribution to the overall distribution is judged. When a given classified document is judged dissimilar, the category assigned to it may be inappropriate and unreliable, so the category is removed from that document. This leaves a set of classified documents whose categories are highly reliable. That set can then be used to select the appropriate documents to classify, which makes a highly reliable assignment of classification categories possible.
The document classification assigning device according to claim 3 is the device of claim 2, in which the classification removal unit uses a chi-square test to judge the similarity of the shapes of the filtering-feature frequency distributions.
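
The removal condition of claims 2 and 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patent's implementation: the function names are hypothetical, the overall distribution is assumed to supply the expected frequencies, and the critical value is left to the caller, because this section does not say how the chi-square test is configured.

```python
from collections import Counter


def chi_square_statistic(doc_counts, overall_counts):
    """Pearson chi-square statistic of one document's feature counts against
    the counts over all classified documents.

    Assumption: every feature of the document also appears in overall_counts
    with a positive count, and the document has at least one feature.
    """
    doc_total = sum(doc_counts.values())
    overall_total = sum(overall_counts.values())
    statistic = 0.0
    for feature, overall in overall_counts.items():
        # Expected count if the document followed the overall distribution.
        expected = doc_total * overall / overall_total
        observed = doc_counts.get(feature, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


def should_remove_category(doc_counts, overall_counts, critical_value):
    """Hypothetical helper: True only when both removal conditions hold."""
    # Condition (i): the distribution shapes are judged dissimilar.
    dissimilar = chi_square_statistic(doc_counts, overall_counts) > critical_value
    # Condition (ii): the most frequent feature differs from the overall one.
    top_doc = Counter(doc_counts).most_common(1)[0][0]
    top_all = Counter(overall_counts).most_common(1)[0][0]
    return dissimilar and top_doc != top_all
```

A caller would pick `critical_value` from a chi-square table for the chosen significance level and degrees of freedom. Ties for the most frequent feature are broken by insertion order here, which is a simplification.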

The document classification assigning device according to claim 4 is the device of any one of claims 1 to 3, in which the unclassified subset selection unit sets, on the two-dimensional coordinates, a circular area containing the coordinate points of the classified documents and selects the set of coordinate points that lie within the circular area and correspond to unclassified documents having the main feature.
According to the invention of claim 4, restricting the candidates to unclassified documents that have the main feature makes it possible to obtain a boundary on the two-dimensional coordinates between the coordinate points of candidate documents and all other points. Narrowing the candidates further to the circular area containing the coordinate points of the classified documents makes the assignment efficient.
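
The circular-area selection of claim 4 might look like the sketch below. How the circle is chosen is not specified in this section, so the centre (centroid of the classified points) and the radius (distance to the farthest classified point) are assumptions, as are the function and parameter names.

```python
import math


def select_in_circle(points, classified_ids, candidate_ids):
    """Return the sorted candidate ids that fall inside the circle.

    points:         dict mapping document id -> (x, y) on the 2-D map
    classified_ids: ids of documents that already carry the category
    candidate_ids:  ids of unclassified documents that have the main feature
    """
    # Assumed circle: centred on the centroid of the classified points.
    cx = sum(points[i][0] for i in classified_ids) / len(classified_ids)
    cy = sum(points[i][1] for i in classified_ids) / len(classified_ids)
    # Assumed radius: just large enough to contain every classified point.
    radius = max(math.hypot(points[i][0] - cx, points[i][1] - cy)
                 for i in classified_ids)
    return sorted(i for i in candidate_ids
                  if math.hypot(points[i][0] - cx, points[i][1] - cy) <= radius)
```

Filtering to documents with the main feature happens before the call, so the function only has to test circle membership.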

The document classification assigning device according to claim 5 is the device of any one of claims 1 to 4, in which the input documents are patent publications; the extracted document features and filtering features are IPC (International Patent Classification) codes, FI codes, FT (File Forming Term: F-term) codes, or words in the patent publications; the classification categories are technical classification categories; and the rule stored in the main feature dictionary storage unit is to extract the lead IPC, an FI similar in content to the lead IPC, an FT related to that FI, or words in the patent publication.
According to the invention of claim 5, technical classifications can be assigned to patent publications with few omissions and high reliability.

The document classification method according to claim 6 is a document classification assigning method performed by a document classification assigning device that assigns a predetermined classification category to unclassified documents (documents to which no classification category has been assigned) on the basis of classified documents (documents to which the predetermined classification category has been assigned). The method comprises: a document input step of inputting a plurality of documents including the classified documents and the unclassified documents; a feature extraction step of extracting features from each document input in the document input step; a main feature dictionary storage step of storing, in a main feature dictionary storage unit, a rule for extracting a main feature from among the features; a document feature vector generation step of referring to the main feature dictionary storage unit to extract the main feature from the features obtained in the feature extraction step, applying a weight greater than 1 to the main feature, and generating a feature vector whose elements are the features after weighting; an inter-document distance calculation step of calculating the distances between documents using the feature vectors generated in the document feature vector generation step; a two-dimensional coordinate mapping processing step of placing a coordinate point for each document on two-dimensional coordinates so that the distance relationships obtained in the inter-document distance calculation step are reflected; and an unclassified subset selection step of selecting, from the two-dimensional coordinates, the coordinate points corresponding to unclassified documents that have the main feature and to which the predetermined classification category should be assigned.

The document classification method according to claim 7 is the method of claim 6, further comprising: a filtering feature extraction step of extracting, from the classified documents input in the document input step, filtering features used to judge document similarity; and a classification removal step of, for each predetermined classification category, judging the similarity between the frequency distribution of the filtering features over all the classified documents and the frequency distribution of the filtering features in each individual classified document, and removing the classification category assigned to a given classified document when (i) that document's frequency distribution is judged not to be similar in shape to the distribution over all the classified documents and (ii) the most frequent filtering feature in the distribution over all the classified documents differs from the most frequent filtering feature in that document's distribution. In the two-dimensional coordinate mapping processing step, the coordinate points corresponding to the documents obtained in the classification removal step are placed on the two-dimensional coordinates.

The document classification method according to claim 8 is the method of claim 7, in which the classification removal step uses a chi-square test to judge the similarity of the shapes of the filtering-feature frequency distributions.
The document classification method according to claim 9 is the method of any one of claims 6 to 8, in which the unclassified subset selection step sets, on the two-dimensional coordinates, a circular area containing the coordinate points of the classified documents and selects the set of coordinate points that lie within the circular area and correspond to unclassified documents having the main feature.

The document classification method according to claim 10 is the method of any one of claims 6 to 9, in which the input documents are patent publications; the extracted document features and filtering features are IPC (International Patent Classification) codes, FI codes, FT (File Forming Term: F-term) codes, or words in the patent publications; the classification categories are technical classification categories; and the rule stored in the main feature dictionary storage unit is to extract the lead IPC, an FI similar in content to the lead IPC, an FT related to that FI, or words in the patent publication.

The program according to claim 11 causes a computer to execute the method of any one of claims 6 to 10.
According to the invention of claim 11, the method of any one of claims 6 to 10 can be carried out by a computer by storing this program in the computer, either by downloading it from a server or by copying it from a recording medium, and executing it.

According to the present invention, the main feature of each document is weighted, the inter-document distances are represented on two-dimensional coordinates, and the unclassified documents to which a classification category should be assigned are selected from that layout. Classification categories can therefore be assigned with few omissions and high reliability.

An embodiment of the document classification assigning device of the present invention is described below with reference to the drawings.

This embodiment describes the case in which, from a set of patent publications retrieved by searching for a specific technology, two sets are input to the document classification assigning device: a set of classified patent publications, to which a technical classification category (hereinafter simply "technical classification") has already been assigned, and a set of unclassified patent publications, to which none has been assigned. The technical classification assigned in advance is, for example, whether the publication belongs to a specific technical field: ○ (belongs) or × (does not belong).

FIG. 1 is a configuration diagram of the document classification assigning device according to this embodiment. As shown in the figure, the device comprises a document input unit 11, a feature extraction unit 12, a document feature vector generation unit 13, a main feature dictionary 14 (corresponding to the "main feature dictionary storage unit"), an inter-document distance calculation unit 15, a two-dimensional coordinate mapping processing unit 16, and an unclassified subset selection unit 17. The main feature dictionary 14 is a database held in a storage device (not shown) of the device. The document input unit 11, feature extraction unit 12, document feature vector generation unit 13, inter-document distance calculation unit 15, two-dimensional coordinate mapping processing unit 16, and unclassified subset selection unit 17 are functions realized when a CPU (not shown) of the device executes a program stored in the storage device.

The set of unclassified patent publications 100 and the set of classified patent publications 101 are input to the document input unit 11.
The feature extraction unit 12 extracts the IPC (International Patent Classification) codes of each patent publication as its features.
The document feature vector generation unit 13 generates, for each patent publication, a document vector whose elements are the IPCs extracted by the feature extraction unit 12 and whose values are their frequencies of occurrence. In doing so it refers to the main feature dictionary 14 and multiplies the value corresponding to the lead IPC by 10. The lead IPC is the classification that represents the invention described in the patent publication and is generally listed first in the IPC field of the publication.
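
As a rough illustration of the document vector described here, the sketch below builds a sparse vector from a list of IPC codes. The function name, the dictionary representation, and the convention that the first code in the list is the lead IPC are assumptions made for the example; the factor of 10 is the weight stated in this embodiment.

```python
from collections import Counter


def build_document_vector(ipc_codes, lead_weight=10.0):
    """Build a sparse document vector {IPC code: weighted frequency}.

    ipc_codes:   IPC codes in the order they are listed; the first entry is
                 assumed to be the lead IPC.
    lead_weight: multiplier applied to the lead IPC's value (10 in this
                 embodiment; any value greater than 1 is allowed).
    """
    counts = Counter(ipc_codes)
    vector = {code: float(n) for code, n in counts.items()}
    if ipc_codes:
        # Emphasise the lead IPC so documents sharing it end up closer.
        vector[ipc_codes[0]] *= lead_weight
    return vector
```

For example, a publication listing `G06F 17/30` first and `G06Q 10/00` second would get the values 10.0 and 1.0 for those two elements.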

The main feature dictionary 14 stores the main feature (the lead IPC in this embodiment) and the rule for extracting the main feature from the features extracted from the patent publications input from the document input unit 11. In this embodiment it stores the rule for extracting the lead IPC from the IPCs listed in a patent publication, together with the rule that gives the extracted lead IPC a weight of 10.

The rule for extracting the main feature amount is not limited to the first IPC; it may instead extract an FI similar in content to the first IPC, an FT (File Forming Term, or F-term) related to that FI, or a word in the patent publication.
The inter-document distance calculation unit 15 calculates the distance between every pair of documents from the generated document vectors. In this embodiment, the cosine distance is used as the inter-document distance. As shown in FIG. 2, the cosine distance is obtained as 1 - cos(θ), where θ is the angle between the two document vectors. When two documents have identical document vectors, the distance is 0, and the two documents are placed at the same point when mapped onto the two-dimensional coordinates.
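As a rough sketch of the cosine distance 1 - cos(θ), assuming document vectors are held as dictionaries that map an IPC code to its weighted frequency (a representation chosen here for illustration, not specified in the embodiment):

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(theta) for two sparse vectors given as dicts."""
    dot = sum(value * v.get(key, 0) for key, value in u.items())
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    if norm_u == 0 or norm_v == 0:
        # Assumption: a publication with no IPC is treated as maximally
        # distant; the embodiment does not discuss this case.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)
```

Because all frequencies are non-negative, the distance lies between 0 (same direction) and 1 (no IPC in common).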

The two-dimensional coordinate mapping processing unit 16 maps a coordinate point for each patent publication onto two-dimensional coordinates so that the relative magnitudes of the cosine distances are reproduced, and highlights the coordinate points of the patent publications in the classified patent publication set 101, to which the technical classification has been assigned in advance. The coordinates of each point are obtained by a known multidimensional scaling method. FIG. 3 is a two-dimensional coordinate map of patent publications that displays the resulting coordinate points. Each point in the figure corresponds to one patent publication, and the black circles are the highlighted publications whose membership in the specific technical classification is ○ (belongs). Because the feature amount vectors weight the first IPC by a factor of 10, patent publications with the same first IPC cluster together in the map. FIG. 4 is the corresponding map when the first IPC is not weighted during document vector creation. Again, each point corresponds to one patent publication, and the black circles are publications whose membership in the specific technical classification is ○ (belongs). FIG. 4 expresses only the relationships between documents, and the semantic boundaries are unclear. Comparing FIG. 3 with FIG. 4 shows that weighting the first IPC makes the semantic boundaries between documents apparent. The weight for the first IPC is not limited to 10; any weight greater than 1 gives the same kind of effect.

The unclassified subset selection unit 17 selectively extracts, around the locations marked with black circles, the coordinate points of unclassified patent publications to which the specific patent classification should be assigned. In doing so, it refers to the main feature amount dictionary 14 and restricts the extracted coordinate points to those of patent publications whose first IPC matches the first IPC stored in the main feature amount dictionary 14, so that the selection respects the boundary of the selection range. FIG. 5 shows part of a circular region selected as the region from which such coordinate points are extracted. Patent publications whose coordinate points lie inside the circled region "A" are likely to warrant the technical classification; selecting them as classification targets effectively reduces omissions in assigning the technical classification.
Furthermore, the IPCs that appear frequently in the classified and unclassified patent publications within the extraction region can be extracted and added to the search condition, which yields a search expression that returns more accurate results.
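One possible way to derive such a search condition is sketched below. The embodiment does not say how "frequently" is decided or what the search expression looks like, so the `top_n` cutoff, the `IPC=` field syntax, and both function names are hypothetical.

```python
from collections import Counter


def frequent_ipcs(publications, top_n=3):
    """Return the top_n IPC codes by number of publications containing them.

    publications: one list of IPC codes per publication in the region.
    """
    counts = Counter()
    for ipc_codes in publications:
        counts.update(sorted(set(ipc_codes)))  # count each IPC once per publication
    return [code for code, _ in counts.most_common(top_n)]


def extend_search_expression(base_expression, ipc_codes):
    """Append an IPC condition to a search expression (hypothetical syntax)."""
    if not ipc_codes:
        return base_expression
    ipc_condition = " OR ".join(f"IPC={code}" for code in ipc_codes)
    return f"({base_expression}) AND ({ipc_condition})"
```

For example, `extend_search_expression("keyword", frequent_ipcs(region_publications, top_n=2))` narrows an existing keyword query to the two IPCs most common in the selected region, where `region_publications` stands for the publications inside that region.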

FIG. 7 shows the accuracy of classification assignment and non-assignment in regions "A" and "ア" shown in FIG. 6. The accuracy (correct-answer rate) in the figure is based on the number of correct patent publications in each region, that is, publications whose membership in the specific technology is ○ (belongs). In region "A", correct patent publications are extracted with 35% accuracy, while in region "ア", which lies outside the selection target, the non-assignment accuracy is 100%. The low assignment accuracy in region "A" indicates considerable search noise but also suggests few search omissions. The 100% non-assignment accuracy in region "ア" means that the region contains no correct patent publications, so region "ア" can be excluded from extraction when assigning classifications, which demonstrates the effectiveness of this method.

Effective selection of publications is also possible by using, instead of the frequency, the proportion of publications in the classified patent publication set 101 whose membership is ○ (belongs). FIG. 8 is a three-dimensional display obtained by adding the number of patent publications as the z-axis to the two-dimensional coordinate map. In the figure, darker regions are regions with a higher proportion of patent publications whose membership is ○ (belongs). The three-dimensional region "A" in FIG. 8 corresponds to the two-dimensional region "A" in FIG. 6. Region "α" denotes the part, other than region "A", of the area containing coordinate points of patent publications with the same first IPC. The coordinate points of patent publications with the same first IPC are concentrated in region "α", whereas those of publications whose membership is ○ (belongs) are concentrated in region "A"; by selecting only region "A", the unclassified subset selection unit 17 can effectively select the patent publications to which the specific patent classification is to be assigned.

FIG. 9 is a flowchart showing the operation of the unclassified subset selection unit 17.
In S1301, the storage area for the unclassified document set to be output at the end of this process is initialized.
In S1302, it is determined whether all patent publications whose membership is ○ (belongs) have been processed. If they have, the unclassified document set is output in S1309 as the classification target patent publication subset 18, to which the specific patent classification should be assigned, and the process ends. If not, the process proceeds to S1303.

In S1303, one unprocessed patent publication is selected from those whose membership is ○ (belongs). In S1304, a circular region with a predetermined radius Rmax is set, centered on the coordinate point of the selected publication on the two-dimensional coordinates, and in S1305 it is determined whether the circular region contains any patent publication whose membership is ○ (belongs). If it contains none, then in S1306 the variable o, which denotes the center of the circle, is set to the coordinates of the selected publication's point, and the variable r, which denotes the radius, is set to a predetermined value Rmin. If it does contain such publications, then in S1307 the variable o is set to the centroid of the coordinate points of the ○ (belongs) publications inside the circular region, and the variable r is set to the smallest value for which the circle contains all of those coordinate points.
In S1308, the unclassified patent publications inside the circle defined by center o and radius r are added to the unclassified subset.
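Steps S1301 to S1309 can be sketched as below. This is an illustrative reading of the flowchart, not the embodiment's code. It assumes that points are (x, y) tuples, that the check in S1305 looks for classified points other than the selected one, and that the selected point is included when the centroid is computed; the function name and return format are likewise introduced here.

```python
import math


def select_unclassified(classified, unclassified, r_max, r_min):
    """Return indices of unclassified points covered by the selection circles.

    classified, unclassified: lists of (x, y) points on the 2D map.
    """
    selected = set()                                  # S1301: initialize output
    for i, p in enumerate(classified):                # S1302/S1303: next classified point
        # S1304/S1305: other classified points within r_max of p
        # (assumption: the selected point itself is not counted).
        neighbors = [q for j, q in enumerate(classified)
                     if j != i and math.dist(p, q) <= r_max]
        if not neighbors:                             # S1306: isolated point
            center, radius = p, r_min
        else:                                         # S1307: centroid and enclosing radius
            group = [p] + neighbors
            center = (sum(x for x, _ in group) / len(group),
                      sum(y for _, y in group) / len(group))
            radius = max(math.dist(center, q) for q in group)
        for k, u in enumerate(unclassified):          # S1308: collect covered points
            if math.dist(center, u) <= radius:
                selected.add(k)
    return sorted(selected)                           # S1309: output the subset
```

The restriction to publications sharing the first IPC, described for the unclassified subset selection unit 17, is not shown; it could be applied by filtering `unclassified` before the call.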

Embodiment 2 adds a classification removal process to Embodiment 1. FIG. 10 is a configuration diagram of the document classification assigning device according to this embodiment. It differs from the device of Embodiment 1 in that a classified document input unit 20, a filtering feature amount extraction unit 21, and a classification removing unit 22 are added; otherwise it is the same as Embodiment 1.
The filtering feature amount extraction unit 21 extracts IPCs, as filtering feature amounts used for similarity determination, from the classified patent publications input from the classified document input unit 20.

For the set of patent publications whose membership is ○ (belongs), the classification removing unit 22 first extracts the IPC histogram of the set as a whole. Then, for each patent publication in the set, it determines the similarity between the publication's IPC histogram and the histogram of the whole set, and treats publications judged dissimilar (for example, because the similarity is below a reference value) as targets for removal of the technical classification. The patent publications obtained in this way by the classification removing unit 22 are input to the two-dimensional coordinate mapping processing unit 16, which arranges the corresponding coordinate points on the two-dimensional coordinates.

FIG. 11 is a flowchart showing the operation of the classification removing unit 22.
In S71, a histogram (overall histogram) of IPC appearance frequencies is created for the entire set of patent publications under examination, that is, those whose membership is ○ (belongs).
In S72, it is determined whether all patent publications under examination have been examined. If so, the process ends; if not, one unexamined patent publication is selected in S73.
In S74, a histogram (individual histogram) of the IPCs of that patent publication is created.
In S75, the similarity between the distributions of the overall histogram and the individual histogram is measured. In this embodiment, the chi-square value is used.

In S76, the chi-square value obtained in S75 is tested at a significance level of 5%. If the distributions of the overall and individual histograms are judged similar, the process moves on to the next patent publication. If they are judged dissimilar, then in S77 the most frequent IPC in the overall histogram is compared with the most frequent IPC in the individual histogram; if they are the same, the assigned classification is judged plausible and the process moves on to the next patent publication. If the most frequent IPCs differ, the assigned technical classification is judged doubtful, the technical classification assigned to that patent publication is removed in S78, and the process moves on to the next patent publication.
This process removes doubtful technical classifications and yields a set of patent publications with highly reliable technical classifications, which enables effective selection of the patent publications to which the technical classification should be assigned.
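Steps S71 to S78 can be sketched as follows. The embodiment states only that a chi-square value is tested at the 5% level, so the details here are assumptions: the statistic is computed as a goodness-of-fit of the individual histogram against counts expected from the overall proportions, and the 5% critical value for the relevant degrees of freedom is supplied by the caller (for example, 5.991 for two degrees of freedom). The function names are also hypothetical.

```python
from collections import Counter


def chi_square(individual, overall):
    """Goodness-of-fit statistic of an individual histogram against the overall one."""
    n = sum(individual.values())
    total = sum(overall.values())
    stat = 0.0
    for code, count in overall.items():
        expected = n * count / total          # count expected from overall proportions
        observed = individual.get(code, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


def filter_classified(publications, critical_value):
    """Return indices of publications whose classification would be removed.

    publications: one list of IPC codes per classified publication.
    """
    overall = Counter(code for pub in publications for code in pub)  # S71
    if not overall:
        return []
    overall_top = overall.most_common(1)[0][0]
    removed = []
    for index, pub in enumerate(publications):                       # S72/S73
        if not pub:
            continue                     # assumption: nothing to test without IPCs
        individual = Counter(pub)                                     # S74
        if chi_square(individual, overall) <= critical_value:        # S75/S76
            continue                     # distributions judged similar: keep
        if individual.most_common(1)[0][0] == overall_top:           # S77
            continue                     # same most frequent IPC: keep
        removed.append(index)                                         # S78
    return removed
```

Passing the critical value explicitly keeps the sketch free of statistics libraries; a fuller implementation would look it up from the number of IPC categories in the overall histogram.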

Embodiment 3 adds to Embodiment 1 a process that divides the input unclassified patent publication set 100, to which no technical classification has been assigned. All other processing is the same. FIG. 12 is a configuration diagram of the document classification assigning device according to this embodiment.
The unclassified document set dividing unit 91 divides the input unclassified patent publication set 100 into three and inputs each of the divided unclassified patent publication sets 92, together with the classified patent publication set 101, to the document input unit 11. This embodiment reduces the processing time required for the two-dimensional coordinate mapping process. Because the unclassified patent publications that receive the classification are determined from their relationship to the classified patent publications, dividing the unclassified patent publication set 100 gives the same effect while shortening the processing time.
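The division can be sketched as below. The embodiment says only that the set is divided into three; splitting into consecutive, nearly equal chunks is an assumption made here, and the function names are hypothetical.

```python
def split_unclassified(unclassified, parts=3):
    """Split a list into `parts` nearly equal consecutive chunks."""
    size, extra = divmod(len(unclassified), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)  # spread the remainder
        chunks.append(unclassified[start:end])
        start = end
    return chunks


def make_batches(unclassified, classified, parts=3):
    """Pair every chunk of unclassified publications with the full classified set."""
    return [(chunk, classified) for chunk in split_unclassified(unclassified, parts)]
```

Each batch is then processed like the single input of Embodiment 1, so the mapping step handles fewer points per run.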

FIG. 1 is a configuration diagram of the document classification assigning device according to Embodiment 1 of the present invention.
FIG. 2 is an explanatory diagram of the inter-document distance.
FIG. 3 is a two-dimensional coordinate map of patent publications when the value of the first IPC is weighted by a factor of 10 during document vector creation.
FIG. 4 is a two-dimensional coordinate map of patent publications when the first IPC is not weighted during document vector creation.
FIG. 5 is an explanatory diagram of the selection target region in the two-dimensional coordinate map of patent publications.
FIG. 6 is an explanatory diagram of the regions in the two-dimensional coordinate map of patent publications.
FIG. 7 is a diagram showing the accuracy of classification assignment and non-assignment in regions "A" and "ア" shown in FIG. 6.
FIG. 8 is a three-dimensional coordinate map of patent publications obtained by adding the number of patent publications as the z-axis to the two-dimensional coordinate map.
FIG. 9 is a flowchart showing the operation of the unclassified subset selection unit according to Embodiment 1 of the present invention.
FIG. 10 is a configuration diagram of the document classification assigning device according to Embodiment 2 of the present invention.
FIG. 11 is a flowchart showing the operation of the classification removing unit according to Embodiment 2.
FIG. 12 is a configuration diagram of the document classification assigning device according to Embodiment 3 of the present invention.
An explanatory diagram of the similarity between three characters.

Explanation of symbols

11 Document input unit
12 Feature amount extraction unit
13 Document feature amount vector generation unit
14 Main feature amount dictionary
15 Inter-document distance calculation unit
16 Two-dimensional coordinate mapping processing unit
17 Unclassified subset selection unit
18 Classification target patent publication subset
20 Classified document input unit
21 Filtering feature amount extraction unit
22 Classification removing unit
91 Unclassified document set dividing unit
92 Divided unclassified patent publication set
100 Unclassified patent publication set
101 Classified patent publication set

Claims

1. A document classification assigning device that assigns a predetermined classification category to unclassified documents, which are documents to which the classification category has not been assigned, based on classified documents, which are documents to which the predetermined classification category has been assigned, the device comprising:
a document input unit that inputs a plurality of documents including the classified documents and the unclassified documents;
a feature amount extraction unit that extracts feature amounts from each document input from the document input unit;
a main feature amount dictionary storage unit that stores a rule for extracting a main feature amount from the feature amounts;
a document feature amount vector generation unit that extracts the main feature amount from the feature amounts obtained by the feature amount extraction unit by referring to the rule stored in the main feature amount dictionary storage unit, applies a weight greater than 1 to the value of the main feature amount, and then generates a feature amount vector having the feature amounts as elements;
an inter-document distance calculation unit that calculates distances between documents using the feature amount vectors generated by the document feature amount vector generation unit;
a two-dimensional coordinate mapping processing unit that arranges a coordinate point corresponding to each document on two-dimensional coordinates so that the relationships among the inter-document distances obtained by the inter-document distance calculation unit are reflected; and
an unclassified subset selection unit that selects, from the two-dimensional coordinates, a set of coordinate points corresponding to unclassified documents that have the main feature amount and to which the predetermined classification category should be assigned.

2. The document classification assigning device according to claim 1, further comprising: a filtering feature amount extraction unit that extracts, from the classified documents input from the document input unit, a filtering feature amount used for document similarity determination; and
a classification removing unit that, for each predetermined classification category, determines the similarity between the shape of the frequency distribution of the filtering feature amount over all the classified documents and the shape of the frequency distribution of the filtering feature amount of each classified document, and that removes the classification category assigned to a given classified document when, based on the determined similarity, the frequency distribution of that document is judged not to be similar in shape to the frequency distribution over all the classified documents and the most frequent filtering feature amount in the frequency distribution over all the classified documents is not the same as the most frequent filtering feature amount in the frequency distribution of that document,
wherein the documents obtained by the classification removing unit are input to the two-dimensional coordinate mapping processing unit so that the coordinate points corresponding to those documents are arranged on the two-dimensional coordinates.

3. The document classification assigning apparatus according to claim 2, wherein the classification removing unit uses a chi-square test process for determining the similarity of the shape of the frequency distribution of the feature quantity for filtering.

4. The document classification assigning device according to claim 1, wherein the unclassified subset selection unit sets, on the two-dimensional coordinates, a circular region containing a coordinate point corresponding to a classified document, and selects the set of coordinate points that are contained in the circular region and correspond to unclassified documents having the main feature amount.

5. The document classification assigning device according to claim 1, wherein the plurality of input documents are patent publications; the extracted feature amount and the filtering feature amount are an IPC (International Patent Classification), an FI, an FT (File Forming Term), or a word in the patent publication; the classification category is a technical classification category; and the rule stored in the main feature amount dictionary storage unit extracts the first IPC, an FI similar in content to the first IPC, an FT related to that FI, or a word in the patent publication.

6. A document classification assigning method performed by a document classification assigning device that assigns a predetermined classification category to unclassified documents, which are documents to which the classification category has not been assigned, based on classified documents, which are documents to which the predetermined classification category has been assigned, the method comprising:
a document input step of inputting a plurality of documents including the classified documents and the unclassified documents;
a feature amount extraction step of extracting feature amounts from each document input in the document input step;
a main feature amount dictionary storing step of storing, in a main feature amount dictionary storage unit, a rule for extracting a main feature amount from the feature amounts;
a document feature amount vector generation step of extracting the main feature amount from the feature amounts obtained in the feature amount extraction step by referring to the main feature amount dictionary storage unit, applying a weight greater than 1 to the main feature amount, and then generating a feature amount vector having the feature amounts as elements;
an inter-document distance calculation step of calculating distances between documents using the feature amount vectors generated in the document feature amount vector generation step;
a two-dimensional coordinate mapping processing step of arranging a coordinate point corresponding to each document on two-dimensional coordinates so that the relationships among the inter-document distances obtained in the inter-document distance calculation step are reflected; and
an unclassified subset selection step of selecting, from the two-dimensional coordinates, coordinate points corresponding to unclassified documents that have the main feature amount and to which the predetermined classification category should be assigned.

7. The document classification assigning method according to claim 6, further comprising: a filtering feature amount extraction step of extracting, from the classified documents input in the document input step, a filtering feature amount used for document similarity determination; and
a classification removal step of, for each predetermined classification category, determining the similarity between the shape of the frequency distribution of the filtering feature amount over all the classified documents and the shape of the frequency distribution of the filtering feature amount of each classified document, and removing the classification category assigned to a given classified document when, based on the determined similarity, the frequency distribution of that document is judged not to be similar in shape to the frequency distribution over all the classified documents and the most frequent filtering feature amount in the frequency distribution over all the classified documents is not the same as the most frequent filtering feature amount in the frequency distribution of that document,
wherein, in the two-dimensional coordinate mapping processing step,
the coordinate points corresponding to the documents obtained in the classification removal step are arranged on the two-dimensional coordinates.

8. The document classification assigning method according to claim 7, wherein in the classification removal step, a chi-square test process is used for determining the similarity of the shape of the frequency distribution of the filtering feature quantity.

9. The document classification assigning method according to any one of claims 6 to 8, wherein, in the unclassified subset selection step, a circular region containing a coordinate point corresponding to a classified document is set on the two-dimensional coordinates, and the set of coordinate points that are contained in the circular region and correspond to unclassified documents having the main feature amount is selected.

10. The document classification assigning method according to claim 6, wherein the plurality of input documents are patent publications; the extracted feature amount and the filtering feature amount are an IPC (International Patent Classification), an FI, an FT (File Forming Term), or a word in the patent publication; the classification category is a technical classification category; and the rule stored in the main feature amount dictionary storage unit extracts the first IPC, an FI similar in content to the first IPC, an FT related to that FI, or a word in the patent publication.

11. A program for causing a computer to execute the method according to any one of claims 6 to 10.