JP2009251796A - Document data sorting device, its method, and program

Info

Publication number: JP2009251796A
Application number: JP2008097163A
Authority: JP
Inventors: Yasukazu Mizushima; 靖和水嶋
Original assignee: Asahi Kasei Corp
Current assignee: Asahi Kasei Corp
Priority date: 2008-04-03
Filing date: 2008-04-03
Publication date: 2009-10-29

Abstract

PROBLEM TO BE SOLVED: To provide a document data sorting device, method, and program capable of sorting document data into prescribed categories with increased recall.
SOLUTION: The document data sorting device includes: a dictionary generation attribute extracting part 4 that extracts F-terms (FT) from known-category patent publication data 3 sorted in advance into the categories "applied" and "not-applied"; a probability value calculation method selecting part 6 that selects the expression used to calculate a probability value E, based on the ratio of the category "applied" to the category "not-applied"; an inter-attribute-category probability value calculating part 5 that calculates the probability value E using the selected expression and stores it in a probability value dictionary 7; an attribute extracting part 2 that extracts the FT of patent publication data 1 to be sorted; and a category determining part 8 that determines the category of the patent publication data 1 to be sorted, using its FT and the probability value dictionary 7, and outputs a determination result 9.
COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a document data sorting apparatus, method, and program for automatically classifying document data into predetermined categories, and more particularly to a document data sorting apparatus, method, and program for automatically classifying patent publications.

Patent Document 1 discloses a conventional technique for the automatic classification of patent publications. This technique classifies publications automatically based on the applicant and inventor fields, and its processing is shown in FIG. 3. An inventor information extraction unit 32 extracts inventor information from the input patent publication data 31 to be classified, a patent publication classification unit 33 classifies the data 31 based on the extracted inventor information, and the result is registered in an inventor-specific patent publication DB 34. This method can classify patent publications by inventor. However, the content and category of an invention do not depend on the inventor, and the same inventor may make several different inventions. Moreover, the data 31 is classified according to the presence or absence of inventor information, not according to the technical content it describes. Consequently, the precision and recall of the classified results were insufficient for efficient secondary search and secondary investigation based on technical content. Precision and recall here are as defined in FIG. 4. Precision is the proportion of relevant patent publications among the automatically classified patent publications, A/(A+B). Recall is the degree to which the relevant patent publications contained in the full set of patent publications were extracted without omission, A/(A+C).
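The two ratios above can be sketched in a few lines of Python. This is only an illustration: the meanings given to A, B, and C (relevant and selected, selected but not relevant, relevant but missed) are inferred from the formulas A/(A+B) and A/(A+C), since FIG. 4 is not reproduced here, and the function name and sample counts are hypothetical.

```python
from typing import Tuple


def precision_recall(a: int, b: int, c: int) -> Tuple[float, float]:
    """Return (precision, recall).

    a: relevant publications that were selected (assumed meaning of A)
    b: non-relevant publications that were selected (assumed meaning of B)
    c: relevant publications that were missed (assumed meaning of C)
    """
    precision = a / (a + b) if (a + b) else 0.0  # A / (A + B)
    recall = a / (a + c) if (a + c) else 0.0     # A / (A + C)
    return precision, recall


# Hypothetical counts, for illustration only.
p, r = precision_recall(a=30, b=10, c=20)
print(f"precision={p:.2f} recall={r:.2f}")
```

A classifier tuned for secondary investigation would accept a larger B (more false hits to review) in exchange for a smaller C (fewer missed publications).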

Conventionally known automatic classification methods aim to improve precision, and various classification techniques exist, including support vector machines, decision trees, and decision lists. A decision list refers to a dictionary, creates a list of the probability values corresponding to the relevant attributes, and takes the category that gives the highest probability value in the list as the classification result.
For example, Non-Patent Document 1 describes a method that applies decision lists to the homonym problem, that is, the problem of selecting, from a set of words sharing the same hiragana spelling, the word that fits the meaning of the sentence.
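As a rough illustration of the decision-list lookup described above, the following Python sketch keeps a dictionary keyed by (attribute, category) and returns the category of the single highest-valued matching entry. The data layout, function name, and default handling are assumptions made for this sketch and are not taken from the cited documents.

```python
from typing import Dict, Iterable, Tuple


def classify_with_decision_list(
    attributes: Iterable[str],
    dictionary: Dict[Tuple[str, str], float],
    default: str,
) -> str:
    """Pick the category of the highest-valued rule that matches an attribute.

    dictionary maps (attribute, category) -> probability value.
    default is returned when no attribute of the document is in the dictionary.
    """
    present = set(attributes)
    # Collect every (value, category) entry whose attribute occurs in the document.
    candidates = [
        (value, category)
        for (attribute, category), value in dictionary.items()
        if attribute in present
    ]
    if not candidates:
        return default
    best_value, best_category = max(candidates, key=lambda vc: vc[0])
    return best_category


# Hypothetical dictionary entries, for illustration only.
rules = {("ft1", "yes"): 0.8, ("ft1", "no"): 0.2, ("ft2", "no"): 0.9}
print(classify_with_decision_list({"ft1", "ft2"}, rules, default="no"))
```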

Non-Patent Document 2 describes a decision-list method that uses Bayesian statistics to calculate the probability values, introducing a uniform distribution or a beta distribution approximately as the prior distribution.
Patent Document 2 describes a decision-list method that uses the difference between attribute-dependent stochastic perplexity and attribute-independent stochastic perplexity.
Patent Document 1: JP 2006-58977 A
Patent Document 2: JP 2001-266060 A
Non-Patent Document 1: IPSJ Journal, Vol. 39, No. 12, "Identification of homonyms using decision lists weighting evidence from compound words" (1998)
Non-Patent Document 2: Natural Language Processing, Vol. 9, No. 3, "Rule reliability estimation method for decision lists using Bayesian statistics" (2001)

However, Patent Document 1 classifies automatically according to the presence or absence of items such as applicant and inventor. The same inventor may make several different inventions, and because publications are not classified based on the technical content they describe, that content is poorly reflected in the classified results. Both precision and recall are therefore insufficient for efficient secondary search and secondary investigation based on technical content. Patent Document 2 and Non-Patent Documents 1 and 2 contribute to improving precision but not recall. In general, precision and recall are in a trade-off relationship, so conventional techniques that aim at higher precision cannot be expected to improve recall. Nor does simply lowering precision improve recall.

These precision-oriented methods are unsuitable for investigations in which few search omissions are desirable. In particular, they are unsuitable for secondary search and secondary investigation of patent publications, in which the contents are scrutinized again after an initial classification.
The present invention has been made in view of these unsolved problems of the conventional art. Its object is to provide a document data sorting apparatus, method, and program that can classify document data into predetermined categories with increased recall.

To solve the above problem, a document data sorting apparatus according to claim 1 of the present invention, which classifies document data into predetermined categories, comprises: attribute extraction means for extracting attributes of input document data to be classified; dictionary-creation attribute extraction means for extracting attributes from known document data classified in advance into predetermined categories; inter-attribute-category probability value calculation means for calculating a probability value representing the relationship between a category of the known document data and an attribute extracted by the dictionary-creation attribute extraction means; probability value calculation method selection means for selecting the calculation method used by the inter-attribute-category probability value calculation means, based on the ratio between the categories of the known document data; a probability value dictionary that stores each probability value calculated by the inter-attribute-category probability value calculation means in combination with the category and the attribute; category determination means for determining the category of the input document data to be classified, using the attributes extracted by the attribute extraction means and the probability value dictionary; and determination result output means for outputting the determination result.

According to the invention of claim 1, selecting the calculation method according to the ratio between the categories of the known document data classified in advance makes it possible to select a calculation method that is expected to improve recall, which enables automatic classification of document data with increased recall.
The document data sorting apparatus according to claim 2 is the apparatus according to claim 1, wherein the probability value calculation method selection means selects a method of calculating the probability value using (Expression 1) or (Expression 2).

The document data sorting apparatus according to claim 3 comprises: attribute extraction means for extracting attributes of input document data to be classified; dictionary-creation attribute extraction means for extracting attributes from known document data classified in advance into predetermined categories; inter-attribute-category probability value calculation means for calculating, using (Expression 3), a probability value representing the relationship between a category of the known document data and an attribute extracted by the dictionary-creation attribute extraction means; probability value calculation parameter determination means for determining the parameters used by the inter-attribute-category probability value calculation means, based on the ratio between the categories of the known document data; a probability value dictionary that stores each probability value calculated by the inter-attribute-category probability value calculation means in combination with the category and the attribute; category determination means for determining the category of the input document data to be classified, using the attributes extracted by the attribute extraction means and the probability value dictionary; and determination result output means for outputting the determination result.

According to the invention of claim 3, determining the parameters used in the calculation according to the ratio between the categories of the known document data classified in advance makes it possible to determine a calculation formula that is expected to improve recall.
The document data sorting apparatus according to claim 4 is the apparatus according to any one of claims 1 to 3, wherein the input document data to be classified is patent publication data, the predetermined category is whether or not the publication belongs to a predetermined technical field, and the attribute extracted by the attribute extraction means and the dictionary-creation attribute extraction means is at least one of an F-term, an IPC (International Patent Classification) code, an FI code, and a word appearing in the patent publication data.
According to the invention of claim 4, using patent publication data as the document data enables automatic classification of patent publications that is expected to produce few search omissions.

The document data sorting method according to claim 5 comprises: an attribute extraction step of extracting attributes of input document data to be classified; a dictionary-creation attribute extraction step of extracting attributes from known document data classified in advance into predetermined categories; a probability value calculation method selection step of selecting, based on the ratio between the categories of the known document data, a calculation method for calculating a probability value representing the relationship between a category of the known document data and an attribute extracted in the dictionary-creation attribute extraction step; an inter-attribute-category probability value calculation step of calculating the probability value using the selected calculation method; a probability value dictionary storage step of storing the calculated probability value in a probability value dictionary in combination with the category and the attribute; a category determination step of determining the category of the input document data to be classified, using the attributes of the document data to be classified and the probability value dictionary; and a determination result output step of outputting the determination result.

According to the invention of claim 5, selecting the calculation method according to the ratio of the categories of the known document data classified in advance makes it possible to select a calculation method that is expected to improve recall.
The document data sorting method according to claim 6 is the method according to claim 5, wherein the probability value calculation method selection step selects a method of calculating the probability value using (Expression 1) or (Expression 2).

The document data sorting method according to claim 7 is a computer-executed method of classifying document data into predetermined categories, comprising: an attribute extraction step of extracting attributes of input document data to be classified; a dictionary-creation attribute extraction step of extracting attributes from known document data classified in advance into predetermined categories; a parameter determination step of determining, based on the ratio between the categories of the known document data, the parameters in (Expression 3) for calculating a probability value representing the relationship between a category of the known document data and an attribute extracted in the dictionary-creation attribute extraction step; an inter-attribute-category probability value calculation step of calculating, using (Expression 3), the probability value representing the relationship between the category of the known document data and the dictionary attribute; a probability value dictionary storage step of storing the calculated probability value in a probability value dictionary in combination with the category and the attribute; a category determination step of determining the category of the input document data to be classified, using the attributes of the document data to be classified and the probability value dictionary; and a determination result output step of outputting the determination result.

According to the invention of claim 7, determining the parameters used in the calculation according to the ratio of the categories of the known document data classified in advance makes it possible to determine a calculation formula that is expected to improve recall.
The program according to the invention of claim 8 is stored in a computer by downloading it from a server or copying it from a recording medium, which allows the computer to carry out the method described in any one of claims 5 to 7.

According to the present invention, selecting the calculation method according to the ratio between the categories of the known document data classified in advance makes it possible to select a calculation method that is expected to improve recall, which enables automatic classification of document data with increased recall.

As a specific example of an embodiment of the present invention, the following describes the case where the input is patent publication data containing the contents of one patent publication. FIG. 1 shows the functional configuration of the document data sorting apparatus according to this example. The apparatus includes an attribute extraction unit 2, a dictionary-creation attribute extraction unit 4, an inter-attribute-category probability value calculation unit 5, a probability value calculation method selection unit 6, a probability value dictionary 7, and a category determination unit 8. The probability value dictionary 7 is a database held in a storage device (not shown) of the apparatus. The attribute extraction unit 2, the dictionary-creation attribute extraction unit 4, the inter-attribute-category probability value calculation unit 5, the probability value calculation method selection unit 6, and the category determination unit 8 are functions realized when a CPU (not shown) of the apparatus executes a program stored in the storage device.
Patent publication data 1 to be classified is input to the document data sorting apparatus, and the attribute extraction unit 2 extracts attribute data of that publication, for example its F-terms (FT: File Forming Term).

The category-known patent publication data 3 is data of patent publications to which a predetermined category has been assigned in advance, and a large number of such publications are prepared. In this example the category expresses membership, ○ or × (belongs or does not belong): category-known patent publication data 3 covering the same technical field as the patent publication data 1 to be classified is labeled ○, and data 3 covering a different technical field is labeled ×. In general, the larger the number of category-known patent publication data 3, the more likely it is that a correct classification result is obtained. Attribute data, for example F-terms (FT), is likewise extracted from the category-known patent publication data 3 by the dictionary-creation attribute extraction unit 4.
The inter-attribute-category probability value calculation unit 5 then expresses the relationship between the attributes extracted by the dictionary-creation attribute extraction unit 4 and the known categories as a probability value. The probability value E is given by (Expression 4) or (Expression 5).

P(category | FT) denotes the conditional probability that the category assigned to a patent publication is the given category when the FT is given, and is calculated from the category-known patent publication data 3. Specifically, for category ○ (belongs), P(○ | FT) = (number of patent publications with category ○) / (number of patent publications having the FT); that is, it is the proportion of publications with category ○ among the publications containing the FT. For example, if 100 patent publications have the FT and 30 of them have category ○, then P(○ | FT) = 0.3.

P(category) denotes the probability that the category assigned to a patent publication is the given category regardless of the value of the FT, and is calculated from the category-known patent publication data 3. Specifically, for category ○ (belongs), P(○) = (number of patent publications with category ○) / (total number of patent publications); that is, it is the proportion of publications with category ○ irrespective of the FT. For example, if there are 1000 category-known patent publications in total and 30 of them have category ○, then P(○) = 0.03.
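The two counts-based estimates, P(category | FT) and P(category), can be sketched as follows. The corpus layout (a list of attribute-set and category pairs), the function name, and the toy data are assumptions for illustration. The toy corpus simply combines the figures quoted in the two examples (100 publications with the FT, 30 of them ○; 1000 publications in total, 30 of them ○) into one data set.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def category_probabilities(
    documents: List[Tuple[Set[str], str]], category: str
) -> Tuple[float, Dict[str, float]]:
    """Estimate P(category) and P(category | attribute) from labeled documents.

    documents: (attributes, known_category) pairs.
    Returns (prior, conditional), where conditional maps each attribute to
    the share of documents carrying it that have the given category.
    """
    if not documents:
        raise ValueError("at least one category-known document is required")
    with_attr: Counter = Counter()          # documents having the attribute
    with_attr_and_cat: Counter = Counter()  # ... that also have the category
    in_category = 0
    for attributes, known in documents:
        if known == category:
            in_category += 1
        for attribute in attributes:
            with_attr[attribute] += 1
            if known == category:
                with_attr_and_cat[attribute] += 1
    prior = in_category / len(documents)
    conditional = {a: with_attr_and_cat[a] / n for a, n in with_attr.items()}
    return prior, conditional


# Toy corpus (hypothetical): "o" stands for the category ○, "x" for ×.
corpus = (
    [({"FT_X"}, "o")] * 30 + [({"FT_X"}, "x")] * 70 + [({"FT_Y"}, "x")] * 900
)
prior, conditional = category_probabilities(corpus, "o")
print(prior, conditional["FT_X"])
```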

Here, ¬category denotes the complement of the category within the set of categories. Specifically, ¬○ means ×.
In other words, the probability value E of (Expression 4) normalizes the probability value by using the difference between the FT-dependent probability value and the FT-independent probability value.
The probability value calculation method selection unit 6 selects either (Expression 4) or (Expression 5) as the expression for calculating the probability value E, according to the ratio of the categories (○, ×) assigned to the category-known patent publication data 3.

For example, if 30 patent publications have category ○ and 970 have category ×, the proportion of × is far higher than that of ○, so the probability value calculation method selection unit 6 selects (Expression 4) and instructs the inter-attribute-category probability value calculation unit 5 accordingly.
The probability value E calculated by the inter-attribute-category probability value calculation unit 5 is registered in the probability value dictionary 7.
The FT extracted by the attribute extraction unit 2 from the patent publication data 1 to be classified is input to the category determination unit 8. The category determination unit 8 refers to the probability value dictionary 7, creates a list of the probability values related to that FT, and selects the category (○ or ×) with the highest probability value. The selected category is output as the determination result 9.
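The selection and registration steps can be sketched as below. Several parts are assumptions: the text gives no numeric threshold for "far higher", so `imbalance_threshold` is a hypothetical parameter; the rule for the balanced case (falling back to Expression 5) is likewise assumed; and because Expressions 4 and 5 are not reproduced in this text, the scoring function is passed in by the caller, with a plain difference used only as a stand-in in the usage lines.

```python
from typing import Callable, Dict, Tuple

# (P(category | FT), P(category)) -> probability value E
Scorer = Callable[[float, float], float]


def select_expression(
    n_positive: int, n_negative: int, imbalance_threshold: float = 10.0
) -> str:
    """Choose which expression to use from the category ratio.

    Returns "expr4" when negatives heavily outnumber positives, otherwise
    "expr5". The threshold value is hypothetical, not from the source.
    """
    if n_positive == 0:
        return "expr4"  # no positives at all: treated as maximally imbalanced
    ratio = n_negative / n_positive
    return "expr4" if ratio >= imbalance_threshold else "expr5"


def build_dictionary(
    conditional: Dict[str, float], prior: float, category: str, scorer: Scorer
) -> Dict[Tuple[str, str], float]:
    """Register E for every attribute, keyed by (attribute, category)."""
    return {(ft, category): scorer(p, prior) for ft, p in conditional.items()}


# Usage with the counts from the example (30 positive, 970 negative).
chosen = select_expression(30, 970)


def difference_score(p: float, q: float) -> float:
    """Stand-in scorer only; NOT the patent's (Expression 4)."""
    return p - q


entries = build_dictionary({"FT_X": 0.3}, 0.03, "o", difference_score)
print(chosen, entries)
```

Keeping the scorer as a parameter means the dictionary-building code does not change when the selection unit switches expressions.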

FIG. 2 shows an example of the results of automatically classifying patent publications using the conventional methods and the method of this example. Specifically, for patent publications belonging to a certain technical field, persons in charge A to G each labeled every publication ○ or × according to whether it relates to a predetermined technology of interest that they had set, and FIG. 2 shows the automatic classification results for the publications labeled ○. Because persons A to G each evaluated from a different viewpoint, the ○/× categories assigned to each patent publication differ by viewpoint.

In the figure, DL denotes the method of Non-Patent Document 1. DL-beta denotes the method of Non-Patent Document 2, with a beta distribution used as the prior distribution.
FIG. 2 shows that recall is markedly higher with the method of the present invention than with the other methods. This indicates that the method of the present invention markedly improves the performance of automatic classification of patent publications, where it is important to minimize search omissions.

Example 2 differs from Example 1 in the formula used to calculate the probability value E, which is (Expression 6). All other processing is the same.

Here, α and β are parameters satisfying α + β = 1. In Example 2 they are given by (Expression 7), whose variable is the ratio r of category ○ among the category-known patent publications.

Here, E in (Expression 6) represents the ratio of the FT-dependent probability value to the FT-independent probability value when α = 1, and the FT-dependent ratio between the category and its complement when β = 1.
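Because (Expression 6) and (Expression 7) are not reproduced in this text, the following is only one possible form that matches the two limiting cases just described: a weighted sum of two log-ratios, which reduces to the ratio against the FT-independent probability when α = 1 and to the ratio against the complement when β = 1. The log-linear form, the clamping constant `eps`, and the function name are all assumptions. How α is derived from r is left to the caller.

```python
import math


def blended_score(
    p_cat_given_ft: float, p_cat: float, alpha: float, eps: float = 1e-9
) -> float:
    """Hypothetical blend of two log-ratios with weights alpha + beta = 1.

    alpha = 1 -> log(P(c|FT) / P(c))       (FT-dependent vs FT-independent)
    alpha = 0 -> log(P(c|FT) / P(not c|FT)) (category vs its complement)
    """
    beta = 1.0 - alpha  # enforces alpha + beta = 1
    # Clamp probabilities away from 0 and 1 so the logarithms stay finite.
    p = min(max(p_cat_given_ft, eps), 1.0 - eps)
    q = min(max(p_cat, eps), 1.0 - eps)
    ratio_to_prior = math.log(p / q)
    ratio_to_complement = math.log(p / (1.0 - p))
    return alpha * ratio_to_prior + beta * ratio_to_complement


# Values from the earlier worked examples: P(o|FT) = 0.3, P(o) = 0.03.
for alpha in (0.0, 0.5, 1.0):
    print(alpha, blended_score(0.3, 0.03, alpha))
```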

Example 3 differs from Example 1 in the number of categories: it uses three categories, A, B, and C. All processing other than the determination in the category determination unit 8 is the same. Which of the categories A, B, and C a publication belongs to is decided in two stages. The first stage determines whether or not the publication is A (¬A = {B, C}), and the second stage determines, for the publications judged ¬A, whether or not they are B (¬B = C).
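The two-stage decision can be sketched as a small cascade of two binary decisions. The binary classifiers are passed in as callables; their names and the toy rules in the usage lines are hypothetical and stand in for the dictionary-based determination of Example 1.

```python
from typing import Callable, Set

BinaryClassifier = Callable[[Set[str]], bool]


def classify_two_stage(
    attributes: Set[str], is_a: BinaryClassifier, is_b: BinaryClassifier
) -> str:
    """Stage 1: A versus not-A ({B, C}). Stage 2: B versus not-B (C)."""
    if is_a(attributes):
        return "A"
    # Only documents judged not-A reach the second stage.
    return "B" if is_b(attributes) else "C"


def toy_is_a(attrs: Set[str]) -> bool:
    """Toy stand-in for the stage-1 binary decision."""
    return "ft_a" in attrs


def toy_is_b(attrs: Set[str]) -> bool:
    """Toy stand-in for the stage-2 binary decision."""
    return "ft_b" in attrs


print(classify_two_stage({"ft_b"}, toy_is_a, toy_is_b))
```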

Example 4 differs from Example 1 in the number of categories: it uses three categories, A, B, and C. All processing other than the determination in the category determination unit 8 is the same. Which of the categories A, B, and C a publication belongs to is decided using the probability values for each category and each ¬category. That is, using the probability values for the three categories and their complements (A, B, C, ¬A, ¬B, ¬C), the determination is made in a single stage from the values calculated by (Expression 5) and (Expression 6).
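The text does not spell out how the six values are combined in the single-stage decision. One plausible reading, sketched here purely as an assumption, is to compare each category's value with the value of its complement and pick the category with the largest margin. The function name and the sample numbers are hypothetical.

```python
from typing import Dict


def classify_one_stage(
    scores: Dict[str, float], complement_scores: Dict[str, float]
) -> str:
    """Pick the category whose value most exceeds that of its complement.

    scores[c] is the value for category c; complement_scores[c] is the value
    for its complement (not-c). The margin rule itself is an assumption.
    """
    margins = {c: scores[c] - complement_scores[c] for c in scores}
    return max(margins, key=lambda c: margins[c])


# Hypothetical values for A, B, C and their complements.
print(
    classify_one_stage(
        {"A": 0.2, "B": 0.9, "C": 0.1},
        {"A": 0.5, "B": 0.3, "C": 0.6},
    )
)
```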

As described above, the method of this example enables automatic classification of patent publications with increased recall, so it can be used for secondary search and secondary investigation of patent publications in which the classification results are scrutinized. The attribute of a patent publication is not limited to the F-term; it may be the IPC (International Patent Classification), the FI (a classification used to organize search files within the Patent Office, consisting of the complete IPC symbol down to the subgroup plus a three-digit number and/or a one-letter alphabetic code), or a word appearing in the patent publication data. The document data is also not limited to patent publication data; it may be any data containing text, such as questionnaires, web text, papers, books, and advertising leaflets.

A block diagram of the document data sorting device according to the first embodiment of the present invention.
An explanatory diagram of the results of automatically classifying patent publications using the conventional method and the method of the first embodiment, respectively.
A diagram for explaining conventional technology for the automatic classification of patent publications.
A diagram for explaining the definitions of precision and recall.

Explanation of symbols

1 Patent publication data to be classified
2 Attribute extraction unit
3 Known-category patent publication data
4 Dictionary-creation attribute extraction unit
5 Inter-attribute-category probability value calculation unit
6 Probability value calculation method selection unit
7 Probability value dictionary
8 Category determination unit

Claims

1. A document data sorting device that sorts document data into predetermined categories, comprising:
attribute extraction means for extracting attributes of input document data to be classified;
dictionary-creation attribute extraction means for extracting attributes from known document data that has been classified into the predetermined categories in advance;
inter-attribute-category probability value calculation means for calculating a probability value representing the relationship between a category of the known document data classified into the predetermined categories and an attribute extracted by the dictionary-creation attribute extraction means;
probability value calculation method selection means for selecting the calculation method to be used by the inter-attribute-category probability value calculation means, based on the ratio between the categories of the known document data classified into the predetermined categories;
a probability value dictionary for storing the probability values calculated by the inter-attribute-category probability value calculation means in combination with the categories and the attributes;
category determination means for determining a category for the input document data to be classified, using the attributes extracted by the attribute extraction means and the probability value dictionary; and
determination result output means for outputting the result of the determination.

2. The document data sorting device according to claim 1, wherein the probability value calculation method selection means selects a method that calculates the probability value using (Equation 1) or (Equation 2).

3. A document data sorting device that sorts document data into predetermined categories, comprising:
attribute extraction means for extracting attributes of input document data to be classified;
dictionary-creation attribute extraction means for extracting attributes from known document data that has been classified into the predetermined categories in advance;
inter-attribute-category probability value calculation means for calculating, using (Equation 3), a probability value representing the relationship between a category of the known document data classified into the predetermined categories and an attribute extracted by the dictionary-creation attribute extraction means;
probability value calculation parameter determination means for determining a parameter to be used by the inter-attribute-category probability value calculation means, based on the ratio between the categories of the known document data classified into the predetermined categories;
a probability value dictionary for storing the probability values calculated by the inter-attribute-category probability value calculation means in combination with the categories and the attributes;
category determination means for determining a category for the input document data to be classified, using the attributes extracted by the attribute extraction means and the probability value dictionary; and
determination result output means for outputting the result of the determination.

4. The document data sorting device according to claim 1, wherein the input document data to be classified is patent publication data, the predetermined categories indicate whether or not the data belongs to a predetermined technical field, and the attribute extracted by the attribute extraction means and the dictionary-creation attribute extraction means is at least one of an F-term, an IPC (International Patent Classification) symbol, an FI symbol, and a word appearing in the patent publication data.

5. A document data sorting method executed by a computer to sort document data into predetermined categories, comprising:
an attribute extraction step of extracting attributes of input document data to be classified;
a dictionary-creation attribute extraction step of extracting attributes from known document data that has been classified into the predetermined categories in advance;
a probability value calculation method selection step of selecting, based on the ratio between the categories of the known document data classified into the predetermined categories, a calculation method for calculating a probability value representing the relationship between a category of the known document data and an attribute extracted in the dictionary-creation attribute extraction step;
an inter-attribute-category probability value calculation step of calculating the probability value using the selected calculation method;
a probability value dictionary storing step of storing the calculated probability value in a probability value dictionary in combination with the category and the attribute;
a category determination step of determining a category for the input document data to be classified, using the attributes of that document data and the probability value dictionary; and
a determination result output step of outputting the result of the determination.

6. The document data sorting method according to claim 5, wherein the probability value calculation method selection step selects a method that calculates the probability value using (Equation 1) or (Equation 2).

7. A document data sorting method executed by a computer to sort document data into predetermined categories, comprising:
an attribute extraction step of extracting attributes of input document data to be classified;
a dictionary-creation attribute extraction step of extracting attributes from known document data that has been classified into the predetermined categories in advance;
a parameter determination step of determining, based on the ratio between the categories of the known document data classified into the predetermined categories, the parameters of (Equation 3), which calculates a probability value representing the relationship between a category of the known document data and an attribute extracted in the dictionary-creation attribute extraction step;
an inter-attribute-category probability value calculation step of calculating, using (Equation 3), a probability value representing the relationship between a category of the known document data classified into the predetermined categories and the attribute extracted for dictionary creation;
a probability value dictionary storing step of storing the calculated probability value in a probability value dictionary in combination with the category and the attribute;
a category determination step of determining a category for the input document data to be classified, using the attributes of that document data and the probability value dictionary; and
a determination result output step of outputting the result of the determination.

8. A program for causing a computer to execute the method according to any one of claims 5 to 7.