JP2010118050A

JP2010118050A - System and method for automatically searching patent literature

Info

Publication number: JP2010118050A
Application number: JP2009239922A
Authority: JP
Inventors: Shigeru Masuyama; 繁増山; Hiroyuki Sakai; 浩之酒井; Hiroshi Nonaka; 尋史野中
Original assignee: Toyohashi University of Technology NUC
Current assignee: Toyohashi University of Technology NUC
Priority date: 2008-10-17
Filing date: 2009-10-17
Publication date: 2010-05-27

Abstract

PROBLEM TO BE SOLVED: To provide a system and method for automatically searching patent literature that, starting from a small quantity of already-reviewed data, automatically searches for relevant patents and extracts the desired ones.

SOLUTION: The system and method include: a relevant patent database that stores, as samples, reviewed patents belonging to the technical field targeted by the search; a non-relevant patent database that stores reviewed patents belonging to other technical fields; a morphological analysis means that performs morphological analysis on the patent literature data stored in the databases; a statistical information calculation means that weights the morphologically analyzed data using word appearance frequency and entropy to select characteristic words; a discriminator that learns the characteristic words by machine learning to discriminate patent literature; and an extraction means that extracts, from the discriminator's results, all patent literature belonging to the targeted technical field.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to an automatic patent document search system and an automatic patent document search method.

An enormous amount of patent information exists, and several databases provide it. For this reason, a variety of techniques for searching patent and technical information have been developed.

For example, Patent Document 1 describes a technique that uses the International Patent Classification to select the classification most appropriate to a document's content and then search with it.

Patent Document 2 uses knowledge from past searches to support steps such as creating a Boolean search expression that suitably combines search keys such as technical terms, IPC, FI, F-terms, and applicants, so that high-accuracy searches can be carried out efficiently.

However, with Patent Documents 1 and 2, the user ultimately cannot obtain the desired group of patents without checking every patent returned as a search result.

In particular, when searching for patents in a field made up of a wide range of technical elements, such as the environmental field, and extracting the desired patents, the user must read a large number of retrieved patents and judge whether each one is relevant, which takes considerable effort.

Patent Document 1: JP 2000-322447 A
Patent Document 2: JP 2007-242004 A

In view of the above prior art, the problem to be solved by the present invention is to provide an automatic patent document search system and an automatic patent document search method that automatically search for related patents based on a small amount of already-reviewed data and extract the desired patents.

The invention is an automatic search system for patent document data, comprising: a related patent database that stores, as samples, already-reviewed patents belonging to the technical field targeted by the search; an unrelated patent database that stores, as samples, already-reviewed patents belonging to other technical fields; a morphological analysis unit that performs morphological analysis on the patent document data stored in the related and unrelated patent databases; a patent identification unit that performs machine learning using, as features, feature words selected from the morphologically analyzed data by weights calculated from appearance frequency and entropy, and that discriminates patent documents belonging to the targeted technical field; and an extraction unit that extracts, from the discrimination results of the patent identification unit, all patent document data belonging to the targeted technical field.

In particular, the patent identification unit computes a weight from appearance frequency and entropy for every word that appears in the positive and negative examples, and if a word's weight in the positive examples is greater than twice its weight in the negative examples, that word is used as a feature word, that is, as a feature for machine learning.

The invention is also an automatic search method for patent document data, comprising: means for storing, as samples in a related patent database, already-reviewed patents belonging to the technical field targeted by the search; means for storing, as samples in an unrelated patent database, already-reviewed patents belonging to other technical fields; means for performing morphological analysis on the patent document data stored in the related and unrelated patent databases; means for performing machine learning using, as features, feature words selected from the morphologically analyzed data by weights calculated from appearance frequency and entropy, and for discriminating patent documents belonging to the targeted technical field; and means for extracting, from the results of the discriminating means, all patent document data belonging to the targeted technical field.

In particular, the means for discriminating patent documents computes a weight from appearance frequency and entropy for every word that appears in the positive and negative examples, and if a word's weight in the positive examples is greater than twice its weight in the negative examples, that word is used as a feature word, that is, as a feature for machine learning.

With the system and method of the present invention, the patent documents a user wants can be searched for and extracted automatically from a small amount of sample data reviewed in advance, which greatly reduces the working time required for patent searches.

The automatic patent document search system according to the present invention consists of: a related patent database and an unrelated patent database that store already-reviewed sample patents; a morphological analysis unit that performs morphological analysis; a statistic calculation unit that calculates weights from the bias and entropy of the words appearing in the stored patent document data; a patent identification unit that performs machine learning on those words and uses the resulting classifier to discriminate and extract related patents; and an extraction unit that extracts patent documents from the results of the patent identification unit.

FIG. 1 shows the procedure of the automatic patent document search system according to the present invention as a flowchart. The operation is described below with reference to FIG. 1.

Step 1: Equal numbers of samples are selected from the related patent database and the unrelated patent database as training data. Hereinafter, samples from the related patent database are called positive examples and samples from the unrelated patent database are called negative examples.

Step 2: The morphological analysis unit performs morphological analysis on each selected positive and negative example. Morphological analysis means splitting text into word units. For example, the sentence 「本発明は、冷蔵庫のリサイクルに関連するものである。」 ("The present invention relates to the recycling of refrigerators.") is split into 「本発明／は／、／冷蔵庫／の／リサイクル／に／関連／する／もの／で／ある。」, where ／ is the delimiter.
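The segmentation in Step 2 can be illustrated with a toy example. The sketch below is not a real morphological analyzer and is not the one used in the patent, which does not name a tool; it is a greedy longest-match splitter over a hypothetical mini-lexicon (`LEXICON`) that covers only the example sentence. Unlike the example in the text, it emits the final period as its own token.

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation; unknown characters become one-character tokens."""
    max_len = max(len(word) for word in lexicon)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first and fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical mini-lexicon: just the words of the example sentence.
LEXICON = {"本発明", "は", "、", "冷蔵庫", "の", "リサイクル",
           "に", "関連", "する", "もの", "で", "ある", "。"}

sentence = "本発明は、冷蔵庫のリサイクルに関連するものである。"
tokens = segment(sentence, LEXICON)
print("／".join(tokens))
```

A production system would replace `segment` with a proper morphological analyzer, since longest-match over a small word list breaks down on ambiguous or unseen text.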

Step 3: Words that occur frequently only in the positive examples, as extracted and scored by the statistic calculation unit, are used as feature words, and the patent identification unit performs machine learning with their occurrence probability in each document as the feature. Machine learning methods include the Support Vector Machine (hereinafter SVM), but any method capable of binary classification may be used. Because the invention uses only word statistics as features, it does not depend on a particular language. It can therefore handle patent documents written in any language, not only Japanese.

Another characteristic of the present invention is that it uses only word information, in the form of feature words. Search systems in other fields, such as Web search, use document-specific information in addition to words (for Web search, tag structure and link structure), but patent documents have nothing equivalent to such tag or link structure, so search and extraction had to rely on word information alone.

Consequently, search systems from other fields such as the Web cannot be used as they are for patent documents, and an evaluation formula based on word information is required. The present invention achieves evaluation based on word information through feature word selection.

The feature words extracted are words that do not occur frequently in patents unrelated to the technology targeted by the search but occur often in patents related to it. For example, if the technology the user wants belongs to a group of patents related to recycling, words such as "recycle" and "reuse" are selected.

Feature words are extracted as follows. First, the content words (hereinafter simply "words") contained in the positive examples are weighted, and the words whose weights fall in the upper half are extracted. The weight for the positive examples is given by Equation 1.

Here, P(t_i, S_p) and H(t_i, S_p) on the right-hand side are given by Equation 2 and Equation 3, respectively.

P(t_i, S_p) is the occurrence probability of word t_i in the positive-example document set S_p.

H(t_i, S_p) is the entropy based on the occurrence probability of word t_i in each document of the positive-example document set S_p. The higher a word's entropy, the more uniformly it is distributed across the positive-example document set.
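Equations 1 to 3 are not reproduced in this text, so the following is only a minimal sketch of how the two quantities described above could be computed. It assumes that P(t, S) is the word's share of all word occurrences in S, that H(t, S) is the base-2 entropy of the word's occurrences over the documents of S, and that Equation 1 multiplies the two. The product form, the log base, and all function names are assumptions, not taken from the source.

```python
import math

def occurrence_probability(term, docs):
    """P(t, S): share of all word occurrences in document set S that are `term`."""
    total = sum(len(doc) for doc in docs)
    hits = sum(doc.count(term) for doc in docs)
    return hits / total if total else 0.0

def entropy(term, docs):
    """H(t, S): entropy (in bits) of how occurrences of `term` spread over the documents."""
    counts = [doc.count(term) for doc in docs]
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def weight(term, docs):
    """Assumed form of Equation 1: W(t, S) = P(t, S) * H(t, S)."""
    return occurrence_probability(term, docs) * entropy(term, docs)

# Hypothetical tokenized positive examples: "recycle" appears once in every
# document, "urethane" appears four times but only in the first document.
positive_docs = [
    ["recycle", "urethane", "urethane", "urethane", "urethane"],
    ["recycle", "door"],
    ["recycle", "door"],
    ["recycle", "door"],
]
print(weight("recycle", positive_docs), weight("urethane", positive_docs))
```

Both words have the same occurrence probability here, but only the evenly spread one receives a non-zero weight, which is the behavior the entropy term is described as providing.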

This measure was introduced because, within the positive-example document set, a word that appears spread across many documents represents the characteristics of the set better than a word concentrated in a few documents, and is therefore more effective as a feature.

This entropy-based approach is well suited to the task of automatically extracting patent documents. Patent documents tend to use fixed technical terms and stock phrases, such as "the present invention ...", so individual differences in vocabulary are smaller than in Web text written in colloquial language. Where vocabulary differs between writers, words with the same meaning are analyzed as different words, and the correct entropy is then unlikely to be obtained. This problem seldom arises in patent documents.

Next, as with the positive examples, the words contained in the negative examples are weighted using Equation 5, and the words whose weights fall in the upper half are extracted.

Here, S_n is the set of documents that belong to the negative examples in the training data. If the weight W_p(t_i, S_p) of a word t_i in the positive examples is greater than twice its weight W_n(t_i, S_n) in the negative examples, the word t_i is selected as a feature. That is, a word t_i that satisfies Equation 6 is selected as a feature.
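Equation 6 itself does not survive in this text, but the sentence above states it in words. Written out, and using S_n for the negative-example document set as the later comparison of the two weights does, the selection condition is:

```latex
W_p(t_i, S_p) > 2\, W_n(t_i, S_n)
```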

The reason for using the condition of Equation 6 is as follows. With the weight of Equation 1, a common word receives a high weight even if it has nothing to do with the related patents. Such a word, however, is likely to receive a high weight in the negative examples as well. For example, the words "present invention" and "patent" appear in almost every patent document, positive or negative, and are unsuitable as feature words representing the positive examples.

The graph in FIG. 2 shows why the selection criterion requires W_p(t_i, S_p) to be twice W_n(t_i, S_n). FIG. 2 plots the weight multiple on the horizontal axis and the F-score on the vertical axis. The F-score is an overall accuracy measure calculated from precision and recall. The F-score peaks at a multiple of about 2, which is therefore the best value in terms of accuracy. Such an optimum exists because precision, which measures how correct the extracted examples are, and recall, which measures how completely they are covered, trade off against each other.
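The text does not give the formula behind the F-score. Assuming the usual balanced F-measure, the harmonic mean of precision P and recall R (an assumption, since the source only says that it is calculated from the two), it would be:

```latex
F = \frac{2\,P\,R}{P + R}
```

Under that definition F is high only when both precision and recall are high, which is consistent with the trade-off described above.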

Therefore, the weight W_p(t_i, S_p) of a word t_i in the positive examples is compared with its weight W_n(t_i, S_n) in the negative examples, and only words for which W_p(t_i, S_p) is greater than twice W_n(t_i, S_n) are selected. This prevents common words from being selected as features. In other words, words that appear in patent documents in general, such as "present invention" and "patent", do not become feature words.
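The selection rule can be sketched as follows. This is one reading of the procedure, not the patent's code: it takes the upper half of the positive-example words by weight, then keeps those whose positive weight exceeds twice their negative weight, treating a word absent from the negative examples as having weight 0. The function name, the rounding of "upper half" for odd counts, and the sample weights are all hypothetical.

```python
def select_feature_words(w_pos, w_neg, ratio=2.0):
    """Keep top-half positive words whose weight exceeds `ratio` times their negative weight."""
    ranked = sorted(w_pos, key=w_pos.get, reverse=True)
    top_half = ranked[: (len(ranked) + 1) // 2]  # assumption: round up for odd counts
    return [t for t in top_half if w_pos[t] > ratio * w_neg.get(t, 0.0)]

# Hypothetical weights: "invention" and "patent" score high in both sets.
w_pos = {"recycle": 0.60, "reuse": 0.55, "invention": 0.50, "patent": 0.45,
         "bolt": 0.05, "door": 0.03, "lid": 0.02, "panel": 0.01}
w_neg = {"invention": 0.48, "patent": 0.44, "recycle": 0.10, "bolt": 0.04}

features = select_feature_words(w_pos, w_neg)
print(features)
```

In this toy data the two general patent words are dropped because their negative weights are nearly as large as their positive ones, while the two topic words pass the 2x test.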

Simply using the Tf*Idf statistic, which is common in search systems, cannot prevent common words from being selected as features. Not selecting words that appear in patent documents in general is one of the characteristics of the present invention. Tf*Idf tends to extract expressions specific to an individual text, so feature words that are spread across the positive examples as a whole are rated low and may not be extracted. It also tends to extract unusual words that are repeated many times in a small set of texts. In contrast, the weight used in the present invention extracts, as feature words, words that are biased toward the positive examples as a whole, which makes it well suited to patent extraction.

The present invention is described in more detail below based on examples, but it is not limited to these examples.

As shown in FIG. 3, one embodiment of the present invention consists of: a search input unit for entering search queries; a patent database that holds the patents to be searched; a related patent database that stores already-reviewed sample patents; an unrelated patent database that stores already-reviewed sample patents not belonging to the technical field the user wants; a patent identification unit that identifies patent document data belonging to the technical field from the bias of the words appearing in the patent data stored in the related and unrelated patent databases; a search result database that stores the patent document data that the patent identification unit has judged to be patents in the technical field the user wants; and a display unit that displays the patent document data saved in the search result database. FIG. 4 illustrates the processing procedure and concept of the example.

The search input unit in FIG. 3 is where the user enters search queries and searches patent documents in order to collect samples of related and unrelated patents.

The related patent database in FIG. 3 stores a small number of samples from the group of patents the user wants. This corresponds to STEP 1 in FIG. 1. The stored data may be in any format as long as it includes the claims and the specification, for example text, PDF (Portable Document Format), or XML (Extensible Markup Language) based on a given specification. FIG. 5 shows an example.

The unrelated patent database in FIG. 3 stores a small number of samples from the group of patents unrelated to the technical field the user wants. The stored data may be in any format as long as it includes the claims and the specification, for example text, PDF, or XML based on a given specification. FIG. 5 shows an example.

The patent identification unit in FIG. 3 first applies STEP 2 of FIG. 1 to the patent data stored in the related patent database and the unrelated patent database, which hold already-reviewed sample patents, and parses the text into words. Next, it computes the bias of the words appearing in each database, as calculated in STEP 3, STEP 4, and STEP 5 of FIG. 1 (corresponding to the result of STEP 3 in FIG. 4), and selects the feature words. It then performs machine learning with the feature words as features, as in STEP 6 of FIG. 1, and extracts related patents using the resulting classifier. For example, as shown in FIG. 4, machine learning is performed with an SVM, and the classifier it produces identifies and extracts the related patents.

The flow of the present invention is described below with reference to FIG. 1 and FIG. 4, using the results of applying it to "extracting patents related to recycling from among the technologies in the refrigerator manufacturing process".

First, through the search input unit, the user enters queries such as search keywords or IPC classifications on a computer, and searches for and collects sample patents from the patent database in FIG. 3. This corresponds to STEP 1 in FIG. 1. The patent documents judged to be positive examples are stored in the related patent database in FIG. 3, and those judged to be negative examples are stored in the unrelated patent database in FIG. 3. Both databases reside on external storage devices connected to the terminal computer either over a network or directly. The stored data may be in any format, such as PDF, as long as it can represent a patent document. In this example, the positive examples are "patents related to recycling from among the technologies in the refrigerator manufacturing process" and, as shown in FIG. 4, "patents on recycling equipment", for instance "apparatus and method patents for manufacturing refrigerator doors that are easy to recycle". The negative examples are patents unrelated to recycling, such as "apparatus patents for keeping food fresh in refrigerators" and, as shown in FIG. 4, "patents on deodorizing equipment". The sample patents, that is, the training data, were collected from patent applications published in 2003 and 2006, 76 positive and 76 negative.

Next, the patent documents judged to be positive examples are retrieved from the related patent database and the unrelated patent database, and morphological analysis is performed. This corresponds to STEP 2 in FIG. 1 and splits each patent document into word units. Specifically, as illustrated in FIG. 4, when the sentence 「本発明は、冷蔵庫のリサイクルに関連するものである。」 ("The present invention relates to the recycling of refrigerators.") appears in a patent document, it is split into 「本発明／は／、／冷蔵庫／の／リサイクル／に／関連／する／もの／で／ある。」, where ／ marks a word boundary. Morphological analysis is performed by the central processing unit in FIG. 3, and the results are stored in the patent identification unit in the main memory in FIG. 3.

From the words produced by morphological analysis, the words characteristic of the positive examples are selected. This corresponds to STEP 3, STEP 4, and STEP 5 in FIG. 1. Feature words are selected using Equation 6. In this example, as shown in FIG. 4, words such as {"recycle" (リサイクル), "disposal" (廃棄), "reuse" (リユース), ...} were selected as feature words. Feature word selection is performed by the central processing unit in FIG. 3, and the results are stored in the patent identification unit in the main memory in FIG. 3.

Feature vectors are generated from the occurrence frequencies of the feature words, with the feature words serving as the basis of the feature vector, that is, of the vector used for learning. For example, if in a given patent document "recycle" appears three times, "disposal" once, "reuse" once, and the other feature words zero times, its feature vector is (3, 1, 1, 0, ..., 0). Feature vector generation is performed by the central processing unit in FIG. 3, and the results are stored in the patent identification unit in the main memory in FIG. 3.
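The construction of the feature vector in the example above can be sketched in a few lines. The tokenized document and the fourth feature word ("crush") are hypothetical; only the counts for the first three words follow the example in the text.

```python
def feature_vector(tokens, feature_words):
    """Count how often each feature word occurs in one tokenized document."""
    return [tokens.count(word) for word in feature_words]

# The order of the feature words fixes the order of the vector components.
feature_words = ["recycle", "disposal", "reuse", "crush"]
document = ["recycle", "device", "disposal", "recycle", "reuse", "recycle"]

vector = feature_vector(document, feature_words)
print(vector)
```

Words that are not feature words, such as "device" here, do not contribute to the vector at all.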

The time needed to compute the entropy in the statistic calculation depends on the computer's hardware configuration, but with ten years of published patent gazettes it is on the order of a few minutes, which is fast enough for searching large volumes of patent document data.

Then, as in STEP 6 of FIG. 1, machine learning is performed using the feature vectors. As illustrated in FIG. 4, an SVM is used as the machine learning method here. An SVM guards against overfitting through a technique called margin maximization, which makes it well suited to binary classification. Machine learning with the SVM is performed by the central processing unit, and the results are stored in the main memory.
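The source does not say which SVM implementation was used, so the following is only a stand-in for the learning step: a plain-Python linear classifier trained by subgradient descent on the L2-regularized hinge loss, which is the objective a linear SVM minimizes. The learning rate, regularization strength, epoch count, and the toy feature vectors are all hypothetical.

```python
def train_linear_svm(samples, labels, lr=0.1, lam=0.01, epochs=200):
    """Minimize L2-regularized hinge loss with per-sample subgradient steps."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularization step: shrink the weights toward zero.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                # Hinge-loss step: move the boundary away from the violating sample.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Return +1 (related) or -1 (unrelated) from the sign of the linear score."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Toy feature vectors (counts of three feature words); labels: +1 related, -1 unrelated.
X = [[3, 1, 1], [2, 0, 1], [0, 0, 0], [0, 1, 0]]
y = [1, 1, -1, -1]

w, b = train_linear_svm(X, y)
train_predictions = [predict(w, b, x) for x in X]
print(train_predictions)
```

In practice an established SVM library would normally replace this loop. The sketch only shows how the count vectors feed a margin-based binary classifier.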

After machine learning, as shown in FIG. 4, the patents that the trained classifier judges to be positive examples are stored in the search result database. The stored results are displayed on the search result display unit, for example as a list of application numbers. The search result database resides on an external storage device connected to the terminal computer either over a network or directly. The stored data may be in any format, such as PDF, as long as it can represent a patent document.

The search and extraction performance of the example was evaluated. The evaluation used 75 patents (31 positive examples and 44 negative examples). An SVM with a linear kernel was used as the machine learning method.

The precision in Table 1 is the number of examples that the system judged to be positive and that were truly positive, divided by the number of examples that the system judged to be positive. It indicates how accurate the extracted positive examples are.

The recall in Table 1 is the number of examples that the system judged to be positive and that were truly positive, divided by the total number of positive examples. It indicates how completely the positive examples are covered.
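The two definitions above translate directly into code. In this minimal sketch the patent identifiers are made up and the numbers are not those of Table 1, which is not reproduced in this text.

```python
def precision_recall(predicted_positive, actual_positive):
    """Compute precision and recall from two sets of document identifiers."""
    true_positive = len(predicted_positive & actual_positive)
    precision = true_positive / len(predicted_positive) if predicted_positive else 0.0
    recall = true_positive / len(actual_positive) if actual_positive else 0.0
    return precision, recall

# Hypothetical identifiers: the system returns 4 documents, 3 of them truly relevant,
# while 5 documents are relevant in total.
predicted = {"A", "B", "C", "D"}
actual = {"A", "B", "C", "E", "F"}

precision, recall = precision_recall(predicted, actual)
print(precision, recall)
```

Returning more documents can only keep recall the same or raise it, but usually lowers precision, which is why the two are reported together.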

These results show that only about 10% of the extracted patents are irrelevant and that extraction is accurate. This means that subsequent screening is largely unnecessary. In other words, when the results are used as rough statistical data, the time spent on screening can be reduced. Specifically, collecting the sample patents took one week because of keyword tuning and close reading of patent documents, whereas with the present invention patent document collection finishes in a few minutes, depending on the computer used.

In this way, extraction from patent data for 1993 to 2003 yielded 160 matching patents. By comparison, tuning a keyword search expression by trial and error (which corresponds to the knowledge of Patent Document 2) retrieved 231 patents. Judging from precision and recall, about 70 of the patents retrieved by keyword tuning (about 30% of the total) are estimated to be irrelevant, whereas with the present invention only about 10 patent documents are irrelevant (retrieval errors).

The results show that related patents are extracted automatically with good precision and recall even from a very small number of samples. In other words, the patent documents that the user wants are extracted mechanically, without great effort.

The search result database shown in FIG. 3 stores the patent data judged to be patents in the technical field desired by the user. The stored data can also be reused as training data. An example is shown in FIG. 5.

The search result display unit shown in FIG. 3 displays the patent data stored in the search result database. For example, a browser can be used to present the results in an easily understood form.

Example 2 is described below. Specifically, the embodiment shown in FIG. 3 is applied to the task of “extracting, from resin technologies, patents related to environmentally conscious resins”.

This task was selected as Example 2 for the following reason. The task selected in Example 1, “extracting, from technologies in the refrigerator manufacturing process, patents related to recycling”, is biased mainly toward the mechanical, electrical, and electronic fields. Example 2 is intended to demonstrate that the present invention also gives good results for patent documents in the chemical and materials fields, which fall outside those fields. If the results are good in the chemical and materials fields as well as in the mechanical, electrical, and electronic fields, this suggests that the present invention is highly effective for automatic patent search across a wide range of technical fields.

Example 2 proceeds as follows. First, using a computer, the user enters a query such as search keywords or an IPC classification through the search input unit, and searches for and collects sample patents from the patent database in FIG. 3. This process corresponds to STEP 1 in FIG. 1. As a result, patent documents judged to be positive examples are stored in the related patent database in FIG. 3, and patent documents judged to be negative examples are stored in the unrelated patent database in FIG. 3. The related patent database and the unrelated patent database each consist of an external storage device (storage or the like) connected to the terminal computer either via a network or directly. The stored data may be in any format, PDF or otherwise, as long as it can represent a patent document. In this example, “patents related to environmentally conscious resins among resin technologies” are the positive examples; a “patent for a biodegradable resin and its manufacturing method” is one such case. Negative examples are patents unrelated to environmentally conscious resins, such as a “patent for improving resin yield”. The sample patents, that is, the training data, consist of 70 positive and 70 negative examples collected from patent applications published in 2003 and 2006.

Next, the patent documents judged to be positive examples are extracted from the related patent database and the unrelated patent database, and morphological analysis is performed on them. This process corresponds to STEP 2 in FIG. 1. As a result, each patent document is divided into words. For example, if the expression 「本発明は、生分解樹脂に関連するものである。」 (“The present invention relates to a biodegradable resin.”) appears in a patent document, it is divided into words as 「本発明／は／、／生分解／樹脂／に／関連／する／もの／で／ある。」, in the same way as the STEP 2 example in FIG. 4, 「本発明／は／、／冷蔵庫／の／リサイクル／に／関連／する／もの／で／ある。」. Here, ／ denotes a word boundary. The morphological analysis is performed by the central processing unit in FIG. 3, and the analysis result is stored in the patent identification unit in the main storage device in FIG. 3.
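The word-splitting step can be illustrated with a toy segmenter. This is only a sketch: the text does not say which morphological analyzer is used, so a greedy longest-match lookup against a small hand-made dictionary stands in for real morphological analysis here. The function `segment` and the set `DICTIONARY` are hypothetical names introduced for illustration.

```python
def segment(text, dictionary):
    """Split text into words by greedy longest match against a dictionary.

    Characters that match no dictionary entry are emitted one at a time.
    """
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary or length == 1:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy dictionary covering only the example sentence.
DICTIONARY = {"本発明", "は", "、", "生分解", "樹脂", "に", "関連", "する", "もの", "で", "ある", "。"}

sentence = "本発明は、生分解樹脂に関連するものである。"
tokens = segment(sentence, DICTIONARY)
print("／".join(tokens))
```

A production system would replace `segment` with a dictionary-backed morphological analyzer, which also handles inflection and unknown words; the rest of the procedure only needs the resulting list of words.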

From the words obtained by the morphological analysis, words characteristic of the positive examples are selected. Feature words are selected using Equation 6 above. This process corresponds to STEP 3, STEP 4, and STEP 5 in FIG. 1. In FIG. 4, where refrigerator recycling is the extraction target, the feature words were {“recycling” (リサイクル), “disposal” (廃棄), “reuse” (リユース), ...}; in this example, words such as {“recycling” (リサイクル), “biodegradation” (生分解), “environment” (環境), ...} were selected as feature words. The feature word selection is performed by the central processing unit in FIG. 3, and the result is stored in the patent identification unit in the main storage device in FIG. 3.
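The selection step can be sketched as follows. The claims state that a weight is obtained for each word from appearance frequency and entropy, and that a word becomes a feature word when its weight in the positive examples exceeds twice its weight in the negative examples. The exact weight formula is not reproduced in this part of the document, so the form used below (relative frequency multiplied by the entropy of the word's distribution over documents) is an assumption, as are the names `weights` and `select_feature_words` and the toy documents.

```python
import math
from collections import Counter


def weights(documents):
    """Weight every word in a document set by relative frequency times entropy.

    Assumed form (not taken from the source): weight = P(w) * H(w), where P(w) is
    the share of all tokens that are w, and H(w) is the entropy of w's
    distribution over the documents in the set.
    """
    per_doc = [Counter(doc) for doc in documents]
    total = Counter()
    for counts in per_doc:
        total.update(counts)
    n_tokens = sum(total.values())
    result = {}
    for word, freq in total.items():
        entropy = 0.0
        for counts in per_doc:
            if counts[word]:
                p = counts[word] / freq
                entropy -= p * math.log2(p)
        result[word] = (freq / n_tokens) * entropy
    return result


def select_feature_words(positive_docs, negative_docs):
    """Keep words whose positive-set weight exceeds twice their negative-set weight."""
    pos = weights(positive_docs)
    neg = weights(negative_docs)
    return sorted(w for w, value in pos.items() if value > 2 * neg.get(w, 0.0))


# Hypothetical, already-segmented toy documents.
positive_docs = [
    ["生分解", "樹脂", "環境"],
    ["生分解", "樹脂", "リサイクル"],
    ["環境", "樹脂", "リサイクル"],
]
negative_docs = [
    ["樹脂", "収率", "向上"],
    ["樹脂", "収率", "触媒"],
    ["樹脂", "触媒", "向上"],
]
feature_words = select_feature_words(positive_docs, negative_docs)
print(feature_words)
```

In this toy data, 樹脂 (resin) is equally common in both sets and is therefore rejected by the factor-of-two condition, while the words that occur only in the positive set are kept. Under the assumed weight, a word confined to a single document has zero entropy and is never selected.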

Next, feature vectors are generated. The feature words serve as the basis of the feature vector, that is, of the vector used for learning, and each component is the appearance frequency of the corresponding feature word. The feature vector generation is performed by the central processing unit in FIG. 3, and the result is stored in the patent identification unit in the main storage device in FIG. 3.
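A minimal sketch of this step is shown below, assuming each document has already been split into words. The function name `feature_vector`, the feature word list, and the sample document are illustrative assumptions.

```python
from collections import Counter


def feature_vector(tokens, feature_words):
    """Return the occurrence count of each feature word, in feature_words order."""
    counts = Counter(tokens)
    return [counts[word] for word in feature_words]


# Hypothetical feature words and a hypothetical segmented document.
feature_words = ["リサイクル", "生分解", "環境"]
document = ["生分解", "樹脂", "は", "環境", "に", "よい", "生分解", "樹脂"]
vector = feature_vector(document, feature_words)
print(vector)
```

Words that are not feature words, such as 樹脂 here, contribute nothing to the vector, so every document is mapped to a vector of the same length regardless of its size.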

After the feature vectors are generated, machine learning is performed using them, as shown in STEP 6 of FIG. 1. As illustrated in FIG. 4, a support vector machine (SVM) is used as the machine learning method. An SVM suppresses overfitting through a technique called margin maximization, which makes it the most suitable method for binary classification. The machine learning by the SVM is performed by the central processing unit in FIG. 3, and the result is stored in the patent identification unit in the main storage device in FIG. 3.
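The learning step can be illustrated with a small linear classifier. The text specifies an SVM, and the evaluation reports a linear kernel, but it does not name a solver, so the sketch below swaps in plain sub-gradient descent on the hinge loss with an L2 penalty as a stand-in for a real SVM library. The names `train_linear_svm` and `predict`, the hyperparameter values, and the two-dimensional toy vectors are all assumptions for illustration.

```python
def train_linear_svm(vectors, labels, epochs=200, learning_rate=0.01, lam=0.01):
    """Fit a linear SVM (hinge loss + L2 penalty) by sub-gradient descent.

    vectors: list of equal-length numeric lists; labels: +1 (positive) or -1 (negative).
    Returns (weights, bias).
    """
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The L2 penalty shrinks the weights slightly on every step,
            # which is what pushes the solution toward a wide margin.
            w = [wi * (1.0 - learning_rate * lam) for wi in w]
            if margin < 1.0:
                # The example lies inside the margin, so move the boundary away from it.
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
    return w, b


def predict(w, b, x):
    """Return +1 (related patent) or -1 (unrelated patent) for feature vector x."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical feature vectors: counts of two feature words per document.
train_vectors = [[2, 0], [0, 2], [3, 0], [0, 3]]
train_labels = [1, -1, 1, -1]
w, b = train_linear_svm(train_vectors, train_labels)
print(predict(w, b, [4, 1]), predict(w, b, [1, 4]))
```

Documents classified as +1 would then be the ones passed on to the search result database. For real data, an established SVM implementation is preferable to this hand-written loop.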

Patents judged to be positive examples as a result of the learning are stored in the search result database in FIG. 3. The stored results are displayed on the search result display unit, for example as a list of application numbers. The search result database consists of an external storage device (storage or the like) connected to the terminal computer either via a network or directly. The stored data may be in any format, PDF or otherwise, as long as it can represent a patent document.

The search and extraction performance of Example 2 was evaluated. The evaluation used 100 patents (50 positive examples and 50 negative examples). A linear kernel was used for the SVM.

Comparing Table 2, which gives the above results, with Table 1, which gives the results of Example 1, shows that patent search can be performed with almost the same accuracy as in Example 1. The present invention is therefore not limited to Examples 1 and 2 and can be applied to automatic patent search in a wide range of technical fields.

FIG. 1 is a flowchart of the patent identification unit according to the present invention.

FIG. 2 is a graph showing the basis for using the condition of Equation 6 in the present invention.

FIG. 3 is a diagram showing the configuration of the automatic patent document search system according to an embodiment of the present invention.

FIG. 4 is a diagram showing the procedure and concept according to an embodiment of the present invention.

FIG. 5 shows an example of a database in the present invention.

Claims

An automatic search system for patent document data, comprising:
a related patent database that stores, as samples, already-searched patents belonging to the technical field related to the search output;
an unrelated patent database that stores, as samples, already-searched patents belonging outside the technical field related to the search output;
a morphological analysis unit that performs morphological analysis on the patent document data stored in the related patent database and the unrelated patent database;
a patent identification unit that identifies patent documents belonging to the technical field related to the search output by machine learning using feature words selected, for the morphologically analyzed data, by means of weights calculated from appearance frequency and entropy; and
an extraction unit that extracts, from the identification results of the patent identification unit, all patent document data belonging to the technical field related to the search output.

The automatic patent document search system according to claim 1, wherein the patent identification unit obtains, for every word appearing in the positive examples and the negative examples, a weight from appearance frequency and entropy,
and, if the weight of a word in the positive examples is larger than twice the weight of that word in the negative examples, uses that word as a feature word for machine learning.

An automatic search method for patent document data, comprising:
means for storing, as samples in a related patent database, already-searched patents belonging to the technical field related to the search output;
means for storing, as samples in an unrelated patent database, already-searched patents belonging outside the technical field related to the search output;
means for performing morphological analysis on the patent document data stored in the related patent database and the unrelated patent database;
means for identifying patent documents belonging to the technical field related to the search output by machine learning using feature words selected, for the morphologically analyzed data, by means of weights calculated from appearance frequency and entropy; and
means for extracting, from the results of the identifying means, all patent document data belonging to the technical field related to the search output.

The automatic patent document search method according to claim 3, wherein the means for identifying patent documents obtains, for every word appearing in the positive examples and the negative examples, a weight from appearance frequency and entropy,
and, if the weight of a word in the positive examples is larger than twice the weight of that word in the negative examples, uses that word as a feature word for machine learning.