JP2003223465A

JP2003223465A - Patent document retrieval method

Info

Publication number: JP2003223465A
Application number: JP2002023650A
Authority: JP
Inventors: Yoichi Nakatani; 洋一中谷; Kotaro Takada; 広太郎高田; Michihiro Isoda; 道弘磯田
Original assignee: NTT Data Technology Corp
Current assignee: NTT Data Technology Corp
Priority date: 2002-01-31
Filing date: 2002-01-31
Publication date: 2003-08-08

Abstract

PROBLEM TO BE SOLVED: To provide a patent document retrieval system or patent document retrieval method that organizes the synonym dictionary and homonym dictionary essential to retrieval keyed on natural language, thereby reducing omissions and noise.

SOLUTION: The index terms of each document, stored together with field information in a text index file 10, have their synonyms unified by the field-specific synonym dictionary 11 of the field to which the document belongs, and the occurrence frequencies of the index terms for each field and document are stored in an index term frequency file 12. A clustering/ranking computation unit 14 receives the answer set of a classification search stored in a search set storage unit 8, reads the index term frequency data of each hit document from the index term frequency file 12, and, according to the user's instruction, either clusters the hit documents together with a seed document or ranks them using the seed document as the index.

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Technical field of the invention] The present invention relates to a patent document search system in which the search set obtained as a search result can be displayed in ranked or clustered form.

[0002]

[Prior art] In a full-scale search of patent documents, classification search keys are regarded as indispensable. Because a classification search key retrieves patent documents by matching concepts, independent of any particular language, it avoids the noise and omissions caused by language-specific problems in information retrieval, namely spelling variation, synonyms, and homonyms. Classification search is therefore considered to allow high-precision retrieval with little noise and few omissions.

[0003] On the other hand, in classification-based search, trends in patent applications shift as technology changes, so existing patent classifications must be subdivided and the classification scheme reformed. This brings the burden of reassigning new classifications to documents that have already been classified. For this reason, keeping the narrowing-down achieved by classification at an appropriate volume is considered to have both economic and physical limits.

[0004] In view of this situation, a hybrid search method has been proposed, as seen in Japanese Patent Application No. 2001-357469 (hereinafter, the "prior application"), that exploits the strengths of classification search while also using the high cost-efficiency of natural-language search keys.

[0005] In the prior application, as shown in FIG. 1, each document in the answer set produced by an F-term search (the patent document search system adopted by the Japan Patent Office) is compared for similarity with a seed document, which is the subject of the investigation and serves as the index, and the results are displayed in ranked or clustered form according to the degree of similarity. The seed document (theme text) is a free expression, in ordinary natural-language sentences, of the matter to be searched. In a prior-art search for the examination of a patent application or for an opposition, the seed document can be an explanatory portion extracted from the case under examination, its full text, or text composed from these.

[0006] A natural-language search key, unlike classification search, carries no economic burden of preparing a classification table in advance and assigning classifications. In exchange, it must cope with the free expression of each document's author, a problem that comes down to how synonyms and homonyms are organized.

[0007]

[Problems to be solved by the invention] A simple example of synonyms: what one patent specification calls the "head (ヘッド) and shaft (シャフト) of a golf club" is written in another specification as the "head portion (頭部) and handle portion (柄部) of a golf club", or as the "head portion and handle portion of a game implement (遊技具)", that is, with different notations (natural-language words).

[0008] Likewise, patent specifications often use words that are written identically but mean different things, that is, homonyms. Examples are the "head and shaft of a golf club", the "(cylinder) head and (crank) shaft of an engine", and the "(seek) head of a hard disk and the shaft on which the seek head slides". In other words, the same notations "shaft" and "head" refer to different things once the technical field changes. Technical terminology in particular contains a great many such homonyms.

[0009] In a search keyed on natural language, the documents being searched contain these synonyms and homonyms as they stand, so synonyms cause omissions and homonyms cause noise. As noted above, organizing a synonym dictionary and a homonym dictionary is indispensable for preventing omissions and noise in natural-language search. However, even if "head" (ヘッド) and "shaft" (シャフト) are treated as synonyms of "head portion" (頭部) and "handle portion" (柄部), respectively, that equivalence holds only in the golf club field and cannot be applied to the engine or hard disk fields. Synonyms must therefore be organized field by field. That is, omissions caused by synonyms cannot be eliminated unless a synonym dictionary is created for each field and used to unify the synonyms in the documents belonging to that field.
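The field-by-field unification of synonyms described here can be sketched as follows. This is a minimal illustration in Python, not the implementation of the invention: the dictionary layout, the function name `normalize`, and the field label "engine" are assumptions, and English glosses stand in for the Japanese words.

```python
# Hypothetical field-specific synonym dictionaries, keyed by a field label.
# Each inner dictionary maps a variant notation to its canonical term.
FIELD_SYNONYMS = {
    "A63B53/02": {                    # golf club field (example from the text)
        "head portion": "head",
        "handle portion": "shaft",
        "game implement": "golf club",
    },
    "engine": {},                     # assumed: no such equivalences in this field
}


def normalize(words, field):
    """Replace each word with its canonical form in the given field's dictionary."""
    mapping = FIELD_SYNONYMS.get(field, {})
    return [mapping.get(word, word) for word in words]


golf = normalize(["game implement", "head portion", "handle portion"], "A63B53/02")
engine = normalize(["head portion", "shaft"], "engine")
print(golf)    # ['golf club', 'head', 'shaft']
print(engine)  # ['head portion', 'shaft']
```

The same word list is left unchanged in the engine field, which is the point of keeping one dictionary per field rather than a single global one.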

[0010] As for homonyms, as the examples above show, most homonyms among technical terms occur between widely separated technical fields. Within a main group or subgroup of the International Patent Classification (IPC) currently in use (hereinafter collectively called a "group"), or within a single F-term theme, the documents form one technical field into which homonyms hardly enter, so countermeasures against homonyms are practically unnecessary there. If a field-specific synonym dictionary is created with an IPC group or F-term theme as the unit, and the words appearing in the patent documents of that field are unified with it, ranking or clustering becomes more accurate than in the prior application.

[0011] Accordingly, an object of the present invention is to provide a patent document search system or patent document search method that enables highly accurate ranking and clustering in a patent document search that displays the answer set of a classification search in ranked or clustered form.

[0012]

[Means for solving the problems] The above problem is solved by the following means. The first invention is a patent document search method for a hybrid search in which an answer set produced by a classification search is compared for similarity against a seed document under investigation, used as the index, and is displayed in ranked or clustered form according to the similarity. The method is characterized in that synonyms are unified, by referring to a synonym dictionary dedicated to the relevant field, for all words of the documents forming the answer set, after which the similarity comparison is performed and ranking or clustering is carried out according to the similarity.

[0013] The second invention is a patent document search method for a hybrid search in which an answer set produced by a classification search is compared for similarity against a seed document under investigation, used as the index, and is displayed in ranked or clustered form according to the similarity. The method is characterized in that synonyms are unified, by referring to a synonym dictionary dedicated to the relevant field, for all words of the documents forming the answer set; words that no longer indicate any distinguishing feature within that technical field are removed from the unified group of words; and the similarity comparison is then performed and ranking or clustering is carried out according to the similarity.

[0014] The third invention is a patent document search method for a hybrid search in which an answer set produced by a classification search is compared for similarity against a seed document under investigation, used as the index, and is displayed in ranked or clustered form according to the similarity. The method is characterized in that synonyms are unified, by referring to a synonym dictionary dedicated to the relevant field, for the index terms that characterize each document forming the answer set, after which the similarity comparison is performed and ranking or clustering is carried out according to the similarity.

[0015] The fourth invention is a patent document search method for a hybrid search in which an answer set produced by a classification search is compared for similarity against a seed document under investigation, used as the index, and is displayed in ranked or clustered form according to the similarity. The method is characterized in that synonyms are unified, by referring to a synonym dictionary dedicated to the relevant field, for the index terms that characterize each document forming the answer set; index terms that no longer indicate any distinguishing feature within that technical field are removed from the unified group of index terms; and the similarity comparison is then performed and ranking or clustering is carried out according to the similarity.

[0016] The fifth invention is a patent document search system characterized by comprising: means for searching patent documents by classification; means for unifying synonyms, by referring to a synonym dictionary dedicated to the relevant field, for all words of the patent documents forming the answer set of the search; means for comparing similarity using a seed document under investigation as the index; means for ranking or clustering according to the similarity; and means for displaying the ranked or clustered patent documents, or those patent documents together with the seed document.

[0017] The sixth invention is a patent document search system characterized by comprising: means for searching patent documents by classification; means for unifying synonyms, by referring to a synonym dictionary dedicated to the relevant field, for all words of the patent documents forming the answer set of the search; means for removing, from the unified group of words, words that no longer indicate any distinguishing feature within that technical field; means for comparing similarity using a seed document under investigation as the index; means for ranking or clustering according to the similarity; and means for displaying the ranked or clustered patent documents, or those patent documents together with the seed document.

[0018] The seventh invention is a patent document search system characterized by comprising: means for searching patent documents by classification; means for unifying synonyms, by referring to a synonym dictionary dedicated to the relevant field, for those words, among all words of each patent document forming the answer set of the search, that are index terms characterizing the patent document; means for comparing similarity using a seed document under investigation as the index; means for ranking or clustering according to the similarity; and means for displaying the ranked or clustered patent documents, or those patent documents together with the seed document.

[0019] The eighth invention is a patent document search system characterized by comprising: means for searching patent documents by classification; means for unifying synonyms, by referring to a synonym dictionary dedicated to the relevant field, for those words, among all words of each patent document forming the answer set of the search, that are index terms characterizing the patent document; means for removing, from the unified group of index terms, index terms that no longer indicate any distinguishing feature within that technical field; means for comparing similarity using a seed document under investigation as the index; means for ranking or clustering according to the similarity; and means for displaying the ranked or clustered patent documents, or those patent documents together with the seed document.

[0020]

[Embodiments of the invention] According to one invention disclosed in this specification, in a hybrid search in which each document of the answer set produced by a classification search is compared for similarity with a seed document under investigation (the case under examination, in a prior-art search for the examination of a patent application or for an opposition) and the results are displayed in ranked or clustered form according to the similarity, the accuracy of ranking or clustering is improved by unifying the index terms of the documents belonging to the field with reference to a synonym dictionary dedicated to the field in which the answer set was formed.

[0021] Some index terms in the answer set, once unified by the synonym dictionary, may turn out to appear in almost every document forming the answer set. In other words, a word may qualify as an index term when viewed across all technical fields and yet indicate no distinguishing feature within its own technical field. Such a word can hardly be said to characterize a document within the answer set, and its value as a word is considered extremely low. Words that appear in substantially all documents (all documents of the field) are preferably excluded before the similarity comparison. This accentuates the differences between documents and allows them to be distinguished more effectively.
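One simple way to exclude words that appear in substantially all documents is a document-frequency cutoff, sketched below. The function name `drop_ubiquitous_terms`, the 0.9 cutoff, and the sample documents are illustrative assumptions; the text does not state a threshold.

```python
from collections import Counter


def drop_ubiquitous_terms(docs, max_doc_ratio=0.9):
    """Remove terms that occur in more than max_doc_ratio of the documents.

    docs maps a document number to its list of (already unified) terms.
    The 0.9 cutoff is an assumption chosen for illustration.
    """
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))          # count each term once per document
    limit = max_doc_ratio * len(docs)
    ubiquitous = {term for term, df in doc_freq.items() if df > limit}
    return {doc_no: [t for t in terms if t not in ubiquitous]
            for doc_no, terms in docs.items()}


answer_set = {
    "D1": ["golf club", "head", "shaft", "adhesive"],
    "D2": ["golf club", "head", "shaft", "screw"],
    "D3": ["golf club", "head", "shaft", "sleeve"],
}
filtered = drop_ubiquitous_terms(answer_set)
print(filtered)  # {'D1': ['adhesive'], 'D2': ['screw'], 'D3': ['sleeve']}
```

After filtering, only the words that differ between the documents remain, so a later similarity comparison is driven by those differences.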

[0022] According to another invention disclosed in this specification, in a hybrid search in which the answer set produced by a classification search is compared for similarity against a theme text under investigation (the case under examination, in a prior-art search for the examination of a patent application or for an opposition) and displayed in ranked or clustered form, the index terms of the documents belonging to the field are first unified with the synonym dictionary of the field in which the answer set was formed, and only the words that characterize each document are then extracted again. That is, words that indicate no distinguishing feature within the technical field are removed from the unified group of index terms, and the words remaining after removal form a new group of index terms (field index terms). This improves the accuracy of ranking or clustering.

[0023] FIG. 2 is a schematic diagram of the system of the first embodiment, and the following description refers to it. Patent gazettes, laid-open patent publications, and the like 1 are entered through the input unit 2 into the patent document storage file 3 and stored there. The classification search index creation unit 4 extracts the entries relating to patent classification from the patent document storage file 3 for each document number (an application number, publication number, or the like is used) and creates a classification index for each document number. The classification indexes so created are stored in the classification index file 5 in inverted-file format. The input/output unit 6 corresponds to the input/output terminal of an ordinary computer system; in the search system it accepts search query expressions, seed documents, and the like, and displays the search results.

[0024] The computation unit 7 receives a query expression entered through the input/output unit 6, determines from the inverted file of the classification index file 5 the document numbers that carry the classifications matching the query, and outputs them to the search set storage unit 8. The search set storage unit 8 stores the search answer set and, in response to a document output request from the user at the input/output unit 6, outputs and displays the patent documents stored in the patent document storage file 3.
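The classification lookup through an inverted file can be sketched as follows. The document numbers, the function names, and the choice of an AND query are assumptions made for illustration; the text only states that the classification index is held in inverted-file format and queried by a query expression.

```python
from collections import defaultdict


def build_classification_index(doc_classes):
    """Invert {document number: [classification codes]} into {code: {document numbers}}."""
    index = defaultdict(set)
    for doc_no, codes in doc_classes.items():
        for code in codes:
            index[code].add(doc_no)
    return index


def search_all(index, codes):
    """Return the documents that carry every one of the queried codes (AND query)."""
    sets = [index.get(code, set()) for code in codes]
    return set.intersection(*sets) if sets else set()


# Hypothetical document numbers with illustrative IPC codes.
doc_classes = {
    "JP-0001": ["A63B53/02", "A63B53/04"],
    "JP-0002": ["A63B53/02"],
    "JP-0003": ["F02F1/24"],
}
index = build_classification_index(doc_classes)
hits = search_all(index, ["A63B53/02"])
print(sorted(hits))  # ['JP-0001', 'JP-0002']
```

The resulting set of document numbers plays the role of the search answer set that is later ranked or clustered.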

[0025] Meanwhile, the text search index creation unit 9 extracts the portions describing the invention from the patent gazettes, laid-open patent publications, and the like for each document number, and segments them into words with morphological analysis software. From the segmented words, the index terms of each document are extracted and their occurrence frequencies calculated by the TF-IDF method, and an index term index is created together with field information. The index term index is stored in the text index file 10 in inverted-file format.
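The index term extraction step can be illustrated with a small TF-IDF sketch. The text names the TF-IDF method but no particular variant, so the raw term frequency, the `log(N / df)` form of the inverse document frequency, and the `top_k` selection rule are assumptions. The input is assumed to be already segmented into words, standing in for the output of the morphological analysis.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights for {document number: [words]}.

    Uses raw term frequency and idf = log(N / df); this variant is an assumption.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))          # document frequency of each word
    weights = {}
    for doc_no, words in docs.items():
        tf = Counter(words)
        weights[doc_no] = {w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf}
    return weights


def index_terms(weights, top_k=2):
    """Pick the top_k highest-weighted words of each document as its index terms."""
    return {doc_no: sorted(w, key=w.get, reverse=True)[:top_k]
            for doc_no, w in weights.items()}


docs = {
    "D1": ["head", "shaft", "adhesive", "adhesive"],
    "D2": ["head", "shaft", "screw"],
}
weights = tf_idf(docs)
print(index_terms(weights, top_k=1))  # {'D1': ['adhesive'], 'D2': ['screw']}
```

Words shared by both documents receive weight zero here, so the selected index terms are the words that distinguish each document.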

[0026] Although not illustrated, to avoid complicating the drawing, in a text search the computation unit 7 receives from the input/output unit 6 a search query consisting of a combination of index terms and selects from the text index file 10 the document numbers containing the matching combination of index terms. The document numbers selected as the search result are stored in the search set storage unit 8, and, in response to a document output request from the user at the input/output unit 6, the patent documents stored in the patent document storage file 3 are output and displayed.

[0027] The index terms of each document, stored together with field information in the text index file 10, have their synonyms unified by the field-specific synonym dictionary 11 for the field to which the document belongs, and the occurrence frequency of each index term per field and document is stored as index term frequency information in the index term frequency file 12.

[0028] The index term frequency calculation unit 13 segments the seed document entered by the user through the input/output unit 6 into words, using the same morphological analysis software as the text search index creation unit 9, selects index terms and calculates their occurrence frequencies with reference to the index term frequency file 12, and sends the result to the clustering/ranking computation unit 14.

[0029] The clustering/ranking computation unit 14, for its part, receives the answer set (hit documents) of the classification search stored in the search set storage unit 8, reads the index term frequency data of each hit document from the index term frequency file 12, and, according to the user's instruction, either clusters the hit documents together with the seed document or ranks them using the seed document as the index.
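Ranking the hit documents against the seed document can be sketched with cosine similarity over index term frequency vectors. The text does not name a similarity measure, so cosine similarity is an assumption here, as are the function names and the sample frequencies.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(seed, hits):
    """Order hit documents by descending similarity to the seed document."""
    scored = [(cosine(seed, vec), doc_no) for doc_no, vec in hits.items()]
    return [doc_no for score, doc_no in sorted(scored, reverse=True)]


seed = {"head": 2, "shaft": 1, "sleeve": 3}
hits = {
    "D1": {"head": 1, "adhesive": 4},
    "D2": {"head": 2, "shaft": 1, "sleeve": 2},
    "D3": {"screw": 5},
}
ranking = rank(seed, hits)
print(ranking)  # ['D2', 'D1', 'D3']
```

The same pairwise similarity could also feed a clustering step; only the ranking case is shown.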

[0030] The clustering or ranking result is displayed on the input/output unit 6 for the user's use. If, on close inspection, the user finds among the hit documents a patent document (seed patent document) that is closer to the desired information than the seed document, the user can designate that seed patent document through the input/output unit 6. The clustering/ranking computation unit 14 then recalculates similarity using the seed patent document and displays the ranking. If the user wants clustering, clustering is performed again on the set excluding the seed document, and the result is displayed.

[0031] As described above, the seed document may be a text written to describe the matter to be searched. Alternatively, as in ordinary patent examination, opposition, or invalidation trial, when the application specification under examination is fixed and has been published as a gazette, it can be made the seed document by specifying its document number after the classification search.

[0032] FIG. 3 is a schematic diagram of the system of the second embodiment. Blocks identical to those of the embodiment in FIG. 2 bear the same reference numerals, and description of the overlapping parts is omitted. The field index term frequency calculation unit 15, added to the configuration of FIG. 2, reads the index term frequency data of each document in the document set consisting of the search answer set stored in the search set storage unit 8 and the seed document, from the index term frequency file 12 and the index term frequency calculation unit 13. It then applies the TF-IDF method again to select field index terms and calculate their occurrence frequencies, and sends the field index term frequency data to the clustering/ranking computation unit 14. The clustering/ranking computation unit 14 performs clustering or ranking using the field index term frequency data.
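The second TF-IDF pass over the answer set and the seed document can be sketched as follows. The function name `field_index_terms`, the `log(N / df)` weighting, and the rule of keeping only positively weighted terms are assumptions; the text states only that TF-IDF is applied again to select field index terms.

```python
import math


def field_index_terms(freqs):
    """Re-weight stored term frequencies with an IDF computed inside the set itself.

    freqs maps a document number (hit documents plus the seed) to {term: frequency}.
    Terms present in every document get idf = log(1) = 0 and are dropped.
    """
    n_docs = len(freqs)
    doc_freq = {}
    for vec in freqs.values():
        for term in vec:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    result = {}
    for doc_no, vec in freqs.items():
        weighted = {t: f * math.log(n_docs / doc_freq[t]) for t, f in vec.items()}
        result[doc_no] = {t: w for t, w in weighted.items() if w > 0}
    return result


freqs = {
    "seed": {"golf club": 3, "sleeve": 2},
    "D1": {"golf club": 5, "adhesive": 1},
    "D2": {"golf club": 2, "sleeve": 1},
}
field_terms = field_index_terms(freqs)
print({doc_no: sorted(vec) for doc_no, vec in field_terms.items()})
# {'seed': ['sleeve'], 'D1': ['adhesive'], 'D2': ['sleeve']}
```

Because the inverse document frequency is computed over the answer set rather than the whole collection, a word that is distinctive globally but universal within the field ("golf club" in this sample) drops out.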

[0033] Although the above description unifies synonyms for index terms, synonyms can also be unified with the synonym dictionary for all words of the documents forming the answer set.

[0034] FIG. 4 illustrates the procedure for creating a field-specific synonym-organizing patent dictionary. A patent document-word matrix, like the table in the upper part of the figure, is created with a subgroup or main group of the International Patent Classification (an IPC group) as the unit. The table is a hypothetical patent document-word matrix for subgroup A63B53/02 and shows the occurrence frequency of each word in each document. Among these words are synonym candidates that may safely be treated as synonyms within the field of A63B53/02 (joint structures between golf club heads and shafts). The field-specific synonym patent dictionary in the lower part unifies the synonym candidates by adding together the occurrence frequencies (counts) of the synonyms in this field.
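Adding together the occurrence frequencies of synonyms in a document-word matrix can be sketched as below. The matrix is represented as nested dictionaries, and the function name, sample counts, and English glosses are illustrative assumptions rather than the data of FIG. 4.

```python
from collections import Counter


def merge_synonym_counts(matrix, synonyms):
    """Fold synonym columns of a document-word count matrix into canonical terms.

    matrix maps a document number to {word: frequency}; synonyms maps a
    variant to its canonical term. Frequencies of merged words are summed.
    """
    merged = {}
    for doc_no, counts in matrix.items():
        row = Counter()
        for word, freq in counts.items():
            row[synonyms.get(word, word)] += freq
        merged[doc_no] = dict(row)
    return merged


matrix = {
    "D1": {"head": 3, "head portion": 2, "shaft": 1},
    "D2": {"handle portion": 4, "game implement": 1},
}
synonyms = {
    "head portion": "head",
    "handle portion": "shaft",
    "game implement": "golf club",
}
merged = merge_synonym_counts(matrix, synonyms)
print(merged)
# {'D1': {'head': 5, 'shaft': 1}, 'D2': {'shaft': 4, 'golf club': 1}}
```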

[0035] Unification targets frequently occurring words that appear concentrated in a specific small number of documents in the A63B53/02 field (such words are index terms that characterize documents within the A63B53/02 field, that is, field index terms). Specifically, field index terms are extracted by the TF-IDF method applied to the set of documents belonging to the A63B53/02 field.

[0036] Each extracted field index term is expanded into synonyms using a synonym dictionary or thesaurus, the matching synonym group is extracted from all words appearing in the A63B53/02 field, and the group is unified. Because there are patent-specific synonyms that are not found in a synonym dictionary or thesaurus, a visual checking step is added to complete the field-specific synonym-organizing dictionary.

[0037] As the field-specific synonym-organizing dictionary shows, unifying synonyms, for example "game implement" (遊技具) into "golf club" (ゴルフクラブ), "head portion" (頭部) into "head" (ヘッド), and "handle portion" (柄部) into "shaft" (シャフト), can produce words that appear in every document of the A63B53/02 field. Such words have extremely low value as index terms within the A63B53/02 field. Likewise, words that appear in the titles of A63B53/02 and its superordinate classifications have low value as words that characterize documents in the field.

[0038] Although the IPC subgroup is used as the unit here, synonyms can also be unified with a main group or a single F-term theme as the field unit. This poses no problem, because a main group or a single F-term theme also constitutes a homogeneous technical field.

[0039] FIG. 5 illustrates the patent dictionary, in which the words that characterize documents in the field (the field index words) are expanded beneath subgroup A63B53/02, together with the field-specific synonym dictionary finalized through the visual check step described above. In FIG. 5, words marked with * are those that, as noted above, appear in almost all documents in the A63B53/02 field.

【００４０】[0040]

[Effects of the Invention] In the patent document search method or patent document search system according to the present invention, a classification search by patent classification is performed first; the index words of the documents belonging to the field of the resulting answer set are then unified and organized by referring to the synonym dictionary dedicated to that field; and only after that is the similarity comparison performed to rank or cluster the documents. Patent document search with few omissions and little noise is therefore possible. Moreover, because the synonym dictionary is organized specifically for each technical field, omissions and noise are kept low.
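
The final step, ranking an answer set against a seed document after synonym unification, can be sketched as follows. Cosine similarity over word-count vectors is an assumption chosen for illustration, since this passage does not name the similarity measure; the names `cosine` and `rank_by_seed` and the sample data are likewise hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counter objects)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_seed(seed_words, answer_set, canonical):
    """Rank answer-set documents by similarity to the seed document.

    canonical maps synonym variants to the unified term of the field.
    """
    seed_vec = Counter(canonical.get(w, w) for w in seed_words)
    scored = []
    for doc_id, words in answer_set.items():
        vec = Counter(canonical.get(w, w) for w in words)
        scored.append((doc_id, cosine(seed_vec, vec)))
    # Most similar first; ties broken by document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical answer set returned by a classification search.
canonical = {"head portion": "head"}
answer_set = {
    "docA": ["titanium", "head", "hosel"],
    "docB": ["titanium", "shaft", "grip"],
    "docC": ["grip", "rubber"],
}
ranked = rank_by_seed(["titanium", "head portion", "hosel"], answer_set, canonical)
```

Because "head portion" in the seed is unified to "head" before comparison, docA matches on all three words; without unification it would match on only two.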

[Brief description of drawings]

FIG. 1 is a simplified diagram explaining the prior application.

FIG. 2 is a system overview diagram of the first embodiment.

FIG. 3 is a system overview diagram of the second embodiment.

FIG. 4 is a diagram illustrating the procedure for creating the field-specific synonym patent dictionary.

FIG. 5 illustrates the finalized field-specific synonym dictionary.

[Explanation of symbols]

1 Patent publication gazette
2 Input unit
3 Patent document storage file
4 Classification search index creation unit
5 Classification index file
6 Input/output unit
7 Arithmetic unit
8 Search set storage unit
9 Text search index creation unit
10 Text index file
11 Field-specific synonym dictionary
12 Index word occurrence frequency file
13 Index word occurrence frequency calculation unit
14 Clustering/ranking calculation unit
15 Field index word occurrence frequency calculation unit

Continued from front page: (72) Inventor Michihiro Isoda, c/o NTT Data Technology Corp., 2-2-12 Akasaka, Minato-ku, Tokyo. F-term (reference): 5B075 ND20 ND34 NK35 QM08 UU06 UU40

Claims

[Claims]

1. A patent document search method for a hybrid search in which an answer set created by a classification search is compared for similarity using a seed document under investigation as an index, and is displayed ranked or clustered according to that similarity, the method characterized in that synonyms are unified and organized for all words of the documents forming the answer set by referring to a synonym dictionary dedicated to the relevant field, after which the similarity comparison is performed and ranking or clustering is carried out according to the similarity.

2. A patent document search method for a hybrid search in which an answer set created by a classification search is compared for similarity using a seed document under investigation as an index, and is displayed ranked or clustered according to that similarity, the method characterized in that synonyms are unified and organized for all words of the documents forming the answer set by referring to a synonym dictionary dedicated to the relevant field, words that no longer exhibit distinctiveness within the technical field are removed from the unified group of words, and the similarity comparison is then performed and ranking or clustering is carried out according to the similarity.

3. A patent document search method for a hybrid search in which an answer set created by a classification search is compared for similarity using a seed document under investigation as an index, and is displayed ranked or clustered according to that similarity, the method characterized in that synonyms are unified and organized for the index words that characterize each document forming the answer set by referring to a synonym dictionary dedicated to the relevant field, after which the similarity comparison is performed and ranking or clustering is carried out according to the similarity.

4. A patent document search method for a hybrid search in which an answer set created by a classification search is compared for similarity using a seed document under investigation as an index, and is displayed ranked or clustered according to that similarity, the method characterized in that synonyms are unified and organized for the index words that characterize each document forming the answer set by referring to a synonym dictionary dedicated to the relevant field, index words that no longer exhibit distinctiveness within the technical field are removed from the unified group of index words, and the similarity comparison is then performed and ranking or clustering is carried out according to the similarity.

5. A patent document search system comprising: means for searching patent documents by classification; means for unifying and organizing synonyms for all words of the patent documents forming the answer set of the search by referring to a synonym dictionary dedicated to the relevant field; means for comparing similarity using a seed document under investigation as an index; means for ranking or clustering according to the similarity; and means for displaying the ranked or clustered patent documents, or the patent documents together with the seed document.

6. A patent document search system comprising: means for searching patent documents by classification; means for unifying and organizing synonyms for all words of the patent documents forming the answer set of the search by referring to a synonym dictionary dedicated to the relevant field; means for removing, from the unified group of words, words that no longer exhibit distinctiveness within the technical field; means for comparing similarity using a seed document under investigation as an index; means for ranking or clustering according to the similarity; and means for displaying the ranked or clustered patent documents, or the patent documents together with the seed document.

7. A patent document search system comprising: means for searching patent documents by classification; means for unifying and organizing synonyms for the index words that characterize each patent document forming the answer set of the search by referring to a synonym dictionary dedicated to the relevant field; means for comparing similarity using a seed document under investigation as an index; means for ranking or clustering according to the similarity; and means for displaying the ranked or clustered patent documents, or the patent documents together with the seed document.

8. A patent document search system comprising: means for searching patent documents by classification; means for unifying and organizing synonyms for the index words that characterize each patent document forming the answer set of the search by referring to a synonym dictionary dedicated to the relevant field; means for removing, from the unified group of index words, index words that no longer exhibit distinctiveness within the technical field; means for comparing similarity using a seed document under investigation as an index; means for ranking or clustering according to the similarity; and means for displaying the ranked or clustered patent documents, or the patent documents together with the seed document.