JP2020135246A

JP2020135246A - Document search apparatus, document search method, and document search program

Info

Publication number: JP2020135246A
Application number: JP2019026082A
Authority: JP
Inventors: Takashi Mikami; Makoto Onizuka
Original assignee: Osaka University NUC; AI Samurai Inc
Current assignee: Osaka University NUC; AI Samurai Inc
Priority date: 2019-02-15
Filing date: 2019-02-15
Publication date: 2020-08-31
Anticipated expiration: 2039-02-15
Also published as: JP6675742B1

Abstract

PROBLEM TO BE SOLVED: To provide a document search apparatus and the like that reduce search omissions with respect to a search condition. SOLUTION: A document search apparatus according to one embodiment of the present invention includes a search part that searches for a document corresponding to a received search condition, and further includes: a similarity calculation part that obtains the similarity between one document included in a set of documents and a plurality of other documents, different from the one document, included in the set of documents; and a document extension part that generates one extended document extending the one document by using similar documents, that is, those of the plurality of other documents whose similarity to the one document is a prescribed value or larger. The document extension part creates extended documents for all documents included in the set of documents on the basis of the similarities calculated by the similarity calculation part for all of those documents. The search part searches the extended documents of all the documents for an extended document corresponding to the search condition, and outputs the one document corresponding to the retrieved extended document as a search result. SELECTED DRAWING: Figure 1

Description

The present invention relates to a document retrieval device, a document retrieval method, and a document retrieval program. In particular, it relates to a document retrieval device and the like that use the similarity between documents in the document set to be searched to generate, in advance, an extended document set in which the text contained in one document is associated with the text of that document's similar documents, thereby reducing search omissions for a search condition.

In recent years, devices for evaluating inventions before filing have been developed. Patent Document 1 discloses a patent specification evaluation and drafting support device that stores information on intellectual property and supports the evaluation and drafting of application documents before filing. Evaluating an invention before filing usually requires search processing, such as a prior art search or a clearance search, in order to determine whether the substantive requirements for registration, such as novelty and inventive step, are satisfied and whether there is any infringement.

Patent Document 1: JP-A-2010-224984

Conventionally, one search technique expands the keywords in the search text entered as a search condition (search query) with synonyms, using a synonym dictionary or the like, and reconstructs the search query. However, it is difficult to build an appropriate dictionary for this technique, and the words added as synonyms can act as noise, so that documents unrelated to the search query are retrieved.

Another conventional approach is so-called concept search, which uses distributed or vector representations of words and retrieves similar documents by calculating similarity (distance) in a vector space that abstracts the meaning of words rather than their character strings. Concept search can be done either by expanding the words in the input search text with words that are similar in the vector space, or by converting documents into the vector space in advance and holding them as feature quantities. In either case the vectorization is generally performed by machine learning, and completely unrelated documents or words may receive a high similarity, which becomes noise. In addition, the meaning of a word often differs depending on the field, so a vector space must be constructed for each field to be searched.

Furthermore, there is a technique that classifies (clusters) documents by similarity based on the words they contain, and returns as search results the documents neighboring a document found by keyword matching. This technique requires the similarity between documents to be calculated in advance, but the way similarity is obtained at that time may differ from the way similarity is obtained at search time, so completely unrelated documents may be retrieved.

The present invention has been made in view of the above, and provides a document retrieval device and the like that reduce search omissions for a search condition.

A document retrieval device according to one embodiment of the present invention is a document retrieval device including a search unit that searches for a document corresponding to a received search condition, and further includes: a similarity calculation unit that obtains the similarity between one document included in a document set and a plurality of other documents, different from the one document, included in the document set; and a document extension unit that generates one extended document, which extends the one document, by using similar documents, that is, those of the plurality of other documents whose similarity to the one document is equal to or greater than a predetermined value. The document extension unit creates extended documents for all documents included in the document set, based on the similarities calculated by the similarity calculation unit for all of those documents. The search unit searches the extended documents of all the documents for an extended document corresponding to the search condition, and outputs the one document corresponding to the retrieved extended document as a search result.

In the document retrieval device according to one embodiment of the present invention, the similarity calculation unit may use the k-nearest neighbor method to classify the other documents whose similarity is equal to or greater than the predetermined value.

In the document retrieval device according to one embodiment of the present invention, the document extension unit may compare the graph structure of the one document with the graph structure of a similar document of the one document, and integrate graph elements of a similar document with a high degree of similarity into the one document to generate the extended document of the one document.

In the document retrieval device according to one embodiment of the present invention, the document extension unit may compare the one document with a similar document of the one document in terms of the graph elements contained in windows of a predetermined width, and integrate into the one document the graph elements of the similar document contained in windows with a high degree of matching.

In the document retrieval device according to one embodiment of the present invention, the search unit may accept search text as the search condition, search for an extended document whose degree of matching with the search text, measured in graph units, is equal to or greater than a predetermined value, and output the one document corresponding to the retrieved extended document as a search result.

In the document retrieval device according to one embodiment of the present invention, the search unit may accept search text as the search condition, search for an extended document containing a keyword included in the search text, and output the one document corresponding to the retrieved extended document as a search result.

In the document retrieval device according to one embodiment of the present invention, the one document may consist of a plurality of items, and the document extension unit may generate the extended document for predetermined items of the one document.

In the document retrieval device according to one embodiment of the present invention, the document set may be a set of patent documents having bibliographic information, and the similarity calculation unit may obtain the similarity between the one document and the plurality of other documents according to the degree of matching of the bibliographic information.

A document retrieval method according to one embodiment of the present invention, for searching for a document corresponding to a received search condition, is a method in which a computer executes: a similarity calculation step of obtaining the similarity between one document included in a document set and a plurality of other documents, different from the one document, included in the document set; and a document extension step of generating one extended document, which extends the one document, by using similar documents, that is, those of the plurality of other documents whose similarity to the one document is equal to or greater than a predetermined value. In the document extension step, extended documents are created for all documents included in the document set, based on the similarities calculated in the similarity calculation step for all of those documents. The computer further executes a search step of searching the extended documents of all the documents for an extended document corresponding to the search condition, and outputting the one document corresponding to the retrieved extended document as a search result.

A document retrieval program according to one embodiment of the present invention, for searching for a document corresponding to a received search condition, causes a computer to realize: a similarity calculation function of obtaining the similarity between one document included in a document set and a plurality of other documents, different from the one document, included in the document set; and a document extension function of generating one extended document, which extends the one document, by using similar documents, that is, those of the plurality of other documents whose similarity to the one document is equal to or greater than a predetermined value. The document extension function creates extended documents for all documents included in the document set, based on the similarities calculated by the similarity calculation function for all of those documents. The program further causes the computer to realize a document search function of searching the extended documents of all the documents for an extended document corresponding to the search condition, and outputting the one document corresponding to the retrieved extended document as a search result.

According to one embodiment of the present invention, it is possible to provide a document retrieval device and the like that reduce search omissions for a search condition.

FIG. 1 is an example of a functional block diagram of a document retrieval device according to one embodiment of the present invention.
FIG. 2 is an example of the hardware configuration of the document retrieval device (computer) according to one embodiment of the present invention.
FIG. 3 is an example of a document database.
FIG. 4 is an example of a document distance database.
FIG. 5(a) is an example of generating an extended document, and FIG. 5(b) is an example of an extended document database.
FIG. 6 is a flow diagram of search processing according to one embodiment of the present invention.
FIG. 7 is a schematic diagram of the configuration of an intellectual property creation support system according to one embodiment of the present invention.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

<Hardware configuration>
First, the hardware configuration of the document retrieval device (computer) 100 according to one embodiment of the present invention will be described with reference to FIG. 2. The document retrieval device 100 includes a processor 101, a memory 102, a storage 103, an input/output interface (I/F) 104, and a communication I/F 105, which cooperate to realize the functions and methods described in this embodiment. For example, the functions or methods of the present disclosure are realized by the processor 101 executing instructions included in a program loaded into the memory 102.

The processor 101 executes functions and/or methods realized by code or instructions included in a program stored in the storage 103. The processor 101 includes, for example, a central processing unit (CPU), an MPU (Micro Processing Unit), a GPU (Graphics Processing Unit), a microprocessor, a processor core, a multiprocessor, an ASIC (Application-Specific Integrated Circuit), or an FPGA (Field Programmable Gate Array), and the processes disclosed in each embodiment may be realized by logic circuits (hardware) or dedicated circuits formed in an integrated circuit (an IC (Integrated Circuit) chip or an LSI (Large Scale Integration)). These circuits may be realized by one or more integrated circuits, and a plurality of the processes described in each embodiment may be realized by a single integrated circuit. Depending on the degree of integration, an LSI may also be called a VLSI, a super LSI, an ultra LSI, or the like.

The memory 102 temporarily stores the program loaded from the storage 103 and provides a work area for the processor 101. Various data generated while the processor 101 is executing the program are also temporarily stored in the memory 102. The memory 102 includes, for example, a RAM (Random Access Memory) and a ROM (Read Only Memory).

The storage 103 stores the program. The storage 103 includes, for example, an HDD (Hard Disk Drive), an SSD (Solid State Drive), or a flash memory.

The communication I/F 105 is implemented as hardware such as a network adapter, as communication software, or as a combination of these, and transmits and receives various data via the network NET. The communication may be wired or wireless, and any communication protocol may be used as long as the parties can communicate with each other. The communication I/F 105 communicates via the network NET with other information processing devices, such as a user terminal. The communication I/F 105 transmits various data to other information processing devices according to instructions from the processor 101. The communication I/F 105 also receives various data transmitted from other information processing devices and passes them to the processor 101.

The input/output I/F 104 includes an input device for entering various operations on the document retrieval device 100 and an output device for outputting the results processed by the document retrieval device 100. In the input/output I/F 104, the input device and the output device may be integrated or may be separate. The input device is realized by any one, or a combination, of all types of devices that can accept input from a user and pass information on that input to the processor 101. The input device includes, for example, a touch panel, a touch display, hardware keys such as a keyboard, a pointing device such as a mouse, a camera (operation input via images), and a microphone (operation input by voice). The output device outputs the results processed by the processor 101. The output device includes, for example, a touch panel and a speaker.

<Functional configuration>
Next, the functional configuration of the document retrieval device 100 according to one embodiment of the present invention will be described with reference to FIG. 1. Note that not every functional unit shown in FIG. 1 is essential, and other functional units may be provided. The function or processing of each functional unit may also be realized, to the extent feasible, by machine learning or AI (Artificial Intelligence).

The document retrieval device 100 includes a similarity calculation unit 110, a document extension unit 120, a search unit 130, a document distance database (DB) 140, and an extended document database 150. The document distance database 140 and the extended document database 150 may be provided outside the document retrieval device 100, for example on the cloud, and be accessible via a network. According to one embodiment of the present invention, the document retrieval device 100 extends the document database 200 to be searched to generate the extended document database 150, and searches the extended document database 150 instead, which reduces search omissions.

The similarity calculation unit 110 obtains the similarity between one document included in the document set (document database 200) and a plurality of other documents, different from the one document, included in the document set. The similarity calculation unit 110 preferably calculates, for every document included in the document database 200, its similarity to the other documents. The similarity calculation unit 110 may classify the documents included in the document database 200 by using, for example, the k-nearest neighbor method. When the document database 200 is, for example, a patent office database that stores patent documents, the bibliographic information in the patent documents can be used to calculate the similarity (that is, the distance) between documents. For example, the similarity calculation unit 110 can use the degree of matching of inventors, or the degree of matching of patent classifications such as the IPC, as the similarity. The similarity calculation unit 110 may also calculate the similarity between one document and another document based on the number of keywords shared between the text of the one document and the text of the other document. The method of calculating the similarity is not limited to those described above, and existing clustering methods can be used.
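The similarity calculation described above can be sketched as follows. This is only an illustration under stated assumptions: the `PatentDoc` record, the Jaccard overlap, the weights, the threshold, and the sample inventors, IPC codes, and keywords are all hypothetical and are not specified in the source, which leaves the concrete similarity measure open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatentDoc:
    """Hypothetical record holding the fields used for similarity."""
    doc_id: str
    inventors: frozenset
    ipc_codes: frozenset
    keywords: frozenset


def jaccard(a, b):
    """Overlap of two sets in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(x, y, weights=(0.2, 0.4, 0.4)):
    """Weighted mix of inventor, IPC, and keyword overlap (weights are assumed)."""
    w_inv, w_ipc, w_kw = weights
    return (w_inv * jaccard(x.inventors, y.inventors)
            + w_ipc * jaccard(x.ipc_codes, y.ipc_codes)
            + w_kw * jaccard(x.keywords, y.keywords))


def similar_documents(target, corpus, k=2, threshold=0.3):
    """Return up to k (similarity, doc_id) pairs at or above the threshold."""
    scored = [(similarity(target, d), d.doc_id)
              for d in corpus if d.doc_id != target.doc_id]
    scored = [s for s in scored if s[0] >= threshold]
    scored.sort(key=lambda s: (-s[0], s[1]))  # most similar first
    return scored[:k]


# Illustrative data only.
doc_x = PatentDoc("ID_X", frozenset({"Inventor A"}), frozenset({"G06F16/33"}),
                  frozenset({"search", "document", "graph"}))
doc_y = PatentDoc("ID_Y", frozenset({"Inventor A"}), frozenset({"G06F16/33"}),
                  frozenset({"search", "document", "index"}))
doc_z = PatentDoc("ID_Z", frozenset({"Inventor B"}), frozenset({"H04L9/00"}),
                  frozenset({"encryption", "key"}))

neighbours = similar_documents(doc_x, [doc_x, doc_y, doc_z])
```

With this data only "ID_Y" clears the threshold for "ID_X", since the two share an inventor, an IPC code, and half of their keywords, while "ID_Z" shares nothing.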

The similarity calculation unit 110 stores the calculated similarity (distance) between each pair of documents in the document distance database 140.

Here, examples of the document database 200 and the document distance database 140 will be described with reference to the drawings. FIG. 3 is a conceptual diagram of the document database 200, and FIG. 4 is a conceptual diagram of the document distance database 140. In the document database 200, each identifier (document ID) that identifies a document can be regarded as being associated with a set of texts (a text group). For example, the document with document ID "ID_X" is associated with "text a", "text b", "text c", "text d", and so on.

The similarity calculation unit 110 obtains the similarity between the documents included in the document database 200 and associates documents that are close to each other as a document group. The document distance database 140 stores groups of documents that are close in distance (high in similarity). In the example of FIG. 4, the document IDs "ID_X", "ID_Y", "ID_H", "ID_J", and so on are stored as a group of highly similar documents. The figure is only an example, and the database structure is not limited to it. For example, each document may instead be associated with other documents according to their similarity.
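One possible shape for the document distance database is a mapping from each document ID to its similar documents, which corresponds to the per-document alternative mentioned above. The sketch below assumes that pairwise similarities have already been computed; the function name, the threshold, and the similarity values are hypothetical.

```python
from collections import defaultdict


def build_distance_db(pairwise, threshold):
    """Map each document ID to the IDs of documents whose similarity to it
    is at or above the threshold, most similar first."""
    neighbours = defaultdict(list)
    for (a, b), sim in pairwise.items():
        if sim >= threshold:
            # Similarity is treated as symmetric, so record both directions.
            neighbours[a].append((b, sim))
            neighbours[b].append((a, sim))
    return {
        doc_id: [other for other, _ in sorted(pairs, key=lambda p: (-p[1], p[0]))]
        for doc_id, pairs in neighbours.items()
    }


# Illustrative similarity values only.
pairwise = {
    ("ID_X", "ID_Y"): 0.8,
    ("ID_X", "ID_H"): 0.6,
    ("ID_X", "ID_Z"): 0.1,
    ("ID_Y", "ID_H"): 0.5,
}
distance_db = build_distance_db(pairwise, threshold=0.5)
```

Documents with no neighbour above the threshold, such as "ID_Z" here, simply have no entry and would be left unextended.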

The document extension unit 120 generates one extended document, which extends one document included in the document database 200, by using similar documents, that is, those of the plurality of other documents different from the one document whose similarity to the one document is equal to or greater than a predetermined value.

The extension processing by the document extension unit 120 will be described concretely with reference to FIG. 5. In the example of FIG. 5, the document to be extended is the document with document ID "ID_X", and its similar document is the document with document ID "ID_Y". The document extension unit 120 converts text a, included in document "ID_X", and text q, included in document "ID_Y", into graph sets Ga and Gq, respectively, using Bag of Words. In the example of FIG. 5(a), text a is "そういえば、今日の天気は晴れのち曇りです。" ("By the way, today's weather is sunny, later cloudy."), and text q is "台風の接近により、明日の天気は晴れのち雨です。" ("Due to the approaching typhoon, tomorrow's weather is sunny, later rain.").

With a window width of 3, and glossing each Japanese token in English, the graph set Ga is (by the way, today's, weather), (today's, weather, sunny), (weather, sunny, later), (sunny, later, cloudy), (later, cloudy, is), and the graph set Gq is (typhoon's, due to approach, tomorrow's), (due to approach, tomorrow's, weather), (tomorrow's, weather, sunny), (weather, sunny, later), (sunny, later, rain), (later, rain, is). The document extension unit 120 compares the graph sets Ga and Gq and adds (integrates) into Ga those graphs of Gq that share a predetermined number of elements (here, 2), namely (tomorrow's, weather, sunny), (sunny, later, rain), and (later, rain, is), to generate the extended graph set Ga'. The graph (weather, sunny, later) is already in Ga, so it is not added again. That is, the extended graph set Ga' is (by the way, today's, weather), (today's, weather, sunny), (weather, sunny, later), (sunny, later, cloudy), (later, cloudy, is), (tomorrow's, weather, sunny), (sunny, later, rain), (later, rain, is). The document extension unit 120 then stores the extended graph set Ga' in the extended document database 150 as the search feature quantity of document ID "ID_X".
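The worked example above can be reproduced with a short sketch. It assumes the texts are already tokenized (the source does not name a tokenizer) and uses English glosses of the Japanese tokens; the function names `windows` and `extend` are hypothetical.

```python
def windows(tokens, width=3):
    """Slide a window of the given width over the tokens; each window is one graph."""
    return [tuple(tokens[i:i + width]) for i in range(len(tokens) - width + 1)]


def extend(base, similar, min_shared=2):
    """Add to `base` every graph of `similar` that shares at least `min_shared`
    elements with some graph already in `base`, skipping duplicates."""
    extended = list(base)
    seen = set(base)
    for graph in similar:
        if graph in seen:
            continue  # already present, nothing to add
        if any(len(set(graph) & set(b)) >= min_shared for b in base):
            extended.append(graph)
            seen.add(graph)
    return extended


# English glosses of the tokens of text a and text q.
text_a = ["by the way", "today's", "weather", "sunny", "later", "cloudy", "is"]
text_q = ["typhoon's", "due to approach", "tomorrow's", "weather",
          "sunny", "later", "rain", "is"]

graphs_a = windows(text_a)             # Ga: 5 graphs
graphs_q = windows(text_q)             # Gq: 6 graphs
extended_a = extend(graphs_a, graphs_q)  # Ga': 8 graphs
```

Here "sharing" is taken to mean that two graphs have at least two tokens in common regardless of position, which is one reading that yields the same three added graphs as the example.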

The document extension unit 120 creates extended documents as described above for all documents included in the document database 200 and stores them in the extended document database 150. As a result, an extended document is generated for each text included in each document, producing an extended document database 150 such as that shown in FIG. 5(b).

Although the window width is 3 in the above description, it is not limited to this and may be set arbitrarily. The number of shared graph elements, which is the criterion for selecting the graphs to integrate, may also be set arbitrarily. For the conversion into graph sets, a tree-structured graph based on the dependency structure of the text may be used instead of Bag of Words. By using tree-structured graphs and extending documents according to the similarity of the graph structures between texts, a document can be extended with more relevant graphs.

When a search condition 300 is input, the search unit 130 searches the graph sets stored in the extended document database 150 rather than the document database 200. The search of the extended document database 150 can be performed by converting the search text given as the search condition into graphs in the same manner as the document extension unit 120 and determining the degree of matching between graphs. The search may instead be decided by keyword matches against the individual graph elements. The search unit 130 acquires from the document database 200 the one document corresponding to the extended document retrieved from the extended document database 150 (that is, the original document before extension) and outputs it as the search result.
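A minimal sketch of this search follows. It assumes that the degree of matching is the fraction of query graphs found verbatim in a document's extended graph set; the source does not fix the measure, and the threshold, function names, and sample databases are hypothetical.

```python
def windows(tokens, width=3):
    """Convert a token list into graphs using the same windowing as the extension."""
    return [tuple(tokens[i:i + width]) for i in range(len(tokens) - width + 1)]


def match_degree(query_graphs, doc_graphs):
    """Fraction of query graphs that appear in the document's graph set."""
    if not query_graphs:
        return 0.0
    doc_set = set(doc_graphs)
    return sum(1 for g in query_graphs if g in doc_set) / len(query_graphs)


def search(query_tokens, extended_db, document_db, min_degree=0.5):
    """Search the extended graph sets, then return the original documents."""
    query_graphs = windows(query_tokens)
    hits = []
    for doc_id, graphs in extended_db.items():
        degree = match_degree(query_graphs, graphs)
        if degree >= min_degree:
            hits.append((degree, doc_id))
    hits.sort(key=lambda h: (-h[0], h[1]))
    # Output the pre-extension document, not the extended graph set.
    return [(doc_id, document_db[doc_id]) for _, doc_id in hits]


# Illustrative databases only.
document_db = {
    "ID_X": "By the way, today's weather is sunny, later cloudy.",
    "ID_Z": "Stock prices fell sharply.",
}
extended_db = {
    "ID_X": [("today's", "weather", "sunny"),
             ("tomorrow's", "weather", "sunny"),   # added by extension
             ("sunny", "later", "rain")],          # added by extension
    "ID_Z": [("stock", "prices", "fell")],
}

results = search(["tomorrow's", "weather", "sunny", "later", "rain"],
                 extended_db, document_db)
```

The query about tomorrow's rain finds "ID_X" only through the graphs added by extension, which is the reduction in search omissions that the text describes.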

Note that when the retrieved document is an extended document, the search unit 130 may lower its weight as a search result. This helps maintain the accuracy of the search results.
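One way to read this weighting is that a match on a graph added by extension counts for less than a match on a graph from the original text. The sketch below takes that reading as an assumption; the source does not define the weighting, and the function name and the value of `extension_weight` are hypothetical.

```python
def weighted_score(query_graphs, original_graphs, added_graphs, extension_weight=0.5):
    """Score a document: full credit for matches in its original graphs,
    reduced credit for matches that exist only because of extension."""
    original = set(original_graphs)
    added = set(added_graphs) - original
    score = 0.0
    for graph in set(query_graphs):
        if graph in original:
            score += 1.0
        elif graph in added:
            score += extension_weight
    return score


query = [("sunny", "later", "rain"), ("later", "rain", "is")]

# Document P contains both query graphs in its own text.
score_p = weighted_score(query, original_graphs=query, added_graphs=[])

# Document Q contains one query graph itself and gained the other by extension.
score_q = weighted_score(query,
                         original_graphs=[("sunny", "later", "rain")],
                         added_graphs=[("later", "rain", "is")])
```

Under this scheme a document that matches through its own text ranks above one that matches partly through borrowed graphs, while the latter is still retrieved.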

<Search processing>
Next, the search processing by the document retrieval device 100 will be described with reference to the flow diagram of FIG. 6.

First, the similarity calculation unit 110 obtains the similarity between one document included in the document set and a plurality of other documents included in the document set (step S11). The document extension unit 120 generates an extended document, which extends the one document, by using similar documents, that is, those of the plurality of other documents whose similarity to the one document is equal to or greater than a predetermined value (step S12). The search unit 130 accepts a search condition (step S13) and searches for an extended document corresponding to the search condition (step S14). The search unit 130 outputs the document corresponding to the retrieved extended document as the search result 310 (step S15).
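Steps S11 to S15 can be strung together in a compact sketch. Everything concrete here is an assumption made for illustration: documents are pre-tokenized lists of English glosses, similarity is the Jaccard overlap of token sets, a hit is any exact graph match, and the thresholds and function names are hypothetical.

```python
def windows(tokens, width=3):
    """Token list -> list of window graphs."""
    return [tuple(tokens[i:i + width]) for i in range(len(tokens) - width + 1)]


def similarity(tokens_a, tokens_b):
    """S11: similarity as the Jaccard overlap of shared tokens (assumed measure)."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def build_extended_db(document_db, sim_threshold=0.3, min_shared=2):
    """S11 + S12: extend every document with graphs from its similar documents."""
    extended_db = {}
    for doc_id, tokens in document_db.items():
        base = windows(tokens)
        graphs = list(base)
        for other_id, other_tokens in document_db.items():
            if other_id == doc_id or similarity(tokens, other_tokens) < sim_threshold:
                continue
            for g in windows(other_tokens):
                shares = any(len(set(g) & set(b)) >= min_shared for b in base)
                if g not in graphs and shares:
                    graphs.append(g)
        extended_db[doc_id] = graphs
    return extended_db


def search(query_tokens, extended_db, document_db):
    """S13 to S15: accept a query, match it against the extended graphs,
    and output the corresponding original documents."""
    query_graphs = set(windows(query_tokens))
    return {doc_id: document_db[doc_id]
            for doc_id, graphs in extended_db.items()
            if query_graphs & set(graphs)}


document_db = {
    "ID_X": ["by the way", "today's", "weather", "sunny", "later", "cloudy", "is"],
    "ID_Y": ["typhoon's", "due to approach", "tomorrow's", "weather",
             "sunny", "later", "rain", "is"],
    "ID_Z": ["stock", "prices", "fell", "sharply", "today"],
}
extended_db = build_extended_db(document_db)
result = search(["sunny", "later", "rain"], extended_db, document_db)
```

A search over the unextended documents would return only "ID_Y" for this query, whereas the extended database also returns "ID_X", and the unrelated "ID_Z" is returned in neither case.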

As described above, according to one embodiment of the present invention, keywords that co-occur with keywords in one document, as well as similar text, are stored in the extended document database 150, which raises the probability of a match with the search text given as the search condition. Because an extended document is built from documents that the similarity calculation unit 110 has classified in advance as similar documents, a retrieval rate at least as high as the accuracy of the existing method can be secured. Furthermore, because the text used for extension is further selected from the similar documents according to graph similarity, the search range can be widened without lowering the search precision (the proportion of retrieved documents that are relevant).

The document extension unit 120 may generate an extended document for only a predetermined component of the one document. For example, when the one document is a patent document, only the claims may be the target of extension, and the specification of the same document and the specifications of the similar documents may serve as the text from which the extension graph is obtained. This makes it possible to run a search expanded with synonyms while keeping the search target narrow.

The document retrieval device 100 according to the embodiment of the present invention described above may be applied to an intellectual property creation support system. FIG. 7 shows a configuration example of such a system. As shown in FIG. 7, the intellectual property creation support system 500 includes the document retrieval device 100, the document database (DB) 200, and communication terminals 400 (400A to 400D), which are connected to one another via a network NET. The number of communication terminals 400 is not limited to the number shown in the figure.

The communication terminals 400A to 400D are the terminals of users who use the intellectual property creation support service provided by the intellectual property creation support system 500. In the intellectual property creation support system 500, the document retrieval device 100 can search for prior art documents. Although FIG. 7 shows the communication terminals 400 as notebook and desktop personal computers, a communication terminal 400 may be of any type as long as it can use the document search service via the network NET. The communication terminal 400 may include, for example, a smartphone, a mobile phone (feature phone), a handheld computer device (for example, a PDA (Personal Digital Assistant)), a wearable terminal (for example, a glasses-type device, a watch-type device, or a head-mounted display (HMD)), another type of computer, or a communication platform. The communication terminal 400 accepts an input operation from the user and transmits the search condition to the document retrieval device 100 via the network NET.

The document database 200 can be, for example, a patent office database. It may contain the database of a single office or of multiple offices. Because the databases of the five offices of the United States, Europe, Japan, China, and South Korea together cover about 90% of the world's patents, including these five databases is desirable for improving the accuracy of prior art searches. The database is not limited to those described above and may be information available on the Internet.

The network NET may include a wireless network or a wired network. Specifically, the network NET may be a wireless LAN (WLAN), a wide area network (WAN), ISDN (integrated services digital network), LTE (Long Term Evolution), LTE-Advanced, fourth generation (4G), fifth generation (5G), CDMA (code division multiple access), or the like. The network NET is not limited to these examples and may be, for example, a public switched telephone network (PSTN), Bluetooth (registered trademark), an optical line, an ADSL (Asymmetric Digital Subscriber Line) line, a satellite communication network, or the like. The network NET may also be a combination of these.

In the intellectual property creation support system 500, the user can perform at least a prior art search, an invalidity search, and a clearance search. The user transmits the search condition to the document retrieval device 100 via the communication terminal 400. In the case of a prior art search, for example, the user transmits, as the search condition, information about the user's own invention, such as invention text describing the invention. The search unit 130 extracts keywords that represent the invention from the invention text as search keywords, and then extracts a group of similar patent texts from the document database 200 based on the extracted search keywords. In doing so, the document retrieval device 100 extends the document database 200 as described above, and the search unit 130 searches the extended document database 150.
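The embodiment does not specify how representative keywords are extracted from the invention text. As one illustrative possibility only, the sketch below ranks words by frequency after removing a small stop-word list; the function `extract_keywords`, the `STOPWORDS` set, and the `top_n` parameter are assumptions introduced for this example.

```python
import re
from collections import Counter
from typing import List

# Assumed, deliberately small stop-word list for the example.
STOPWORDS = frozenset({
    "a", "an", "the", "of", "and", "to", "in", "for", "is", "that", "with", "by", "on",
})


def extract_keywords(invention_text: str, top_n: int = 5) -> List[str]:
    """Return the `top_n` most frequent non-stop-words (ties broken alphabetically)."""
    words = re.findall(r"[a-z]+", invention_text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


keywords = extract_keywords(
    "A cooling channel for a turbine blade. "
    "The cooling channel reduces blade temperature.",
    top_n=3,
)
```

The extracted keywords would then serve as the query against the extended document database 150. A production system would more plausibly use a weighting scheme such as TF-IDF and a morphological analyzer for Japanese text; frequency counting is used here only to keep the sketch self-contained.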

Although the present invention has been described with reference to the drawings and examples, it should be noted that those skilled in the art can easily make various variations and modifications based on the present disclosure. Such variations and modifications are therefore included in the scope of the present invention. For example, the functions included in each component, each step, and so on can be rearranged as long as no logical inconsistency results, and a plurality of components or steps can be combined into one or divided. The configurations shown in the above embodiments may also be combined as appropriate. For example, the components described as being included in the document retrieval device 100 may be physically distributed across a plurality of computers or may be realized on a single computer.

The program of each embodiment of the present disclosure may be provided in a state stored in a computer-readable storage medium. The storage medium can store the program in a "non-transitory tangible medium". The program includes, for example, a software program or a computer program.

Where appropriate, the storage medium can include one or more semiconductor-based or other integrated circuits (ICs) (for example, a field-programmable gate array (FPGA) or an application-specific IC (ASIC)), a hard disk drive (HDD), a hybrid hard drive (HHD), an optical disk, an optical disk drive (ODD), a magneto-optical disk, a magneto-optical drive, a floppy diskette, a floppy disk drive (FDD), magnetic tape, a solid-state drive (SSD), a RAM drive, a Secure Digital card or drive, any other suitable storage medium, or any suitable combination of two or more of these. Where appropriate, the storage medium may be volatile, non-volatile, or a combination of volatile and non-volatile.

The program of the present disclosure may also be provided to an information processing apparatus via any transmission medium (a communication network, broadcast waves, or the like) capable of transmitting the program.

Each embodiment of the present disclosure can also be realized in the form of a data signal embedded in a carrier wave, in which the program is embodied by electronic transmission.

The program of the present disclosure is implemented using, for example, a scripting language such as JavaScript (registered trademark) or Python, or C, Go, Swift, Kotlin, Java (registered trademark), or the like.

100 Document retrieval device
101 Processor
102 Memory
103 Storage
110 Similarity calculation unit
120 Document extension unit
130 Search unit
140 Document distance database
150 Extended document database
200 Document database
300 Search condition
310 Search result
400 Communication terminal
400A to 400D Communication terminals
500 Intellectual property creation support system

Claims

A document retrieval device having a search unit that searches for documents corresponding to a received search condition, the document retrieval device further comprising:
a similarity calculation unit that obtains the similarity between one document included in a document set and a plurality of other documents, different from the one document, included in the document set; and
a document extension unit that generates an extended document, which extends the one document, by using a similar document, the similar document being one of the plurality of other documents whose similarity to the one document is equal to or greater than a predetermined value,
wherein the document extension unit creates the extended document for every document in the document set, based on the similarities calculated by the similarity calculation unit for all the documents included in the document set, and
wherein the search unit searches the extended documents for all the documents for an extended document corresponding to the search condition, and outputs the one document corresponding to the retrieved extended document as a search result.

The document retrieval device according to claim 1, wherein the similarity calculation unit classifies, by the k-nearest neighbor method, the other documents whose similarity is equal to or greater than the predetermined value.

The document retrieval device according to claim 1 or 2, wherein the document extension unit compares the graph structure of the one document with the graph structure of the similar document of the one document, and integrates graph elements of the similar document that have a high degree of similarity into the one document, thereby generating the extended document of the one document.

The document retrieval device according to claim 3, wherein the document extension unit compares the one document with the similar document of the one document in terms of the graph elements contained within a predetermined window width, and integrates into the one document the graph elements of the similar document contained in a window having a high degree of matching.

The document retrieval device according to any one of claims 1 to 4, wherein the search unit accepts search text as the search condition, searches for an extended document whose degree of matching with the search text in graph units is equal to or greater than a predetermined value, and outputs the one document corresponding to the retrieved extended document as the search result.

The document retrieval device according to any one of claims 1 to 4, wherein the search unit accepts search text as the search condition, searches for an extended document containing a keyword included in the search text, and outputs the one document corresponding to the retrieved extended document as the search result.

The document retrieval device according to any one of claims 1 to 6, wherein the one document consists of a plurality of components, and the document extension unit generates the extended document for a predetermined component of the one document.

The document retrieval device according to any one of claims 1 to 7, wherein the document set is a set of patent documents having bibliographic information, and the similarity calculation unit obtains the similarity between the one document and the plurality of other documents according to the degree of agreement of the bibliographic information.

A document search method for searching for documents corresponding to a received search condition, in which a computer executes:
a similarity calculation step of obtaining the similarity between one document included in a document set and a plurality of other documents, different from the one document, included in the document set; and
a document extension step of generating an extended document, which extends the one document, by using a similar document, the similar document being one of the plurality of other documents whose similarity to the one document is equal to or greater than a predetermined value,
wherein the document extension step creates the extended document for every document in the document set, based on the similarities calculated in the similarity calculation step for all the documents included in the document set, and
wherein the computer further executes a search step of searching the extended documents for all the documents for an extended document corresponding to the search condition, and outputting the one document corresponding to the retrieved extended document as a search result.

A document search program for searching for documents corresponding to a received search condition, the program causing a computer to realize:
a similarity calculation function of obtaining the similarity between one document included in a document set and a plurality of other documents, different from the one document, included in the document set; and
a document extension function of generating an extended document, which extends the one document, by using a similar document, the similar document being one of the plurality of other documents whose similarity to the one document is equal to or greater than a predetermined value,
wherein the document extension function creates the extended document for every document in the document set, based on the similarities calculated by the similarity calculation function for all the documents included in the document set, and
wherein the program further causes the computer to realize a document search function of searching the extended documents for all the documents for an extended document corresponding to the search condition, and outputting the one document corresponding to the retrieved extended document as a search result.