JP2011022630A - Information processor and information processing program

Info

Publication number: JP2011022630A
Application number: JP2009164390A
Authority: JP
Inventors: Hiroshi Masuichi; 博増市
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2009-07-13
Filing date: 2009-07-13
Publication date: 2011-02-03
Anticipated expiration: 2029-07-13
Also published as: JP5391887B2

Abstract

PROBLEM TO BE SOLVED: To provide an information processor that suppresses the retrieval of documents containing sentences that are similar even to sentences in the target document whose syntax is not similar.

SOLUTION: In the information processor, a document storage means stores documents; a sentence extraction means extracts sentences from a target document; a sentence set generation means generates a set of sentences on the basis of the syntax of the sentences extracted by the sentence extraction means; a similar sentence retrieval means retrieves, from the sentences in the documents stored in the document storage means, a second sentence similar to a first sentence in the set generated by the sentence set generation means; and a relevant document retrieval means retrieves documents related to the target document from the document storage means on the basis of the second sentence retrieved by the similar sentence retrieval means.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to an information processing apparatus and an information processing program.

Patent Document 1 addresses the problem of obtaining search results with little noise while handling varied expressions and preventing search omissions. It discloses a method comprising: an input step of inputting a search sentence; a morphological analysis step of dividing the input search sentence into words on the basis of an analysis word dictionary; a syntax analysis step of analyzing the syntactic dependency relations between the words; a primary search step of searching a document database on the basis of the result of the syntax analysis step; and a semantic matching step of semantically matching the search sentence against the results of the primary search step, on the basis of a conceptual knowledge database storing domain-dependent conceptual knowledge, and outputting similar or dissimilar search results.

Patent Document 1: JP 2000-242650 A

An object of the present invention is to provide an information processing apparatus and an information processing program that suppress the retrieval of documents containing sentences that are similar even to sentences in the target document whose syntax is not similar.

The gist of the present invention for achieving this object lies in the inventions of the following claims.
The invention of claim 1 is an information processing apparatus comprising: document storage means for storing documents; sentence extraction means for extracting sentences from a target document; sentence set generation means for generating a set of the sentences on the basis of the syntax of the sentences extracted by the sentence extraction means; similar sentence search means for searching the sentences in the documents stored in the document storage means for a second sentence similar to a first sentence in the set of sentences generated by the sentence set generation means; and related document search means for searching the document storage means for a document related to the target document on the basis of the second sentence found by the similar sentence search means.

The invention of claim 2 is the information processing apparatus according to claim 1, further comprising ordering means for ordering the documents found by the related document search means on the basis of: the number of second sentences contained in a found document; the similarity between a second sentence and its corresponding first sentence; the result of comparing the order in which the first sentences corresponding to the second sentences appear in the target document with the order in which the second sentences appear in the found document; or a combination of these.

The invention of claim 3 is the information processing apparatus according to claim 2, wherein the ordering means performs the ordering on the basis of the similarity between a document found by the related document search means and the target document.

The invention of claim 4 is an information processing program that causes a computer to function as: document storage means for storing documents; sentence extraction means for extracting sentences from a target document; sentence set generation means for generating a set of the sentences on the basis of the syntax of the sentences extracted by the sentence extraction means; similar sentence search means for searching the sentences in the documents stored in the document storage means for a second sentence similar to a first sentence in the set of sentences generated by the sentence set generation means; and related document search means for searching the document storage means for a document related to the target document on the basis of the second sentence found by the similar sentence search means.

According to the information processing apparatus of claim 1, it is possible to suppress the retrieval of documents containing sentences that are similar even to sentences in the target document whose syntax is not similar.

According to the information processing apparatus of claim 2, the documents related to the target document can be ordered.

According to the information processing apparatus of claim 3, documents that are similar to the target document can be separated from documents that are not similar to the target document but contain sentences similar to sentences in the set.

According to the information processing program of claim 4, it is possible to suppress the retrieval of documents containing sentences that are similar even to sentences in the target document whose syntax is not similar.

FIG. 1 is a conceptual module configuration diagram of a configuration example of the present embodiment.
FIG. 2 is a flowchart showing an example of processing according to the present embodiment.
FIG. 3 is an explanatory diagram showing examples of documents stored in the document storage module of the present embodiment.
FIG. 4 is an explanatory diagram showing an example of sentences in a target document.
FIG. 5 is an explanatory diagram showing an example of sentences in a document ranked highly by the related document ranking module of the present embodiment.
FIG. 6 is an explanatory diagram showing an example of sentences in a document ranked highly by the related document ranking module of the present embodiment.
FIG. 7 is an explanatory diagram showing an example of sentences in a target document.
FIG. 8 is an explanatory diagram showing an example of sentences in a document ranked highly by the related document ranking module of the present embodiment.
FIG. 9 is an explanatory diagram showing an example of the data structure of the sentence/document table.
FIG. 10 is an explanatory diagram showing an example of the data structure of the sentence/group table.
FIG. 11 is an explanatory diagram showing an example of the data structure of the similarity table.
FIG. 12 is a block diagram showing an example of the hardware configuration of a computer that implements the present embodiment.

Hereinafter, an example of a preferred embodiment for realizing the present invention will be described with reference to the drawings.
FIG. 1 is a conceptual module configuration diagram of a configuration example of the present embodiment.
A module generally refers to a logically separable component such as software (a computer program) or hardware. Accordingly, a module in the present embodiment refers not only to a module in a computer program but also to a module in a hardware configuration, and the present embodiment therefore also serves as a description of a computer program, a system, and a method. For convenience of explanation, the terms “store”, “cause to store”, and their equivalents are used; when the embodiment is a computer program, these terms mean storing in a storage device or performing control so that data is stored in a storage device. Modules correspond almost one-to-one to functions, but in an implementation one module may consist of one program, multiple modules may consist of one program, or, conversely, one module may consist of multiple programs. Multiple modules may be executed by one computer, or one module may be executed by multiple computers in a distributed or parallel environment. One module may contain another module. In the following, “connection” refers not only to a physical connection but also to a logical connection (exchange of data, instructions, reference relationships between data, and the like).
A system or apparatus may consist of multiple computers, pieces of hardware, devices, and the like connected by communication means such as a network (including one-to-one communication connections), or may be realized by a single computer, piece of hardware, device, or the like. “Apparatus” and “system” are used synonymously. “Predetermined” means determined before the process in question. This covers not only determination before the processing of the present embodiment starts but also, even after that processing has started, determination according to the situation or state at that time, or up to that time, as long as it precedes the process in question.

The information processing apparatus of the present embodiment searches for documents related to a target document and, as shown in FIG. 1, has a document reception module 110, a document storage module 120, a sentence extraction module 130, a similar syntax set generation module 140, a similar sentence search module 150, a related document search module 160, a related document ranking module 170, and a related document output module 180.

In particular, this information processing apparatus may be applied to a document management system such as the following.
In recent years, against the background of social demands on companies for thorough compliance, the need for strict document management has been growing. For example, it is essential to write two kinds of text accurately: declarations of conformity, which certify that a product complies with regulations restricting the content of substances in products, such as RoHS (Restricting the use of Hazardous Substances) and REACH (Registration, Evaluation, Authorisation and Restriction of CHemicals, the European chemicals regulation); and confidential information markings, which are attached to documents to indicate the confidentiality level and disclosure scope of confidential information. To confirm, for example, that the wording of a newly created declaration of conformity is consistent with the standards (definition documents) it must follow, such as RoHS and REACH, and with the declarations of conformity of similar products, those documents need to be consulted as reference information. A document management system is used in such cases.
The following description mainly takes as an example the case where the information processing apparatus is used for this kind of document management.

The document reception module 110 is connected to the sentence extraction module 130. It receives the document for which related documents are to be searched (the target document). A document here consists of text data and may in some cases also include electronic data such as images, video, and audio, or a combination of these. It is something that can be stored, edited, and searched and that can be exchanged as an individual unit between systems or users, and the term includes anything similar. The target document contains sentences. There may be one target document or several. A newly created declaration of conformity is one example.
Receiving a document includes, for example, reading a document stored on a hard disk (whether built into the computer or connected via a network) and performing character recognition on an image read by a scanner, camera, or the like.

The document storage module 120 is accessed by the sentence extraction module 130, the similar sentence search module 150, the related document search module 160, and the related document ranking module 170. The documents stored by the document storage module 120 are described with reference to FIG. 3, which is an explanatory diagram showing examples of documents stored in the document storage module 120 of the present embodiment. The document storage module 120 stores previously created documents, such as a definition document group 310, a group 320 of declarations of conformity from suppliers, and a design specification group 330, and stores the documents in a created document group 340 as target documents.
The definition document group 310 includes documents such as “RoHS Directive” and “REACH Revision”; the group 320 of declarations of conformity from suppliers includes documents such as “Declaration of conformity for part a from the supplier”; and the design specification group 330 includes documents such as “Design specification of ○○ Product A”. Documents received by the document reception module 110 and stored in the document storage module 120 include, for example, “Declaration of conformity of ○○ Product A for XX”.

The sentence extraction module 130 is connected to the document reception module 110, the document storage module 120, and the related document ranking module 170. It extracts sentences from the target document received from the document reception module 110 and passes them to the similar syntax set generation module 140. It may also extract sentences from the documents stored in the document storage module 120 and store those sentences in the document storage module 120. The sentence extraction module 130 divides the text data in a document into sentences at full stops and line break characters; that is, each full stop or line break marks a sentence boundary.
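The splitting rule described for the sentence extraction module 130 can be illustrated with a minimal Python sketch. The function name `extract_sentences` and the choice to treat an ASCII period as a boundary only when whitespace follows it (so that decimals such as 0.1 stay intact) are assumptions made for illustration; the source only states that full stops and line breaks delimit sentences.

```python
import re

def extract_sentences(text: str) -> list[str]:
    """Split text into sentences at full stops and line breaks."""
    sentences = []
    for line in text.splitlines():  # a line break always ends a sentence
        # Split after a Japanese full stop, or after an ASCII period that is
        # followed by whitespace (assumption: keeps decimals like 0.1 whole).
        for piece in re.split(r"(?<=。)|(?<=\.)\s+", line):
            piece = piece.strip()
            if piece:
                sentences.append(piece)
    return sentences
```

For example, a line holding two statements followed by a second line yields three sentences, one per full stop or line break.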

The similar syntax set generation module 140 is connected to the sentence extraction module 130 and the similar sentence search module 150. It generates sets of sentences on the basis of the syntax of the sentences extracted by the sentence extraction module 130.
Compliance-related documents such as the declarations of conformity and confidential information markings described above characteristically contain multiple statements, set out in parallel, each indicating conformity with a compliance item. For example, a declaration of conformity contains a run of parallel statements such as “Hexavalent chromium is 500 ppm or less. Polybrominated diphenyl ether is 20 ppm or less. …”. When the target is a declaration of conformity or a document with confidential information markings, the similar syntax set generation module 140 generates a set of sentences in which multiple facts or definitions are stated in parallel.

More specifically, for example, the similar syntax set generation module 140 compares every pair of sentences received from the sentence extraction module 130 and identifies groups of sentences whose similarity is equal to or greater than a predetermined threshold T1. The similarity is determined using, for example, the sentence comparison means disclosed in Patent Document 1. That is, the module has an analysis word dictionary and divides each sentence into words on the basis of that dictionary. It then analyzes the syntactic dependency relations between the words and generates the sets of sentences on the basis of the resulting syntactic structures; in other words, it groups sentences whose syntactic structures have a similarity of T1 or more (including the case where they are identical). The module may also have a thesaurus and add, as a further grouping condition, that syntactically corresponding words be similar in the thesaurus (specifically, for example, that their distance in the thesaurus be equal to or less than a predetermined threshold T2).
When multiple sets of sentences result, only the sets whose number of member sentences is greater than a preset threshold T3 are kept, and the other sets are excluded from processing.
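The grouping by thresholds T1 and T3 can be sketched in Python as below. This is only an illustration under stated assumptions: `toy_similarity` is a crude stand-in that compares number-masked token sequences, not the dictionary-based dependency analysis of Patent Document 1, and merging pairs into groups with a union-find structure is one possible reading, since the source does not say how pairwise results are combined. All function names are hypothetical.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

def toy_structure(sentence):
    # Stand-in for a syntactic structure: lowercase tokens with numbers masked.
    tokens = re.findall(r"\d+(?:\.\d+)?|\w+|[^\w\s]", sentence.lower())
    return ["<num>" if t[0].isdigit() else t for t in tokens]

def toy_similarity(a, b):
    # Assumption: sequence overlap ratio replaces real structural similarity.
    return SequenceMatcher(None, toy_structure(a), toy_structure(b)).ratio()

def group_sentences(sentences, similarity, t1, t3):
    """Compare all pairs, merge pairs with similarity >= t1, keep groups larger than t3."""
    parent = list(range(len(sentences)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(sentences)), 2):  # brute-force comparison
        if similarity(sentences[i], sentences[j]) >= t1:
            parent[find(i)] = find(j)

    groups = {}
    for i, sentence in enumerate(sentences):
        groups.setdefault(find(i), []).append(sentence)
    return [g for g in groups.values() if len(g) > t3]
```

With the parallel statements from a declaration of conformity and one unrelated sentence, the parallel statements end up in a single group and the unrelated sentence is dropped by the T3 filter.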

The similar sentence search module 150 is connected to the document storage module 120, the similar syntax set generation module 140, the related document search module 160, and the related document ranking module 170. It searches the sentences in the documents stored in the document storage module 120 for a second sentence (hereinafter also called a “similar sentence”) that is similar to a first sentence in a set of sentences generated by the similar syntax set generation module 140.
Specifically, for example, for each first sentence belonging to the same set obtained from the similar syntax set generation module 140, it searches the sentences that the sentence extraction module 130 obtained from each document stored in the document storage module 120 for sentences similar to that first sentence. That is, it calculates the similarity between two sentences and returns as search results the sentences whose similarity is greater than a preset threshold T4. When the similar syntax set generation module 140 yields multiple sets, this search is executed for each set.

The method the similar sentence search module 150 uses to find similar sentences differs from the method the similar syntax set generation module 140 uses to generate sentence sets. The similar sentence search module 150 uses a search method that emphasizes the degree of word matching. For example, the word vector method disclosed in “Foundations of Statistical Natural Language Processing, The MIT Press (1999)” may be used.
Numeric words, however, are not treated as specific values such as “500” or “20” but are handled uniformly as a “numerical expression”.
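A word-based similarity with numbers collapsed into one token might look like the following Python sketch. It assumes plain term-frequency vectors compared by cosine similarity, which is one common form of a word vector method and not necessarily the exact formulation in the cited book; the names `word_vector`, `cosine`, and `search_similar` and the `<num>` placeholder are hypothetical.

```python
import math
import re
from collections import Counter

def word_vector(sentence):
    """Term-frequency vector; every number becomes the single token '<num>'."""
    tokens = re.findall(r"\d+(?:\.\d+)?|[^\W\d]+", sentence.lower())
    return Counter("<num>" if t[0].isdigit() else t for t in tokens)

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search_similar(first_sentence, stored_sentences, t4):
    """Return (sentence, similarity) for stored sentences scoring above t4."""
    query = word_vector(first_sentence)
    hits = []
    for sentence in stored_sentences:
        similarity = cosine(query, word_vector(sentence))
        if similarity > t4:
            hits.append((sentence, similarity))
    return hits
```

Because numbers are masked, two statements that differ only in the stated limit (for example 500 ppm versus 1000 ppm) produce identical vectors.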

The related document search module 160 is connected to the document storage module 120, the similar sentence search module 150, and the related document ranking module 170. On the basis of the similar sentences found by the similar sentence search module 150, it searches the document storage module 120 for documents related to the target document. Specifically, for example, from among the documents stored in the document storage module 120, it extracts the documents that contain more than one of the sentences obtained as search results from the similar sentence search module 150.
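The extraction of documents containing more than one similar sentence can be sketched as follows. The function name, the ID formats, and the `min_hits` parameter (defaulting to 2 to reflect "more than one") are assumptions for illustration only.

```python
from collections import defaultdict

def find_related_documents(similar_sentence_ids, sentence_to_document, min_hits=2):
    """Keep documents that contain at least min_hits of the similar sentences."""
    hits_per_document = defaultdict(set)
    for sentence_id in similar_sentence_ids:
        hits_per_document[sentence_to_document[sentence_id]].add(sentence_id)
    return {
        document_id: sentence_ids
        for document_id, sentence_ids in hits_per_document.items()
        if len(sentence_ids) >= min_hits
    }
```

A document that contributes only a single similar sentence is left out of the result.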

The related document ranking module 170 is connected to the document storage module 120, the sentence extraction module 130, the similar sentence search module 150, the related document search module 160, and the related document output module 180. The related document ranking module 170 ranks the documents obtained from the related document search module 160 in descending order of relevance. The information used for ranking is as follows.
(1) The number of similar sentences contained in a document found by the related document search module 160. This is obtained by counting, for each document found by the related document search module 160, how many of the similar sentences found by the similar sentence search module 150 it contains.
(2) The similarity between a similar sentence contained in a document found by the related document search module 160 and the first sentence corresponding to that similar sentence. For each document found by the related document search module 160, the similarity calculated by the similar sentence search module 150 is used.
(3) The result of comparing the order in which the first sentences corresponding to the similar sentences appear in the target document with the order in which the similar sentences appear in the document found by the related document search module 160. This is obtained, for each document found by the related document search module 160, by comparing the order of appearance of the similar sentences in that document with the order of appearance, in the target document, of the first sentences corresponding to them. The comparison value is calculated by a function that gives a high value when the orders are the same and a low value when they are reversed.
(4) A combination of two or more of (1), (2), and (3) above.
For example, the documents are arranged in descending order of the value of (1), (2), or (3), or of a combination of these values (for example, their sum, or the average of the values after each is multiplied by a predetermined weighting coefficient).
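One way to combine criteria (1) to (3) into a single ranking score is sketched below in Python. Several choices here are assumptions, not taken from the source: the order comparison uses the fraction of sentence pairs appearing in the same relative order (1.0 for identical order, 0.0 for fully reversed), the per-document similarity is the mean over matched sentences, and the combination is a weighted sum. All names are hypothetical.

```python
from itertools import combinations

def order_agreement(positions):
    """positions: (position in target document, position in found document) per match."""
    pairs = list(combinations(positions, 2))
    if not pairs:
        return 0.0  # assumption: a single match carries no order evidence
    same_order = sum(1 for (ta, ca), (tb, cb) in pairs if (ta - tb) * (ca - cb) > 0)
    return same_order / len(pairs)

def relevance(matches, weights=(1.0, 1.0, 1.0)):
    """matches: (target position, found position, similarity) per similar sentence."""
    if not matches:
        return 0.0
    count = len(matches)                                       # criterion (1)
    mean_similarity = sum(sim for _, _, sim in matches) / count  # criterion (2)
    order = order_agreement([(t, c) for t, c, _ in matches])     # criterion (3)
    w_count, w_similarity, w_order = weights
    return w_count * count + w_similarity * mean_similarity + w_order * order

def rank_documents(matches_by_document, weights=(1.0, 1.0, 1.0)):
    """Return document IDs sorted by descending relevance."""
    return sorted(
        matches_by_document,
        key=lambda doc: relevance(matches_by_document[doc], weights),
        reverse=True,
    )
```

Two documents with the same number of matches and the same similarities are then separated only by whether their matched sentences follow the order of the target document.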

The related document ranking module 170 may also perform the ordering described above on the basis of the similarity between a document found by the related document search module 160 and the target document. For example, the similarity between all the words in a found document and all the words in the target document may be calculated by a method equivalent to the similarity calculation used in the similar sentence search module 150, and the ordering may be performed only on documents whose similarity is higher than a predetermined threshold T5. That is, ordering is applied to documents that are similar to the target document as a whole. Alternatively, the ordering may be performed only on documents whose similarity is lower than a predetermined threshold T6. That is, ordering is applied to documents that are not similar to the target document as a whole but contain sentences similar to sentences in the set.
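The two alternative filters based on T5 and T6 can be written as a short Python sketch. It assumes the whole-document similarities have already been computed and are supplied as a mapping from document ID to similarity; the function names are hypothetical.

```python
def similar_as_a_whole(document_similarity, t5):
    """Documents whose whole-document similarity to the target exceeds t5."""
    return [doc for doc, sim in document_similarity.items() if sim > t5]

def similar_only_in_sentences(document_similarity, t6):
    """Documents below t6: dissimilar overall, yet found through similar sentences."""
    return [doc for doc, sim in document_similarity.items() if sim < t6]
```

Only the documents returned by the chosen filter would then be passed on to the ordering step.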

The related document output module 180 is connected to the related document ranking module 170. It outputs the documents ranked by the related document ranking module 170. The output may be the documents themselves or a list of their attributes (for example, titles). Output includes, for example, displaying on a display device such as a display, printing on a printing device such as a printer, transmitting an image with an image transmission device such as a fax machine, writing documents to a document storage device such as a document database, storing on a storage medium such as a memory card, and passing to another information processing apparatus.

FIG. 2 is a flowchart showing an example of processing according to the present embodiment, described here using a concrete example. In this example, the sentence extraction module 130 extracts sentences from the documents in the document storage module 120 in advance. The extraction results are stored in, for example, a sentence/document table 900. FIG. 9 is an explanatory diagram showing an example of the data structure of the sentence/document table 900. The sentence/document table 900 has a sentence ID column 902, a sentence column 904, and a document ID column 906; that is, it associates sentences with documents.
The sentence ID column 902 stores a sentence ID (identification) that uniquely identifies an extracted sentence.
The sentence column 904 stores the extracted sentence.
The document ID column 906 stores a document ID that uniquely identifies the document from which the sentence was extracted.
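The sentence/document table 900 could be represented in Python roughly as below. The class and function names and the `S-001` style of sentence ID are assumptions for illustration; the source specifies only the three columns.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SentenceDocumentRow:
    sentence_id: str  # sentence ID column 902
    sentence: str     # sentence column 904
    document_id: str  # document ID column 906

def build_sentence_document_table(documents):
    """documents: mapping of document ID -> list of already extracted sentences."""
    rows = []
    for document_id, sentences in documents.items():
        for sentence in sentences:
            # Hypothetical ID scheme: a running number across all documents.
            sentence_id = f"S-{len(rows) + 1:03d}"
            rows.append(SentenceDocumentRow(sentence_id, sentence, document_id))
    return rows
```

Each row links one extracted sentence back to its source document, which is what later allows similar sentences to be traced to related documents.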

In step S202, the document reception module 110 receives the target document.
In step S204, the sentence extraction module 130 extracts sentences from the text data in the received document. FIG. 4 is an explanatory diagram showing an example of sentences in a target document (a declaration of conformity), in which sentences 402 through 410 have been extracted. FIG. 7 is an explanatory diagram showing an example of sentences in a target document (a document containing a confidential-information notice), in which sentences 702 through 708 have been extracted. That is, a sentence is extracted each time either a full stop ("。") or a line break is found.
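The extraction rule of step S204 (cut at each full stop or line break) can be sketched as below. The function name `extract_sentences` and the `terminators` parameter are assumptions for illustration; the specification states only the rule itself.

```python
import re


def extract_sentences(text: str, terminators: str = "。") -> list[str]:
    """Split text into sentences at each terminator character or line break.

    The terminator stays attached to the sentence it ends. Pass
    terminators="." to try the same rule on English text (note that this
    naive rule would also split decimal numbers such as "0.1").
    """
    esc = re.escape(terminators)
    pattern = re.compile(f"[^{esc}]+[{esc}]?")
    sentences = []
    for line in text.splitlines():          # a line break always ends a sentence
        for piece in pattern.findall(line):  # a terminator ends a sentence
            piece = piece.strip()
            if piece:
                sentences.append(piece)
    return sentences


sample = "鉛の含有は、１０ｐｐｍ以下である。水銀の含有は、５ｐｐｍ以下である。\n適合宣言書\n"
for sentence in extract_sentences(sample):
    print(sentence)
```

Because a line break also ends a sentence, headings and list items without a closing full stop are still captured as separate sentences.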

In step S206, the similar syntax set generation module 140 calculates the similarity between sentences (based on the similarity of their syntactic structure) and classifies the sentences into groups. For example, among the sentences illustrated in FIG. 4, sentences 402, 406, and 408 are parallel statements (their syntactic structure is "the subject contains <substance name> and the predicate contains <numeric expression><unit>"), and they are identified as a group. Sentences 702 through 708 illustrated in FIG. 7 are likewise parallel statements and are identified as a group. The result of grouping the sentences in the target document is stored in, for example, the sentence/group table 1000. FIG. 10 is an explanatory diagram showing an example of the data structure of the sentence/group table 1000. The sentence/group table 1000 has a sentence ID column 1002, a sentence column 1004, and a group column 1006.
The sentence ID column 1002 stores a sentence ID that uniquely identifies a sentence in the target document.
The sentence column 1004 stores the extracted sentence.
The group column 1006 stores a group ID, which is the result of the grouping. In the example of FIG. 10, the sentences "A-005" and "A-007" belong to the same group.
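The grouping of step S206 can be sketched roughly as follows. The specification does not say how syntactic similarity is computed, so this sketch substitutes a crude stand-in for real syntactic analysis: each sentence is reduced to a "skeleton" in which numbers become `<NUM>` and content words become `<W>`, and skeletons are compared with `difflib.SequenceMatcher`. The function names, the `FUNCTION_WORDS` list, the greedy grouping, and the 0.8 threshold are all assumptions, and the example uses English sentences so that whitespace tokenization works.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of words kept verbatim; everything else is a content word.
FUNCTION_WORDS = {"the", "a", "of", "is", "was", "or", "by", "less", "more", "than", "ppm"}


def skeleton(sentence: str) -> list[str]:
    """Reduce a sentence to a coarse structural pattern (not a real parse)."""
    out: list[str] = []
    for token in re.findall(r"\d+(?:\.\d+)?|\w+|[^\w\s]", sentence.lower()):
        if token[0].isdigit():
            tag = "<NUM>"
        elif token in FUNCTION_WORDS or not token[0].isalnum():
            tag = token                      # keep function words and punctuation
        else:
            tag = "<W>"
        if tag == "<W>" and out and out[-1] == "<W>":
            continue                         # collapse runs of content words
        out.append(tag)
    return out


def group_by_syntax(sentences: list[str], threshold: float = 0.8) -> list[int]:
    """Return one group index per sentence, assigned greedily."""
    representatives: list[list[str]] = []
    groups: list[int] = []
    for sentence in sentences:
        sk = skeleton(sentence)
        for gid, rep in enumerate(representatives):
            if SequenceMatcher(None, rep, sk).ratio() >= threshold:
                groups.append(gid)
                break
        else:                                # no existing group is close enough
            representatives.append(sk)
            groups.append(len(representatives) - 1)
    return groups


sentences = [
    "The content of lead is 10 ppm or less.",
    "This declaration was issued by the manufacturer.",
    "The content of polybrominated diphenyl ether is 20 ppm or less.",
]
print(group_by_syntax(sentences))
```

The first and third sentences reduce to the same skeleton and land in one group, which mirrors the way sentences 402, 406, and 408 are grouped in the text. A production system would more plausibly compare parse trees or dependency structures.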

In step S208, the similar sentence search module 150 searches previously created documents (the documents in the document storage module 120) for sentences similar (based on word similarity) to each sentence belonging to a group.
For example, for sentence 406 illustrated in FIG. 4,
"The content of polybrominated diphenyl ether is 20 ppm or less."
sentences such as
"Polybrominated diphenyl ether has a content of 10 ppm or less."
"The content of polybrominated diphenyl ether shall be kept to 50 ppm or less."
are obtained as search results. That is, these sentences are retrieved as similar sentences because they contain the same words as sentence 406: "polybrominated diphenyl ether", "content", "<numeric expression> ppm", and "or less".
The similar sentence search results are stored in, for example, the similarity table 1100. FIG. 11 is an explanatory diagram showing an example of the data structure of the similarity table 1100. The similarity table 1100 has a target sentence ID column 1102, a similar sentence ID column 1104, a document ID column 1106, and a similarity column 1108.
The target sentence ID column 1102 stores the sentence ID of a sentence in the target document.
The similar sentence ID column 1104 stores the sentence ID of a similar sentence found by the search.
The document ID column 1106 stores the document ID of the document that contains the similar sentence.
The similarity column 1108 stores the similarity between the target sentence and the similar sentence.
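A word-based search like that of step S208 can be sketched as below. The specification does not name a similarity measure, so Jaccard overlap of word sets is used here as an assumed stand-in, with every number mapped to `<NUM>` so that "20 ppm" and "10 ppm" count as the same expression. The function names, tuple layouts, and 0.5 threshold are hypothetical, and English sentences are used for simple tokenization.

```python
import re


def word_set(sentence: str) -> set[str]:
    """Lower-cased words of a sentence, with every number replaced by <NUM>."""
    tokens = re.findall(r"\d+(?:\.\d+)?|[^\W\d]+", sentence.lower())
    return {"<NUM>" if t[0].isdigit() else t for t in tokens}


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search_similar(target_id, target, stored, threshold=0.5):
    """Return rows shaped like the similarity table 1100.

    stored: iterable of (sentence_id, sentence, document_id) tuples.
    Each result row is (target sentence ID, similar sentence ID,
    document ID, similarity), sorted by descending similarity.
    """
    target_words = word_set(target)
    rows = []
    for sentence_id, sentence, document_id in stored:
        score = jaccard(target_words, word_set(sentence))
        if score >= threshold:
            rows.append((target_id, sentence_id, document_id, score))
    return sorted(rows, key=lambda row: row[3], reverse=True)


stored = [
    ("S-1", "Polybrominated diphenyl ether has a content of 10 ppm or less.", "D-1"),
    ("S-2", "This declaration was issued by the manufacturer.", "D-2"),
]
target = "The content of polybrominated diphenyl ether is 20 ppm or less."
for row in search_similar("A-005", target, stored):
    print(row)
```

Only the first stored sentence clears the threshold here, since it shares the substance name, "content", the numeric expression with "ppm", and "or less" with the target sentence.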

In step S210, the related document search module 160 retrieves, from the document storage module 120, the documents that contain the similar sentences obtained. For example, the document ID column 1106 corresponding to each similar sentence ID in the similarity table 1100 may be used.
In step S212, the related document ranking module 170 determines the ranking of the documents based on, for example, the number of similar sentences in each document. For example, the documents illustrated in FIG. 5 (sentences 502, 504, and 506 correspond to sentences 402, 406, and 408) and FIG. 6 (sentences 602, 604, and 606 correspond to sentences 402, 406, and 408) are ranked high, because they contain sentences whose words are similar to those of sentences 402, 406, and 408 illustrated in FIG. 4 and those sentences appear in the same order. Likewise, the document illustrated in FIG. 8 (sentences 802, 806, 810, and 814 correspond to sentences 702, 704, 706, and 708) is ranked high, because it contains sentences whose words are similar to those of sentences 702 through 708 illustrated in FIG. 7 and those sentences appear in the same order.
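Steps S210 and S212 can be sketched together as follows. The text names the number of similar sentences and the agreement of their order of appearance as ranking criteria but does not say how they are combined; the priority used here (match count, then order agreement, then total similarity) is an assumption, as are the function name and the tuple layout of `matches`.

```python
from collections import defaultdict


def rank_documents(matches):
    """Rank documents that contain similar sentences.

    matches: iterable of (target_pos, doc_id, doc_pos, similarity), where
    target_pos is the position of a sentence in the target document and
    doc_pos is the position of its similar sentence in document doc_id.
    Returns document IDs, best first.
    """
    # For each document keep the best match per target sentence.
    best = defaultdict(dict)  # doc_id -> {target_pos: (similarity, doc_pos)}
    for target_pos, doc_id, doc_pos, similarity in matches:
        current = best[doc_id].get(target_pos)
        if current is None or similarity > current[0]:
            best[doc_id][target_pos] = (similarity, doc_pos)

    scored = []
    for doc_id, by_target in best.items():
        # Positions in the document, listed in target-document order.
        doc_positions = [by_target[t][1] for t in sorted(by_target)]
        same_order = all(a < b for a, b in zip(doc_positions, doc_positions[1:]))
        total_similarity = sum(sim for sim, _ in by_target.values())
        scored.append((len(by_target), same_order, total_similarity, doc_id))

    # Assumed priority: more matches, then same order, then higher similarity.
    scored.sort(key=lambda s: s[:3], reverse=True)
    return [doc_id for *_, doc_id in scored]


matches = [
    (0, "D-1", 0, 0.9), (1, "D-1", 1, 0.8), (2, "D-1", 2, 0.7),  # same order
    (0, "D-2", 2, 0.9), (1, "D-2", 1, 0.8), (2, "D-2", 0, 0.7),  # reversed
    (0, "D-3", 0, 0.95),                                          # one match
]
print(rank_documents(matches))
```

In this example D-1 and D-2 both match three target sentences, but only D-1 preserves their order, so it is ranked first; D-3 matches a single sentence and comes last.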

The hardware configuration of the computer on which the program of this embodiment is executed is that of a general-purpose computer, as illustrated in FIG. 12; specifically, a personal computer, a computer that can serve as a server, or the like. As a specific example, a CPU 1201 is used as the processing unit (arithmetic unit), and a RAM 1202, a ROM 1203, and an HD 1204 are used as storage devices. A hard disk, for example, may be used as the HD 1204. The computer consists of the CPU 1201, which executes programs such as the sentence extraction module 130, the similar syntax set generation module 140, the similar sentence search module 150, the related document search module 160, and the related document ranking module 170; the RAM 1202, which stores those programs and data; the ROM 1203, which stores programs such as the one for booting the computer; the HD 1204, which is an auxiliary storage device; an input device 1206 for entering data, such as a keyboard or mouse; an output device 1205 such as a CRT or liquid crystal display; a communication line interface 1207, such as a network interface card, for connecting to a communication network; and a bus 1208 that connects these components and carries data between them. A plurality of such computers may be connected to one another via a network.

Of the embodiments described above, those implemented as a computer program are realized by loading the computer program, which is software, into a system having this hardware configuration, so that the software and the hardware resources operate in cooperation.
The hardware configuration shown in FIG. 12 is only one example. This embodiment is not limited to the configuration shown in FIG. 12; any configuration capable of executing the modules described in this embodiment will do. For example, some modules may be implemented in dedicated hardware (for example, an ASIC), some modules may reside in an external system and be connected by a communication line, and a plurality of the systems shown in FIG. 12 may be connected to one another by communication lines and operate cooperatively. The modules may also be incorporated into devices other than personal computers, such as information appliances, copiers, fax machines, scanners, printers, and multifunction machines (image processing apparatuses having two or more of the functions of a scanner, printer, copier, fax machine, and the like).

The embodiment above was illustrated with documents concerning restriction rules, but other documents may also be targeted. It is applicable to any document that contains multiple sentences of similar syntactic structure where those sentences are to be searched.
The techniques described as prior art may be adopted as the processing of each module in the embodiment above.
In the description of the embodiment above, where a comparison with a predetermined value is expressed as "greater than or equal to", "less than or equal to", "greater than", or "less than", these may be read as "greater than", "less than", "greater than or equal to", and "less than or equal to", respectively, as long as the combination produces no contradiction.

The program described above may be provided stored on a recording medium, or it may be provided by communication means. In that case, for example, the program described above may be regarded as an invention of a "computer-readable recording medium on which a program is recorded".
A "computer-readable recording medium on which a program is recorded" means a computer-readable recording medium on which a program is recorded and which is used for installing, executing, or distributing the program.
Recording media include, for example: digital versatile discs (DVD), namely "DVD-R, DVD-RW, DVD-RAM, etc.", which are standards established by the DVD Forum, and "DVD+R, DVD+RW, etc.", which are standards established by DVD+RW; compact discs (CD), namely read-only memory (CD-ROM), CD recordable (CD-R), CD rewritable (CD-RW), etc.; Blu-ray Disc (registered trademark); magneto-optical disks (MO); flexible disks (FD); magnetic tape; hard disks; read-only memory (ROM); electrically erasable programmable read-only memory (EEPROM); flash memory; and random access memory (RAM).
The program, or a part of it, may be recorded on such a recording medium for storage or distribution. It may also be transmitted by communication over a transmission medium such as a wired network used for a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), the Internet, an intranet, an extranet, or the like, a wireless communication network, or a combination of these, and it may be carried on a carrier wave.
Furthermore, the program may be a part of another program, or may be recorded on a recording medium together with separate programs. It may be divided and recorded across a plurality of recording media. It may also be recorded in any form, such as compressed or encrypted, as long as it can be restored.

DESCRIPTION OF SYMBOLS
110 ... Document reception module
120 ... Document storage module
130 ... Sentence extraction module
140 ... Similar syntax set generation module
150 ... Similar sentence search module
160 ... Related document search module
170 ... Related document ranking module
180 ... Related document output module

Claims

An information processing apparatus comprising:
document storage means for storing documents;
sentence extraction means for extracting sentences from a target document;
sentence set generation means for generating a set of sentences based on the syntax of the sentences extracted by the sentence extraction means;
similar sentence search means for searching the sentences in the documents stored in the document storage means for a second sentence similar to a first sentence in the set of sentences generated by the sentence set generation means; and
related document search means for searching the document storage means for a document related to the target document based on the second sentence found by the similar sentence search means.

The information processing apparatus according to claim 1, further comprising ordering means for ordering the documents retrieved by the related document search means based on the number of second sentences included in a retrieved document, the similarity between a second sentence and the corresponding first sentence, a result of comparing the order of appearance in the target document of the first sentences corresponding to the second sentences with the order of appearance of the second sentences in the retrieved document, or a combination thereof.

The information processing apparatus according to claim 2, wherein the ordering means performs the ordering based on a similarity between a document retrieved by the related document search means and the target document.

An information processing program for causing a computer to function as:
document storage means for storing documents;
sentence extraction means for extracting sentences from a target document;
sentence set generation means for generating a set of sentences based on the syntax of the sentences extracted by the sentence extraction means;
similar sentence search means for searching the sentences in the documents stored in the document storage means for a second sentence similar to a first sentence in the set of sentences generated by the sentence set generation means; and
related document search means for searching the document storage means for a document related to the target document based on the second sentence found by the similar sentence search means.