JP2009151373A

JP2009151373A - Citation relation extraction system, citation relation extraction method, and citation relation extracting program

Info

Publication number: JP2009151373A
Application number: JP2007326365A
Authority: JP
Inventors: Tsutomu Ba; 強馬; Toshiyuki Kamiya; 俊之神谷; Yoshihide Ishiguro; 義英石黒
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2007-12-18
Filing date: 2007-12-18
Publication date: 2009-07-09

Abstract

PROBLEM TO BE SOLVED: To extract implicit citation relations in content, and to improve the precision of implicit citation relation extraction. SOLUTION: The citation relation extraction system, citation relation extraction method, and citation relation extracting program calculate a degree of citation, which indicates how likely it is that a citation was made between contents, based on the difference in creation, update, or reference time between the contents and on the degree of relation between the authors who created, updated, or referenced the contents. They then extract the citation relation between the contents based on the calculated degree of citation. COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a citation relationship extraction system, a citation relationship extraction method, and a citation relationship extraction program for extracting citation relationships between contents.

In the daily operations of an organization such as a company, new content is often created by quoting or drawing on content created by supervisors, colleagues, or subordinates. For example, a report may be created by incorporating part of a project member's material. Clarifying citation relationships directly clarifies the original work and the source, and is therefore important for copyright protection. Discovering citation relationships across an entire content collection also plays an important role in organizing the content and making it easier to search.

For example, research institutions evaluate researchers by measuring the impact factor of their papers based on citation counts, and organize document collections by generating a document network based on citation relationships.

However, unlike papers used in research institutions, in-house content used within an organization such as a company often does not state its citation relationships explicitly. As a result, the benefits of content organization techniques based on citation relationships cannot be fully enjoyed, and the originator of an original idea may not be evaluated correctly.

Hereinafter, a citation relationship that is not explicitly stated in in-house content is referred to as an implicit citation relationship. Unless otherwise noted, the term "citation relationship" below means an implicit citation relationship.

A citation is a relationship between two contents, or between parts of two contents (hereinafter, passages). To cite is to reuse part or all of one content (hereinafter, the citation source content) in another content (hereinafter, the citation destination content), either as is or with partial modification. Hereinafter, part or all of the citation source content is referred to as a citation source passage, and part or all of the citation destination content is referred to as a citation destination passage.

Because citation relationships are an important factor in organizing documents and facilitating search, many methods for discovering citation and reuse relationships among documents and contents have been proposed. For example, Patent Document 1 describes a method that finds identical character strings by string matching and extracts content reuse relationships using surface information such as where and how often the identical strings appear. Patent Document 2 describes a method for automatically generating link relationships based on passage similarity.

Patent Document 1: International Publication No. 2004/034282 pamphlet
Patent Document 2: JP 2000-3372 A

However, using only the methods described in Patent Document 1 and Patent Document 2 may lead to an erroneous determination of citation relationships. For example, papers written at almost the same time by two researchers with no connection to each other can be said to have no citation relationship. With only the methods of Patent Documents 1 and 2, however, the character strings or passages may be judged similar, and a citation relationship may be falsely recognized. In other words, these methods do not take into account that two contents are more likely to have been created independently when their creation times are very far apart, or when their creators are only distantly related within the organization.

Furthermore, in a system that sets access rights, it is generally impossible to cite a confidential document for which one has no access right. However, with the similarity-based methods described in Patent Documents 1 and 2, a citation relationship may be falsely detected even for a confidential document that the author cannot access, if that document happens to contain similar character strings or passages.

The present invention has been made to solve the above problems. Its object is to provide a citation relationship extraction system, a citation relationship extraction method, and a citation relationship extraction program that make it possible to extract implicit citation relationships in content and to improve the accuracy of that extraction.

The citation relationship extraction system according to the present invention comprises: citation degree calculation means for calculating a citation degree, which indicates the degree of possibility that a citation was made between contents, based on the difference in creation, update, or reference time between the contents and on the degree of relationship between the authors who created, updated, or referenced the contents; and citation relationship extraction means for extracting a citation relationship between the contents based on the citation degree calculated by the citation degree calculation means.

The citation relationship extraction method according to the present invention includes: a citation degree calculation step of calculating a citation degree, which indicates the degree of possibility that a citation was made between contents, based on the difference in creation, update, or reference time between the contents and on the degree of relationship between the authors who created, updated, or referenced the contents; and a citation relationship extraction step of extracting a citation relationship between the contents based on the calculated citation degree.

The citation relationship extraction program according to the present invention causes a computer to execute: a citation degree calculation process of calculating a citation degree, which indicates the degree of possibility that a citation was made between contents, based on the difference in creation, update, or reference time between the contents and on the degree of relationship between the authors who created, updated, or referenced the contents; and a citation relationship extraction process of extracting a citation relationship between the contents based on the calculated citation degree.

According to the present invention, citation relationships between contents are extracted in consideration of the difference in creation, update, or reference time between the contents and the degree of relationship between the authors who created, updated, or referenced the contents, so false detections of citation relationships can be excluded. It is therefore possible both to extract implicit citation relationships in content and to improve the accuracy of that extraction.

Embodiment 1.
Hereinafter, a first embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing an example of the configuration of an implicit citation relationship discovery system (citation relationship extraction system) according to the present invention. The present invention relates to, for example, an in-house information system. Within an organization such as a company, the implicit citation relationship discovery system supports the organization and retrieval of in-house content such as electronic documents, and in particular performs processing for discovering implicit citation relationships between in-house contents.

The implicit citation relationship discovery system is applicable not only to electronic documents but also to content such as still images and video. For convenience, however, this embodiment is described using electronic documents as an example of content unless otherwise noted. For example, the citation source content and the citation destination content are referred to as the citation source document and the citation destination document, respectively.

First, the concept of the implicit citation relationship discovery system (citation relationship extraction system) according to the present invention will be described. The present invention introduces a spatio-temporal constraint condition and a citation direction constraint condition on citation relationships, and thereby provides a means of discovering implicit citation relationships that can exclude erroneous determinations.

The citation direction constraint includes a time condition, which requires that the citation source content was created before the citation destination content, and an access right condition, which requires that the author of the citation destination content has the authority to access the citation source content. The implicit citation relationship discovery system uses this citation direction constraint to eliminate content pairs between which no citation relationship can exist, and to determine which content of a pair is the citation source and which is the citation destination.

The spatio-temporal constraint condition includes a time interval condition, which requires that the difference between the creation times of the contents falls within a certain range for a citation relationship to arise, and an author correlation condition in the organizational space, which requires that there is a connection between the authors. These spatio-temporal constraint conditions formalize the following general tendencies in content citation.

(1) Time interval condition: Contents created at the same time are unlikely to have a citation relationship. That is, the closer the creation times of the citation source document and the citation destination document, the more likely it is that no citation relationship exists. As the difference in creation time grows, the content becomes more likely to have been accessed, so it becomes more likely to be cited.

However, when the difference in creation time becomes extremely large, the content tends to be forgotten unless it is exceptionally good, and it becomes less likely to be cited. In other words, as the difference in creation time increases, the possibility of citation between contents first rises and then gradually declines.
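
The rise-then-decline tendency described in (1) can be pictured with any curve that is zero for simultaneous creation, reaches a peak, and then decays. The sketch below uses x·exp(1 − x), where x is the creation-time difference expressed in units of a typical time scale. The specific function and the name `time_factor` are illustrative assumptions and are not taken from the patent text.

```python
import math


def time_factor(unit_time_distance: float) -> float:
    """Illustrative weight in [0, 1] for the time interval condition.

    Assumption: the curve x * exp(1 - x) is chosen only because it is
    0 at x = 0, peaks at exactly 1.0 when x = 1, and decays afterwards.
    """
    if unit_time_distance < 0:
        raise ValueError("time distance must be non-negative")
    return unit_time_distance * math.exp(1.0 - unit_time_distance)


if __name__ == "__main__":
    for x in (0.0, 0.5, 1.0, 2.0, 5.0):
        print(f"x={x:.1f} -> weight={time_factor(x):.3f}")
```

With this choice, documents created at the same moment get a weight of zero, and documents separated by a very long interval get a weight close to zero.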

(2) Organization correlation condition: In the organizational space, authors with a strong connection are close to each other and are therefore likely to communicate closely. They consequently have a good understanding of each other's thinking and of the content each has created, and are likely to cite that content.

The implicit citation relationship discovery system extracts citation relationships between contents in accordance with the ideas above.

As shown in FIG. 1, the implicit citation relationship discovery system includes a controller 100, a document database 101, organization configuration table storage means 102, an access database 103, a virtual citation database 104, order relationship estimation means 201, and citation degree calculation means 202.

Specifically, the implicit citation relationship discovery system is realized by an information processing apparatus, such as a personal computer, that operates according to a program. The system may be realized by a single information processing apparatus or by a plurality of information processing apparatuses. For example, it may be realized using the information processing apparatuses that implement a document sharing system, a personnel management system, or similar systems installed in an organization such as a company.

The document database 101 is a database that stores in-house documents (electronic documents), and is specifically realized by a database device such as a magnetic disk device or an optical disk device. The document database 101 may also be realized by one or more database servers that implement a document sharing system within an organization such as a company.

As information about a document, the document database 101 stores a tuple of document ID, file path, author ID, creation time, access level, and document type. As information about the passages contained in a document, it stores a tuple of passage ID, document ID, and passage. As information on the citation degree calculation parameters for each content type, it stores a pair consisting of the document type and the conversion parameter for the unit time distance.
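
The three kinds of records listed above could be modeled as follows. This is a minimal sketch: the class names, field names, and Python types are assumptions chosen for illustration, and only the field lists themselves come from the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DocumentRecord:
    """Information about a document."""
    document_id: str
    file_path: str
    author_id: str
    created_at: datetime
    access_level: int      # level required to access the document
    document_type: str     # e.g. "weekly_report" (hypothetical label)


@dataclass(frozen=True)
class PassageRecord:
    """Information about a passage contained in a document."""
    passage_id: str
    document_id: str
    text: str


@dataclass(frozen=True)
class CitationParameter:
    """Citation degree calculation parameter for one document type."""
    document_type: str
    unit_time: timedelta   # conversion parameter for the unit time distance


if __name__ == "__main__":
    doc = DocumentRecord("d1", "/docs/d1.txt", "a1",
                         datetime(2007, 12, 18), 3, "weekly_report")
    passage = PassageRecord("p1", doc.document_id, "First paragraph.")
    print(doc, passage, sep="\n")
```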

The document database 101 stores the "information about a document", the "information about passages", and the "calculation parameter information" in advance, for example at the time a document is registered in the document sharing system. For example, it stores the author name (for example, the author ID) identified from the ID and password entered at document registration as one item of the "information about a document".

A passage is a part of a content (for example, an electronic document) that forms a single semantic unit. For example, when the content is a document, a passage is a paragraph.

The access level is information indicating the level of authority required to access the document. For example, the access level is represented by a number from 0 to 10. When registering an electronic document (content) in the document database 101, the implicit citation relationship discovery system sets an appropriate access level in accordance with an operation by the author who created the document. A user operates a user terminal or the like to access the documents stored in the document database 101 in accordance with the access level granted to that user.

The creation time of a document is the time at which the electronic document (content) was registered in the document database 101.

Of the information that makes up the citation degree calculation parameters for each content type, the document type is information indicating a classification according to the purpose and use of the document, such as memo, weekly report, monthly report, report, or paper.

The conversion parameter for the unit time distance is the time (validity period) from the creation of a document of a given type until the point at which it is most likely to be referenced. For example, if a weekly report is likely to be referenced from immediately after its creation until about one month later, and is less likely to be referenced after that, the value of the conversion parameter is one month. Similarly, for a monthly report, for example, the value of the conversion parameter is one year. The conversion parameters determined in advance for each document type in this way are assumed to be registered in the document database 101.

The organization configuration table storage means 102 is specifically realized by a storage device such as a magnetic disk device or an optical disk device. It stores an organization configuration table that includes author information, the adjacency matrix of the organization configuration graph, and the update time of the organization configuration graph. The organization configuration table is created, for example, by the personnel department of an organization such as a company and is registered in the organization configuration table storage means 102 in advance. Alternatively, the implicit citation relationship discovery system may acquire the organization configuration table from the personnel management system of the organization and then execute its processing.

FIG. 2 is an explanatory diagram showing an example of an organization configuration graph, the adjacency matrix for that graph, and author information. FIG. 2(a) shows an example of the organization configuration graph. As shown in FIG. 2(b), the organization configuration table storage means 102 stores pairs of author and author ID as the author information included in the organization configuration table. As shown in FIG. 2(c), it stores pairs of graph ID and adjacency matrix as the adjacency graph expressed using the author IDs.
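
To make the adjacency matrix representation concrete, the sketch below stores a small organization graph as a 0/1 matrix and measures how far apart two authors are by counting edges with a breadth-first search. The example graph, the use of author IDs as 0-based row indices, and the use of hop count as a measure of relationship are all assumptions made for illustration. This part of the description says only that an adjacency matrix is stored.

```python
from collections import deque
from typing import List, Optional

# Hypothetical organization: author 0 manages authors 1 and 2,
# and author 1 manages author 3. ADJ[i][j] == 1 means i and j are linked.
ADJ = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
]


def hop_distance(adjacency: List[List[int]], start: int, goal: int) -> Optional[int]:
    """Return the smallest number of edges between two authors, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor, linked in enumerate(adjacency[node]):
            if linked and neighbor not in seen:
                if neighbor == goal:
                    return dist + 1
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # the two authors are not connected in the graph


if __name__ == "__main__":
    print(hop_distance(ADJ, 2, 3))  # path 2 -> 0 -> 1 -> 3
```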

As the update time of the organization configuration graph, the organization configuration table storage means 102 stores pairs of graph ID and update time. Using this update time information, the implicit citation relationship discovery system can manage updates to the organization graph that result from reorganizations and personnel changes.

The access database 103 is a database that stores access right information for documents, and is specifically realized by a database device such as a magnetic disk device or an optical disk device. The access right information is created in advance by the personnel department or system administration department of an organization such as a company and is registered in the access database 103 in advance. Alternatively, the implicit citation relationship discovery system may acquire the access right information from the personnel management system or access right management system of the organization and then execute its processing.

The access database 103 stores tuples of author ID, update time, and access level as access right information. The access level here is information indicating the level of access authority held by the author identified by the author ID. For example, the access level is represented by a number from 0 to 10. In this embodiment, therefore, only an author whose access level is equal to or higher than the access level required for a document can access that document. The update time is the time at which the author's access level changed as a result of a reorganization or personnel change. Identifying an access level therefore requires using the author ID together with the update time.
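
Because an author's access level can change over time, a lookup needs both the author ID and a point in time. The following sketch picks the most recent entry whose update time is not later than the time of interest. The in-memory `history` layout and the function names are hypothetical, and treating an author with no applicable entry as having no access is an assumption.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Dict, List, Optional, Tuple

# author_id -> list of (update_time, access_level), sorted by update_time
History = Dict[str, List[Tuple[datetime, int]]]


def access_level_at(history: History, author_id: str, when: datetime) -> Optional[int]:
    """Return the access level in effect for the author at `when`, if any."""
    entries = history.get(author_id, [])
    times = [update_time for update_time, _ in entries]
    index = bisect_right(times, when)  # entries[:index] have update_time <= when
    if index == 0:
        return None  # no entry applies yet (or the author is unknown)
    return entries[index - 1][1]


def can_access(author_level: Optional[int], document_level: int) -> bool:
    """An author may access a document only at an equal or higher level."""
    return author_level is not None and author_level >= document_level


if __name__ == "__main__":
    history = {"a1": [(datetime(2007, 1, 1), 3), (datetime(2007, 10, 1), 7)]}
    level = access_level_at(history, "a1", datetime(2007, 6, 1))
    print(level, can_access(level, 5))
```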

The virtual citation database 104 is a database that stores the citation relationships extracted by the implicit citation relationship discovery system, and is specifically realized by a database device such as a magnetic disk device or an optical disk device. As the extraction result for passage citation relationships, the virtual citation database 104 stores tuples of citation source document ID, citation source passage ID, citation destination document ID, citation destination passage ID, and citation degree.

The controller 100 is specifically realized by the CPU of an information processing apparatus that operates according to a program. The controller 100 has a function of extracting citation relationships between contents based on the citation degree calculated by the citation degree calculation means 202 and the order relationship estimated by the order relationship estimation means 201.

In this embodiment, the controller 100 extracts implicit citation relationships for all or some of the documents stored in the document database 101, using the estimation result of the order relationship estimation means 201 (the determination result of the estimation process) and the calculation result of the citation degree calculation means 202. The controller 100 also stores the extracted implicit citation relationships between contents (for example, electronic documents) in the virtual citation database 104.

The order relationship estimation means 201 is specifically realized by the CPU of an information processing apparatus that operates according to a program. It has a function of estimating the order relationship between content that can be a citation source and content that can be a citation destination. As shown in FIG. 1, the order relationship estimation means 201 includes access right determination means 2011 and time order determination means 2012.

In this embodiment, the order relationship estimation means 201 estimates, based on the citation direction constraint, the order relationship between the citation source and the citation destination of a document pair with a high citation degree. That is, it estimates which document is the citation source and which is the citation destination based on the access right condition, which concerns whether the citation source can be accessed, and the time condition, which requires that the citation source was created before the citation destination.

Of the means included in the order relationship estimation means 201, the access right determination means 2011 has a function of checking the access right condition and estimating the order relationship between the citation source and the citation destination. It estimates the order relationship between content that can be a citation source and content that can be a citation destination based on the access right level set for the content and the access right level set for the author. When the access right determination means 2011 determines that the access right level set for the author is equal to or higher than the access right level set for the content, it estimates that the content can be a citation source.

The time order determination means 2012 has a function of checking the time condition and estimating the order relationship between the citation source and the citation destination. It estimates the order relationship between content that can be a citation source and content that can be a citation destination based on the creation, update, or reference time of the content (the creation time in this embodiment). Specifically, it estimates that the content with the older creation, update, or reference time can be the citation source, and that the content with the newer creation, update, or reference time can be the citation destination.
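
The two checks performed by the order relationship estimation means can be combined as in the sketch below: the older document becomes the candidate citation source, and the pair is rejected if the author of the newer document lacks the access level required for the older one. The `Doc` class, its field names, and the decision to reject pairs with identical creation times are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class Doc:
    doc_id: str
    created_at: datetime
    access_level: int         # level required to access this document
    author_access_level: int  # level held by this document's author


def estimate_order(a: Doc, b: Doc) -> Optional[Tuple[str, str]]:
    """Return (source_id, destination_id), or None if no citation is possible."""
    # Time condition: the citation source must be the older document.
    if a.created_at == b.created_at:
        return None  # assumption: simultaneous documents cannot cite each other
    source, destination = (a, b) if a.created_at < b.created_at else (b, a)
    # Access right condition: the destination's author must be able to
    # access the source document.
    if destination.author_access_level < source.access_level:
        return None
    return (source.doc_id, destination.doc_id)


if __name__ == "__main__":
    older = Doc("d1", datetime(2007, 1, 1), access_level=5, author_access_level=3)
    newer = Doc("d2", datetime(2007, 3, 1), access_level=2, author_access_level=6)
    print(estimate_order(newer, older))
```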

The citation degree calculation means 202 is specifically realized by the CPU of an information processing apparatus that operates according to a program. It has a function of calculating a citation degree, which indicates the degree of possibility that a passage in an electronic document has been cited. The citation degree calculation means 202 calculates the citation degree between contents based on the difference in creation, update, or reference time between the contents and on the degree of relationship between the authors who created, updated, or referenced the contents. As shown in FIG. 1, the citation degree calculation means 202 includes similarity calculation means 2021, time distance calculation means 2022, author distance calculation means 2023, and integrated calculation means 2024.

類似度計算手段２０２１は、コンテンツ間の類似度を算出する機能を備える。本実施形態では、類似度計算手段２０２１は、異なる文書に含まれるパッセージ間の類似度を計算する。例えば、類似度計算手段２０２１は、ベクトル空間モデルに基づいて計算されるキーワードベクトルの余弦を用いて、文書中のパッセージの類似度を求めることができる。なお、ベクトル空間モデルに基づいて計算されるキーワードベクトルの余弦を用いて、文書中のパッセージの類似度を求める方法は、例えば、文献Ａ「徳永健伸、”情報検索と言語処理”、東京大学出版会、ｐｐ．３１，４１−４３」に記載されている。 The similarity calculation unit 2021 has a function of calculating the similarity between contents. In this embodiment, the similarity calculation unit 2021 calculates the similarity between passages included in different documents. For example, the similarity calculation unit 2021 can obtain the similarity of passages in documents using the cosine of keyword vectors calculated based on a vector space model. A method of obtaining passage similarity from the cosine of keyword vectors calculated based on a vector space model is described in, for example, Document A: Takenobu Tokunaga, "Information Retrieval and Language Processing", University of Tokyo Press, pp. 31, 41-43.
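The cosine of keyword vectors mentioned above can be illustrated with a minimal Python sketch. The whitespace tokenization and raw term counts are assumptions made for illustration only; the description does not specify how the keyword vectors are built, and the function names are hypothetical.

```python
import math
from collections import Counter


def keyword_vector(passage: str) -> Counter:
    # Assumption: lowercase whitespace tokens with raw counts as weights.
    # The source only says "keyword vector based on a vector space model".
    return Counter(passage.lower().split())


def cosine_similarity(p1: str, p2: str) -> float:
    """Cosine of the angle between the keyword vectors of two passages."""
    v1, v2 = keyword_vector(p1), keyword_vector(p2)
    # Dot product over the terms that occur in both passages.
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an empty passage is treated as similar to nothing
    return dot / (norm1 * norm2)
```

Passages with identical keyword counts score close to 1, and passages with no keyword in common score 0.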

なお、類似度計算手段２０２１は、文書以外のコンテンツの類似度を求める場合には、そのコンテンツの種類に応じて用意された類似度計算方式を用いて、コンテンツ間の類似度を計算する。 Note that the similarity calculation unit 2021 calculates the similarity between contents using a similarity calculation method prepared according to the type of content when obtaining the similarity of content other than a document.

時間距離計算手段２０２２は、コンテンツ間の作成、更新又は参照時間の差を示す時間距離を算出する機能を備える。本実施形態では、時間距離計算手段２０２２は、２つの文書の作成時刻の差の絶対値を計算する。なお、時間距離計算手段２０２２は、例えば、２つの文書の更新時刻や参照時刻の差の絶対値を計算してもよい。 The time distance calculation unit 2022 has a function of calculating a time distance indicating the difference in creation, update, or reference time between contents. In this embodiment, the time distance calculation unit 2022 calculates the absolute value of the difference between the creation times of two documents. Note that the time distance calculation unit 2022 may instead calculate, for example, the absolute value of the difference between the update times or between the reference times of the two documents.

また、時間距離計算手段２０２２は、求めた作成時刻の差の絶対値を、単位時間距離に換算する機能を備える。なお、単位時間距離に換算するためのパラメータは、文書のタイプ毎に決められ、予め文書データベース１０１に格納されている。そして、時間距離計算手段２０２２は、単位距離に換算するためのパラメータを文書データベース１０１から取得（抽出）し、抽出したパラメータを用いて正規化することによって、作成時刻の差の絶対値を単位時間距離に換算する。すなわち、時間距離計算手段２０２２は、コンテンツタイプに応じたコンテンツの作成、更新又は参照時間（本実施形態では作成時間）の差を正規化するための正規化パラメータを用いて、正規化した時間距離を算出する。 In addition, the time distance calculation unit 2022 has a function of converting the absolute value of the obtained difference in creation time into a unit time distance. The parameters for conversion to a unit time distance are determined for each document type and stored in the document database 101 in advance. The time distance calculation unit 2022 acquires (extracts) the conversion parameter from the document database 101 and normalizes with it, thereby converting the absolute value of the difference in creation time into a unit time distance. That is, the time distance calculation unit 2022 calculates a normalized time distance using a normalization parameter, chosen according to the content type, for normalizing the difference in creation, update, or reference time (the creation time in the present embodiment).

著者距離計算手段２０２３は、コンテンツを作成、更新又は参照した著者間の関係の度合いを示す著者距離を算出する機能を備える。本実施形態では、著者距離計算手段２０２３は、組織構成グラフにおける文書の著者に対応するノード間の最短パスの長さを、著者距離として計算する。 The author distance calculation means 2023 has a function of calculating an author distance indicating the degree of relationship between authors who created, updated, or referred to content. In the present embodiment, the author distance calculation unit 2023 calculates the length of the shortest path between nodes corresponding to the document author in the organization structure graph as the author distance.
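As an illustration of the shortest-path author distance described above, the following Python sketch runs a breadth-first search over an organization structure graph. The adjacency-dictionary representation, unit-length edges, and the `None` result for unconnected authors are assumptions; the description does not say how the graph is stored or how disconnected authors are handled.

```python
from collections import deque
from typing import Dict, Iterable, Optional


def author_distance(org_graph: Dict[str, Iterable[str]],
                    author_a: str, author_b: str) -> Optional[int]:
    """Length of the shortest path between two author nodes.

    org_graph is a hypothetical adjacency mapping; every edge is listed
    in both directions and counts as length 1 (an assumption).
    """
    if author_a == author_b:
        return 0
    seen = {author_a}
    queue = deque([(author_a, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in org_graph.get(node, ()):
            if neighbor == author_b:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # the two authors are not connected in the graph
```

With a graph in which two members report to the same manager, the two members are at distance 2 from each other.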

統合計算手段２０２４は、時間距離計算手段２０２２が算出した時間距離と、著者距離計算手段２０２３が算出した著者距離と、類似度計算手段２０２１が算出した類似度とを統合した引用度を算出する機能を備える。本実施形態では、統合計算手段２０２４は、類似度計算手段２０２１が求めたパッセージ間の類似度と、著者距離計算手段２０２３が求めた著者距離と、時間距離計算手段２０２２が求めた時間距離とを用いて、文書中のパッセージの引用度を計算する。つまり、引用度は、時空間制約条件に基づく、類似度と、時間距離と、著者距離との関数である。 The integrated calculation unit 2024 has a function of calculating a citation degree that integrates the time distance calculated by the time distance calculation unit 2022, the author distance calculated by the author distance calculation unit 2023, and the similarity calculated by the similarity calculation unit 2021. In the present embodiment, the integrated calculation unit 2024 calculates the citation degree of passages in the documents using the similarity between passages obtained by the similarity calculation unit 2021, the author distance obtained by the author distance calculation unit 2023, and the time distance obtained by the time distance calculation unit 2022. That is, the citation degree is a function of the similarity, the time distance, and the author distance, based on the spatiotemporal constraint conditions.

なお、本実施形態において、暗黙引用関係発見システムを実現する情報処理装置の記憶装置（図示せず）は、コンテンツ（例えば、電子文書）間の暗黙的引用関係を発見するための各種プログラムを記憶している。例えば、暗黙引用関係発見システムを実現する情報処理装置の記憶装置は、コンピュータに、コンテンツ間の作成、更新又は参照時間の差と、コンテンツを作成、更新又は参照した著者間の関係の度合いとに基づいて、コンテンツ間で引用が行われた可能性の度合いを示す引用度を算出する引用度算出処理と、算出した引用度に基づいて、コンテンツ間の引用関係を抽出する引用関係抽出処理とを実行させるための暗黙引用関係発見用プログラム（引用関係抽出用プログラム）を記憶している。 In this embodiment, the storage device (not shown) of the information processing apparatus that implements the implicit citation relationship discovery system stores various programs for discovering implicit citation relationships between contents (for example, electronic documents). For example, the storage device stores an implicit citation relationship discovery program (citation relationship extraction program) that causes a computer to execute a citation degree calculation process and a citation relationship extraction process. The citation degree calculation process calculates a citation degree, indicating the likelihood that a citation was made between contents, based on the difference in creation, update, or reference time between the contents and on the degree of relationship between the authors who created, updated, or referred to the contents. The citation relationship extraction process extracts citation relationships between contents based on the calculated citation degree.

次に、動作について説明する。図３は、暗黙引用関係発見システムがコンテンツ（電子文書中のパッセージ）間に含まれる暗黙引用関係を抽出する処理の一例を示す流れ図である。暗黙引用関係発見システムは、所定のタイミングで、図３に示す暗黙引用関係の抽出処理を開始する。 Next, the operation will be described. FIG. 3 is a flowchart illustrating an example of a process in which the implicit citation relationship discovery system extracts an implicit citation relationship included between contents (passages in an electronic document). The implicit citation relationship discovery system starts the implicit citation relationship extraction process shown in FIG. 3 at a predetermined timing.

例えば、暗黙引用関係発見システムは、システム管理者の指示操作をトリガとして、暗黙引用関係の抽出処理を開始する。また、例えば、暗黙引用関係発見システムは、夜間バッチ等を用いて所定時間毎に、暗黙引用関係の抽出処理を実行してもよい。また、例えば、暗黙引用関係発見システムは、文書データベース１０１に新規の電子文書が登録されたことに基づいて、暗黙引用関係の抽出処理を開始してもよい。さらに、例えば、暗黙引用関係発見システムは、文書データベース１０１に所定量の電子文書が登録されたことに基づいて、暗黙引用関係の抽出処理を開始してもよい。 For example, the implicit citation relationship discovery system starts the extraction process of the implicit citation relationship with the instruction operation of the system administrator as a trigger. Further, for example, the implicit citation relationship discovery system may execute an implicit citation relationship extraction process at predetermined time intervals using a nighttime batch or the like. For example, the implicit citation relationship discovery system may start the extraction process of the implicit citation relationship based on the registration of a new electronic document in the document database 101. Further, for example, the implicit citation relationship discovery system may start the implicit citation relationship extraction process based on a predetermined amount of electronic documents registered in the document database 101.

まず、コントローラ１００は、文書データベース１０１から処理対象文書の集合Ｄを取得（抽出）する（ステップＳ１０１）。 First, the controller 100 acquires (extracts) a set D of processing target documents from the document database 101 (step S101).

コントローラ１００は、抽出した文書集合Ｄに含まれる文書ｄｉ（０＜ｉ＜Ｄ．ｃｏｕｎｔ）を対象に、以下に示すステップＳ１０３〜Ｓ１１２の処理を繰り返し実行する（ステップＳ１０２）。なお、ｉは文書の順番を示し、Ｄ．ｃｏｕｎｔは文書の総数を示している。 The controller 100 repeatedly executes the following steps S103 to S112 for each document di (0 < i < D.count) included in the extracted document set D (step S102). Here, i indicates the position of the document in the set, and D.count indicates the total number of documents.

文書ｄｉに対するループ処理において、コントローラ１００は、文書集合Ｄに含まれる文書ｄｊ（ｉ＋１≦ｊ≦Ｄ．ｃｏｕｎｔ）を対象に、以下に示すステップＳ１０４〜Ｓ１１１の処理を繰り返し実行する（ステップＳ１０３）。なお、ｊは文書の順番を示している。 In the loop process for the document di, the controller 100 repeatedly executes the following processes in steps S104 to S111 for the document dj (i + 1 ≦ j ≦ D.count) included in the document set D (step S103). Note that j indicates the document order.

まず、コントローラ１００は、順序関係推定手段２０１に、処理対象となる文書ｄｉと文書ｄｊとを渡す（出力する）。順序関係推定手段２０１は、アクセス権判断手段２０１１と時間順序判断手段２０１２とを用いて、文書ｄｉと文書ｄｊとの引用の順序関係を推定する（ステップＳ１０４）。そして、順序関係の推定結果をコントローラ１００に返す（出力する）。 First, the controller 100 passes (outputs) the documents di and dj to be processed to the order relation estimation unit 201. The order relation estimation unit 201 estimates the citation order relation between the document di and the document dj using the access right determination unit 2011 and the time order determination unit 2012 (step S104). The order relation estimation unit 201 then returns (outputs) the estimation result to the controller 100.

コントローラ１００は、順序関係推定手段２０１から順序関係の推定結果を受け取る（入力する）。そして、コントローラ１００は、入力した推定結果が文書ｄｉと文書ｄｊとに引用の順序関係があることを示しているか否かを判断する（ステップＳ１０５）。文書ｄｉと文書ｄｊとに引用の順序関係がないという推定結果であれば、コントローラ１００は、そのままステップＳ１１２にジャンプ（移行）する。そして、ステップＳ１０３〜Ｓ１１２のループ処理を繰り返す。 The controller 100 receives (inputs) an order relation estimation result from the order relation estimation unit 201. Then, the controller 100 determines whether or not the input estimation result indicates that there is a citation order relationship between the document di and the document dj (step S105). If the estimation result indicates that there is no citation order relationship between the document di and the document dj, the controller 100 jumps (transfers) to step S112 as it is. Then, the loop processing of steps S103 to S112 is repeated.

文書ｄｉと文書ｄｊとに引用の順序関係があるという推定結果であれば、コントローラ１００は、文書ｄｉ及び文書ｄｊとともに、文書ｄｉと文書ｄｊとの引用の順序関係を引用度計算手段２０２に渡す（出力する）。 If the estimation result indicates that there is a citation order relation between the document di and the document dj, the controller 100 passes (outputs) the documents di and dj, together with the citation order relation between them, to the citation degree calculation unit 202.

次いで、引用度計算手段２０２は、時間距離計算手段２０２２を用いて、文書ｄｉと文書ｄｊとの時間距離を計算する（ステップＳ１０６）。また、同時に、引用度計算手段２０２は、著者距離計算手段２０２３を用いて、文書ｄｉの著者と文書ｄｊの著者との著者距離を計算する（ステップＳ１０７）。さらに、同時に、引用度計算手段２０２は、類似度計算手段２０２１を用いて、文書ｄｉ及び文書ｄｊに含まれるパッセージの類似度を計算する（ステップＳ１０８）。 Next, the citation degree calculating unit 202 calculates the time distance between the document di and the document dj using the time distance calculating unit 2022 (step S106). At the same time, the citation level calculation means 202 uses the author distance calculation means 2023 to calculate the author distance between the author of the document di and the author of the document dj (step S107). Further, at the same time, the citation level calculation unit 202 calculates the similarity level of the passages included in the document di and the document dj using the similarity level calculation unit 2021 (step S108).

なお、ステップＳ１０６〜Ｓ１０８の処理を実行する順番は問わない。例えば、引用度計算手段２０２は、ステップＳ１０６の時間距離の算出処理を実行した後にステップＳ１０７，Ｓ１０８の処理を実行してもよいし、ステップＳ１０７の著者距離の算出処理を実行した後にステップＳ１０６，Ｓ１０８の処理を実行してもよい。また、引用度計算手段２０２は、ステップＳ１０８の類似度の算出処理を実行した後にステップＳ１０６，Ｓ１０７の処理を実行してもよく、タイムシェアリングによりステップＳ１０６，Ｓ１０７，Ｓ１０８の処理を並行して実行してもよい。 Note that steps S106 to S108 may be executed in any order. For example, the citation degree calculation unit 202 may execute steps S107 and S108 after the time distance calculation of step S106, or may execute steps S106 and S108 after the author distance calculation of step S107. The citation degree calculation unit 202 may also execute steps S106 and S107 after the similarity calculation of step S108, or may execute steps S106, S107, and S108 in parallel by time sharing.

次いで、引用度計算手段２０２は、ステップＳ１０６で計算した時間距離と、ステップＳ１０７で計算した著者距離と、ステップＳ１０８で計算したパッセージ類似度とを用いて、引用度ｃを求める。この場合、引用度計算手段２０２は、統合計算手段２０２４を利用して、文書ｄｉ及び文書ｄｊに含まれる２つのパッセージの組み合わせの引用度ｃを計算する（ステップＳ１０９）。そして、引用度計算手段２０２は、引用度ｃの計算結果を、コントローラ１００に渡す（出力する）。 Next, the citation degree calculating means 202 obtains the citation degree c using the time distance calculated in step S106, the author distance calculated in step S107, and the passage similarity calculated in step S108. In this case, the citation level calculation unit 202 uses the integrated calculation unit 2024 to calculate the citation level c of the combination of two passages included in the document di and the document dj (step S109). Then, the citation level calculation means 202 passes (outputs) the calculation result of the citation level c to the controller 100.

次いで、コントローラ１００は、ステップＳ１０９で計算した引用度ｃの値を引用度計算手段２０２から受け取る（入力する）。そして、コントローラ１００は、入力した引用度ｃと予め定義されている閾値との比較を行い、引用度ｃの値が所定の閾値より大きいか否かを判断する（ステップＳ１１０）。 Next, the controller 100 receives (inputs) the value of the citation degree c calculated in step S109 from the citation degree calculation means 202. Then, the controller 100 compares the input citation level c with a predefined threshold value, and determines whether or not the value of the citation level c is greater than a predetermined threshold value (step S110).

コントローラ１００は、引用度ｃが所定の閾値より大きいと判断した場合には、文書ｄｉと文書ｄｊとに含まれるパッセージの組み合わせに引用関係があると判断する。そして、コントローラ１００は、引用関係があると判断した判定結果を、仮想引用データベース１０４に登録する（ステップＳ１１１）。 When the controller 100 determines that the citation degree c is greater than a predetermined threshold, the controller 100 determines that the combination of passages included in the document di and the document dj has a citation relationship. Then, the controller 100 registers the determination result determined to have a citation relationship in the virtual citation database 104 (step S111).

次いで、コントローラ１００は、文書の順番を示す係数ｊに１加算（ｊ＝ｊ＋１）して、ステップＳ１０３に移行する。すなわち、次の文書ｄｊについて、ステップＳ１０４〜Ｓ１１１の処理を行う。 Next, the controller 100 adds 1 to the coefficient j indicating the document order (j = j + 1), and proceeds to step S103. That is, the processes of steps S104 to S111 are performed for the next document dj.

ステップＳ１１３では、コントローラ１００は、文書の順番を示す係数ｉに１加算（ｉ＝ｉ＋１）して、ステップＳ１０２に移行する。すなわち、次の文書ｄｉについて、ステップＳ１０３〜Ｓ１１２の処理を行う。 In step S113, the controller 100 adds 1 to the coefficient i indicating the document order (i = i + 1), and proceeds to step S102. That is, the processing of steps S103 to S112 is performed for the next document di.

以上のように、文書データベース１０１が蓄積する全ての電子文書についてステップＳ１０３〜Ｓ１１２の処理が繰り返し実行されることによって、全ての電子文書に含まれるパッセージの組み合わせについて引用関係が抽出される。そして、引用関係があると判断された全てのパッセージの判定結果が仮想引用データベース１０４に格納される。 As described above, the citation relationship is extracted for the combination of passages included in all electronic documents by repeatedly executing the processes of steps S103 to S112 for all the electronic documents stored in the document database 101. Then, the determination results of all passages determined to have a citation relationship are stored in the virtual citation database 104.
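The overall loop of FIG. 3 can be summarized in a short Python sketch. It is only an outline under stated assumptions: `estimate_order` and `citation_degrees` are hypothetical callbacks standing in for the order relation estimation unit 201 and the citation degree calculation unit 202, a plain list stands in for the virtual citation database 104, and `itertools.combinations` is used to visit each unordered document pair once rather than reproducing the index ranges given in the text.

```python
from itertools import combinations
from typing import Callable, Iterable, List, Optional, Sequence, Tuple

Order = Optional[Tuple[str, str]]          # (source id, destination id) or None
Degree = Tuple[str, str, float]            # (source passage, destination passage, c)


def extract_citations(doc_ids: Sequence[str],
                      estimate_order: Callable[[str, str], Order],
                      citation_degrees: Callable[[str, str], Iterable[Degree]],
                      threshold: float) -> List[Tuple[str, str, str, str, float]]:
    results = []
    for di, dj in combinations(doc_ids, 2):        # S102 / S103: every document pair
        order = estimate_order(di, dj)             # S104: order relation estimation
        if order is None:                          # S105: no citation order relation
            continue
        source, dest = order
        for p_src, p_dst, c in citation_degrees(source, dest):   # S106 to S109
            if c > threshold:                      # S110: compare with the threshold
                results.append((source, p_src, dest, p_dst, c))  # S111: register
    return results
```

Passing the two units in as callbacks keeps the loop independent of how the order relation and the citation degree are actually computed.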

なお、仮想引用データベース１０４に格納された判定結果は、企業等の組織内の各部門からの要求に応じて提供され利用することができる。例えば、暗黙引用関係発見システムは、組織内のマネジメント部門や人事部門の端末からの要求に応じて、仮想引用データベース１０４から引用関係の判定結果を抽出して送信する。そして、マネジメント部門や人事部門の端末において受信した引用関係の判定結果を表示することによって、組織の業績評価や人事評価に利用することができる。 The determination result stored in the virtual citation database 104 can be provided and used in response to a request from each department in an organization such as a company. For example, the implicit citation relationship discovery system extracts and transmits a citation relationship determination result from the virtual citation database 104 in response to a request from a management department or personnel department terminal in the organization. Then, by displaying the judgment result of the citation relationship received at the terminal of the management department or the personnel department, it can be used for performance evaluation or personnel evaluation of the organization.

以下、図３に示した暗黙引用関係の抽出処理に含まれるそれぞれのステップについて説明する。 Hereinafter, each step included in the extraction process of the implicit citation relationship illustrated in FIG. 3 will be described.

（１）処理対象文書集合の取得処理（ステップＳ１０１）：ステップＳ１０１では、コントローラ１００は、文書データベース１０１にアクセスして、処理対象となる文書のＩＤ集合を文書データベース１０１から取得（抽出）する。一般的には、コントローラ１００は、文書データベース１０１に蓄積されている全ての文書を対象として、文書ＩＤを抽出する。なお、コントローラ１００は、条件を指定して、文書データベース１０１に蓄積されている一部の文書を対象として、文書ＩＤを抽出することも可能である。 (1) Processing for Acquiring Document Set for Processing (Step S101): In step S101, the controller 100 accesses the document database 101 and acquires (extracts) an ID set of documents to be processed from the document database 101. In general, the controller 100 extracts document IDs for all documents stored in the document database 101. The controller 100 can also specify a condition and extract a document ID for a part of documents stored in the document database 101.

なお、文書ＩＤと文書とは一対一に対応しているため、以下では、特別の説明がない限り、この文書ＩＤの集合を文書集合Ｄという。 Since document IDs and documents correspond one-to-one, hereinafter, a set of document IDs is referred to as a document set D unless otherwise specified.

（２）順序関係の推定処理（ステップＳ１０４）：引用には、順序関係がある。つまり、引用元となりうる文書と引用先となりうる文書とは、予め決まっている。本実施形態では、順序関係推定手段２０１は、このような引用元となりうる文書と引用先となりうる文書との方向性の制約である引用方向制約を導入して、引用の順序関係を決める。引用方向制約は、時間条件と、アクセス権条件とを含む。時間条件とは、引用元の文書は、引用先の文書が作成されるより以前に作成されている必要があるという条件である。また、アクセス権条件とは、引用先の文書の著者は引用元の文書にアクセスできる（アクセス権が与えられている）という条件である。 (2) Order relation estimation process (step S104): Citation has a direction. That is, which document can be the citation source (the document being cited) and which can be the citation destination (the document that cites it) is determined in advance. In this embodiment, the order relation estimation unit 201 introduces a citation direction constraint, which restricts the direction between a possible citation source and a possible citation destination, and uses it to determine the citation order relation. The citation direction constraint includes a time condition and an access right condition. The time condition requires that the citation source document was created before the citation destination document. The access right condition requires that the author of the citation destination document can access the citation source document (that is, has been granted access rights to it).

順序関係推定手段２０１は、文書データベース１０１にアクセスして、文書の作成時刻を文書データベース１０１から取得（抽出）して比較することによって、時間条件をチェックできる。なお、時間条件のチェック処理は、順序関係推定手段２０１の時間順序判断手段２０１２によって実行される。 The order relation estimation unit 201 can check the time condition by accessing the document database 101, obtaining (extracting) the creation time of the document from the document database 101, and comparing it. The time condition check process is executed by the time order determination unit 2012 of the order relation estimation unit 201.

時間順序判断手段２０１２は、文書の作成時刻を、順序関係推定手段２０１を通して文書データベース１０１から取得（抽出）する。そして、時間順序判断手段２０１２は、抽出した各文書の作成時刻を比較して、引用先となりうる文書と、引用元となりうる文書とを判断する。なお、時間順序判断手段２０１２は、文書データベース１０１にアクセスして、直接文書の作成時刻を取得（抽出）するようにしてもよい。 The time order determination unit 2012 acquires (extracts) the document creation time from the document database 101 through the order relation estimation unit 201. Then, the time order determination unit 2012 compares the creation times of the extracted documents, and determines a document that can be a citation destination and a document that can be a citation source. The time order determination unit 2012 may access the document database 101 to directly acquire (extract) the document creation time.

また、順序関係推定手段２０１は、著者Ａが文書ｂにアクセスできるか否かを、アクセス権判断手段２０１１を用いて判断する。アクセス権判断手段２０１１は、順序関係推定手段２０１を通して、文書データベース１０１から文書ｂに必要なアクセスレベルを抽出する。また、アクセス権判断手段２０１１は、順序関係推定手段２０１を通して、アクセスデータベース１０３から著者Ａのアクセスレベルを抽出する。そして、アクセス権判断手段２０１１は、抽出した文書ｂのアクセスレベルと著者Ａのアクセスレベルとを比較して、アクセス権条件を満たすか否かを判断する。 Further, the order relation estimation unit 201 uses the access right determination unit 2011 to determine whether or not the author A can access the document b. The access right determination unit 2011 extracts an access level necessary for the document b from the document database 101 through the order relation estimation unit 201. Further, the access right determination unit 2011 extracts the access level of the author A from the access database 103 through the order relation estimation unit 201. Then, the access right determination unit 2011 compares the access level of the extracted document b with the access level of the author A, and determines whether or not the access right condition is satisfied.

この場合、アクセス権判断手段２０１１は、著者Ａのアクセスレベルが文書ｂのアクセスレベル以上であれば、著者Ａが文書ｂにアクセスできる（アクセス権条件を満たす）と判断する。すなわち、アクセス権判断手段２０１１は、文書ｂが引用元となりえると判断する。また、アクセス権判断手段２０１１は、著者Ａのアクセスレベルが文書ｂのアクセスレベル以上でなければ、著者Ａが文書ｂにアクセスできない（アクセス権条件を満たさない）と判断する。すなわち、アクセス権判断手段２０１１は、文書ｂが引用元となりえないと判断する。 In this case, the access right determination unit 2011 determines that the author A can access the document b (the access right condition is satisfied) if the access level of the author A is equal to or higher than the access level of the document b. That is, the access right determination unit 2011 determines that the document b can be a citation source. Further, the access right determination unit 2011 determines that the author A cannot access the document b (the access right condition is not satisfied) unless the access level of the author A is equal to or higher than the access level of the document b. That is, the access right determination unit 2011 determines that the document b cannot be a citation source.

なお、アクセス権判断手段２０１１は、文書データベース１０１とアクセスデータベース１０３とに、直接アクセスするようにしてもよい。 Note that the access right determination unit 2011 may directly access the document database 101 and the access database 103.

図４は、ステップＳ１０４の順序関係の推定処理の一例を示す流れ図である。ステップＳ１０４において、順序関係推定手段２０１は、ステップＳ１０１で取得した文書集合に含まれる文書の組み合わせ（ｄｉ，ｄｊ）に対して、文書データベース１０１から文書（ｄｉ又はｄｊ）の作成時刻を抽出する。また、順序関係推定手段２０１は、文書ｄｉ及び文書ｄｊの著者のＩＤを用いて、アクセスレベル情報をアクセスデータベース１０３から取得（抽出）する。 FIG. 4 is a flowchart illustrating an example of the order relation estimation process in step S104. In step S104, the order relation estimation unit 201 extracts the creation time of the document (di or dj) from the document database 101 for the document combination (di, dj) included in the document set acquired in step S101. Further, the order relation estimation unit 201 acquires (extracts) access level information from the access database 103 using the IDs of the authors of the document di and the document dj.

すなわち、順序関係推定手段２０１は、文書（ｄｉ又はｄｊ）の作成時刻の直前のアクセスレベル情報を用いて引用順序を決定する。そして、順序関係推定手段２０１は、引用方向制約条件を用いて、以下に示す手順に従って、文書ｄｉと文書ｄｊとの引用順序を決める。 That is, the order relation estimation unit 201 determines the citation order using the access level information immediately before the creation time of the document (di or dj). Then, the order relation estimation unit 201 determines the citation order of the document di and the document dj according to the following procedure using the citation direction constraint.

順序関係推定手段２０１は、文書ｄｉの著者が文書ｄｊにアクセスできるか否かを判断するとともに、文書ｄｊの著者が文書ｄｉにアクセスできるか否かを判断する。文書ｄｊの著者が文書ｄｉにアクセスできるが、文書ｄｉの著者が文書ｄｊにアクセスできないと判断した場合には（ステップＳ４０１）、順序関係推定手段２０１は、文書ｄｊが引用先の文書であり、文書ｄｉが引用元の文書であると判断する（ステップＳ４０２）。 The order relation estimation unit 201 determines whether the author of the document di can access the document dj, and whether the author of the document dj can access the document di. When it determines that the author of the document dj can access the document di but the author of the document di cannot access the document dj (step S401), the order relation estimation unit 201 determines that the document dj is the citation destination document and the document di is the citation source document (step S402).

また、文書ｄｉの著者が文書ｄｊにアクセスできるが、文書ｄｊの著者が文書ｄｉにアクセスできないと判断した場合には（ステップＳ４０３）、順序関係推定手段２０１は、文書ｄｉが引用先の文書であり、文書ｄｊが引用元の文書であると判断する（ステップＳ４０４）。 When it determines that the author of the document di can access the document dj but the author of the document dj cannot access the document di (step S403), the order relation estimation unit 201 determines that the document di is the citation destination document and the document dj is the citation source document (step S404).

また、文書ｄｉの著者が文書ｄｊにアクセスでき、かつ、文書ｄｊの著者が文書ｄｉにアクセスできると判断した場合には（ステップＳ４０５のＹ）、順序関係推定手段２０１は、文書ｄｉと文書ｄｊとの作成時刻に基づいて、文書ｄｉと文書ｄｊとの順序関係を推定する。 When it is determined that the author of the document di can access the document dj and the author of the document dj can access the document di (Y in step S405), the order relation estimation unit 201 determines that the document di and the document dj The order relation between the document di and the document dj is estimated on the basis of the creation time.

文書ｄｉが文書ｄｊより先に作成されたと判断した場合には（ステップＳ４０６）、順序関係推定手段２０１は、文書ｄｉが引用元の文書であり、文書ｄｊが引用先の文書であると判断する（ステップＳ４０７）。逆に、文書ｄｊが文書ｄｉより先に作成されたと判断した場合には（ステップＳ４０８のＹ）、順序関係推定手段２０１は、文書ｄｉが文書ｄｊを引用していると判断する（ステップＳ４０９）。また、文書ｄｉと文書ｄｊとの作成時刻が同じであると判断した場合には（ステップＳ４０８のＮ）、順序関係推定手段２０１は、この２つの文書ｄｉ，ｄｊには引用関係がないと判断する（ステップＳ４１０）。 When it is determined that the document di is created before the document dj (step S406), the order relation estimation unit 201 determines that the document di is the citation source document and the document dj is the citation destination document. (Step S407). On the other hand, when it is determined that the document dj is created before the document di (Y in step S408), the order relation estimation unit 201 determines that the document di cites the document dj (step S409). . If it is determined that the creation times of the document di and the document dj are the same (N in step S408), the order relationship estimation unit 201 determines that there is no citation relationship between the two documents di and dj. (Step S410).

また、文書ｄｉの著者が文書ｄｊにアクセスできず、かつ、文書ｄｊの著者が文書ｄｉにアクセスできないと判断した場合には（ステップＳ４０５のＮ）、順序関係推定手段２０１は、文書ｄｉと文書ｄｊとに引用関係がないと判断する（ステップＳ４１０）。 When it determines that the author of the document di cannot access the document dj and the author of the document dj cannot access the document di (N in step S405), the order relation estimation unit 201 determines that there is no citation relationship between the document di and the document dj (step S410).

そして、ステップＳ１０５に移行し、順序関係推定手段２０１は、コントローラ１００に推定結果を返す（出力する）。 The process then proceeds to step S105, and the order relation estimation unit 201 returns (outputs) the estimation result to the controller 100.
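The decision procedure of FIG. 4 (steps S401 to S410) can be sketched in Python as follows. The `Doc` record and its field names are hypothetical, and the sketch simplifies one point: it compares a single stored author level, whereas the description uses the access level information immediately before the document's creation time.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Doc:
    doc_id: str
    created: float      # creation time in any comparable unit (assumption)
    access_level: int   # level required to read this document
    author_level: int   # access level held by this document's author


def can_access(reader: Doc, target: Doc) -> bool:
    # Access right condition: the reader's level must be at least the
    # level required by the target document.
    return reader.author_level >= target.access_level


def estimate_order(di: Doc, dj: Doc) -> Optional[Tuple[str, str]]:
    """Return (citation source id, citation destination id), or None."""
    i_reads_j = can_access(di, dj)
    j_reads_i = can_access(dj, di)
    if j_reads_i and not i_reads_j:        # S401 -> S402: dj may cite di
        return (di.doc_id, dj.doc_id)
    if i_reads_j and not j_reads_i:        # S403 -> S404: di may cite dj
        return (dj.doc_id, di.doc_id)
    if not (i_reads_j and j_reads_i):      # S405 N -> S410: no access either way
        return None
    if di.created < dj.created:            # S406 -> S407: di is older
        return (di.doc_id, dj.doc_id)
    if dj.created < di.created:            # S408 Y -> S409: dj is older
        return (dj.doc_id, di.doc_id)
    return None                            # S408 N -> S410: same creation time
```

Returning the pair as (source, destination) lets the caller pass the direction straight on to the citation degree calculation.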

（３）時空間制約条件を用いた引用度計算処理（ステップＳ１０６、ステップＳ１０７、ステップＳ１０８、及びステップＳ１０９）と、引用関係登録処理（ステップＳ１１０，Ｓ１１１）：文書ペア（ｄｉ，ｄｊ）に対して、コントローラ１００は、引用度計算手段２０２を用いて、パッセージ単位に総当たりで引用度を計算する。 (3) Citation degree calculation process using spatiotemporal constraint conditions (steps S106, S107, S108, and S109) and citation relationship registration process (steps S110 and S111): For each document pair (di, dj), the controller 100 uses the citation degree calculation unit 202 to calculate the citation degree exhaustively, for every combination of a passage in one document with a passage in the other.

引用は文書中のパッセージ単位で行われることが多いため、本実施形態では、暗黙引用関係発見システムは、パッセージ単位で引用度を計算して引用関係の有無を判定する。暗黙引用関係発見システムは、２つのパッセージの引用度が高ければ、この２つのパッセージには引用関係があると判断する。なお、この場合、この２つのパッセージを含む文書間にも引用関係があることになる。以下、特別の説明がない限り、文書ｄ１を引用元文書とし、文書ｄ２を引用先文書として説明を行う。 Since citation is often performed in units of passages in a document, in this embodiment, the implicit citation relationship discovery system determines the citation relationship by calculating the citation level in units of passages. If the quotation level of two passages is high, the implicit quotation relationship discovery system determines that the two passages have a quotation relationship. In this case, there is also a citation relationship between documents including these two passages. Hereinafter, the description will be made with the document d1 as a citation source document and the document d2 as a citation destination document unless otherwise specified.

引用度計算手段２０２は、文書の時間距離と、著者距離と、パッセージ間の類似度とを用いて、引用度を計算される。引用度計算手段２０２は、時間距離と著者距離とを文書単位で計算する。一方、引用度計算手段２０２は、類似度をパッセージ単位で計算する。 The citation degree calculation means 202 calculates the citation degree using the time distance of the document, the author distance, and the similarity between passages. The citation level calculation means 202 calculates the time distance and the author distance in document units. On the other hand, the citation level calculation means 202 calculates the similarity level in units of passages.
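The split between document-level and passage-level quantities can be sketched in Python as follows. The time distance and author distance are computed once per document pair and reused, while the similarity is evaluated for every passage combination. The `similarity` and `combine` callbacks are hypothetical: in particular, `combine` is a caller-supplied stand-in for the integration formula, which is not reproduced in this text, so the sketch makes no claim about its actual form.

```python
from itertools import product
from typing import Callable, Dict, Iterator, Tuple


def passage_citation_degrees(
        src_passages: Dict[str, str],      # passage id -> text, citation source document
        dst_passages: Dict[str, str],      # passage id -> text, citation destination document
        time_dist: float,                  # computed once per document pair
        author_dist: float,                # computed once per document pair
        similarity: Callable[[str, str], float],
        combine: Callable[[float, float, float], float],
) -> Iterator[Tuple[str, str, float]]:
    # Exhaustive pairing: every source passage against every destination passage.
    for (id1, text1), (id2, text2) in product(src_passages.items(),
                                              dst_passages.items()):
        sim = similarity(text1, text2)     # passage-level quantity
        # combine() is an assumed placeholder for the integration step.
        yield id1, id2, combine(sim, time_dist, author_dist)
```

Because the two distances do not depend on the passage, hoisting them out of the inner loop avoids recomputing them for each passage combination.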

本実施形態では、時間距離は、２つの文書の作成時刻の差である。また、著者距離は、組織空間における文書の著者を繋げるパスの最短距離であり、著者の繋がりの強弱を示す尺度である。 In this embodiment, the time distance is the difference between the creation times of two documents. The author distance is the shortest distance of the path connecting the authors of the document in the organization space, and is a measure indicating the strength of the connection between the authors.

基本的に、類似度が高いほど、パッセージ間の引用可能性が高く、引用度が高くなる。また、著者距離が短いほど、パッセージ間の引用可能性が高く、引用度が高くなる。また、時間距離が小さい又は大きいパッセージ間の引用度は小さくなる。 Basically, the higher the similarity, the more likely a citation between the passages, and the higher the citation degree. Likewise, the shorter the author distance, the more likely a citation between the passages, and the higher the citation degree. The citation degree is lower between passages whose time distance is either very small or very large.

引用度計算手段２０２は、ステップＳ１０６で計算した時間距離と、ステップＳ１０７で計算した著者距離と、ステップＳ１０８で計算したパッセージ類似度とを用いて、ステップＳ１０９において引用度を計算する。この場合、引用度計算手段２０２は、統合計算手段２０２４を用いて、文書ｄ１のパッセージｐ１と文書ｄ２のパッセージｐ２との引用度ｃｉｔを、次の式（１）に従って計算する。 The citation level calculation means 202 calculates the citation level in step S109 using the time distance calculated in step S106, the author distance calculated in step S107, and the passage similarity calculated in step S108. In this case, the citation level calculation unit 202 uses the integrated calculation unit 2024 to calculate the citation level cit between the passage p1 of the document d1 and the passage p2 of the document d2 according to the following equation (1).

ただし、式（１）において、ｓｉｍはパッセージの類似度である。また、式（１）において、ｔｉｍｅｄｉｓは時間距離であり、ａｕｔｈｄｉｓは著者距離である。 However, in Formula (1), sim is a passage similarity. In the formula (1), timedis is a time distance, and authdis is an author distance.

引用度計算手段２０２が求めた文書ｄ１のパッセージｐ１と文書ｄ２のパッセージｐ２との引用度ｃｉｔが予め定義された閾値より大きければ、コントローラ１００は、文書ｄ１のパッセージｐ１と文書ｄ２のパッセージｐ２との間に引用関係があると判断する。そして、コントローラ１００は、文書ｄ１のＩＤ、パッセージｐ１のＩＤ、文書ｄ２のＩＤ、パッセージｐ２のＩＤ、及び引用度ｃｉｔを対応付けた形で、仮想引用データベース１０４に引用関係の判定結果の登録を行う。なお、この引用関係の判定結果の登録処理は、ステップＳ１１１で行われる。 If the citation degree cit between the passage p1 of the document d1 and the passage p2 of the document d2, as obtained by the citation degree calculation unit 202, is larger than a predefined threshold, the controller 100 determines that there is a citation relationship between the passage p1 of the document d1 and the passage p2 of the document d2. The controller 100 then registers the determination result in the virtual citation database 104 as a record that associates the ID of the document d1, the ID of the passage p1, the ID of the document d2, the ID of the passage p2, and the citation degree cit. This registration is performed in step S111.

以下、パッセージの類似度、時間距離、及び著者距離の計算方法についてそれぞれ説明する。 Hereinafter, a method of calculating passage similarity, time distance, and author distance will be described.

（３−１）時間間隔条件を用いた時間距離計算（ステップＳ１０６）：一般に、同時に作成された文書には引用関係が存在する可能性が低い。つまり、引用元の文書と引用先の文書との作成時刻が近いほど、引用関係が存在する可能性が低い。一方、この文書間の作成時刻の差が大きくなると、文書が読まれる可能性が高くなるので、引用される可能性が大きくなる。 (3-1) Time distance calculation using the time interval condition (step S106): In general, documents created at the same time are unlikely to have a citation relationship. In other words, the closer the creation times of the citation source document and the citation destination document, the lower the likelihood that a citation relationship exists. As the difference in creation time between the documents grows, on the other hand, the source document is more likely to have been read, so it is more likely to be cited.

However, if the difference in creation time between the documents becomes extremely large, the earlier document tends to be forgotten unless it is exceptionally good, and the possibility that it is cited falls. That is, as shown in FIG. 5, as the difference in creation time between documents increases, the possibility of citation between them first rises and then, from a certain point, declines.

In the present embodiment, the time distance calculation means 2022 obtains, as the time distance, the difference in document creation time converted to (normalized by) a unit time distance. In step S106, the time distance calculation means 2022 acquires (extracts), through the controller 100, the creation time of each document, the document type, and the unit-time-distance conversion parameter corresponding to the document type from the document database 101. Using this information, the time distance calculation means 2022 calculates the time distance between document d1 and document d2 according to the following Equation (2).

timedis(d1, d2) = |time(d1) - time(d2)| / μ(d1)   ... Equation (2)

In Equation (2), time(d1) and time(d2) are the creation times of document d1 and document d2, respectively, expressed in hours. μ is the unit-time-distance conversion parameter corresponding to the type of the citation source document. As described above, this conversion parameter is set in advance for each document type.

For example, when the document is a weekly report, the conversion parameter μ can be set to 24 hours (1 day). When the document is an in-house report RN, the conversion parameter can be set to 720 hours (1 month). By converting to a unit time distance in this way, the time distance calculation means 2022 can obtain a time distance from which the influence of how long a given type of document remains in effect has been removed.

なお、時間距離計算手段２０２２は、文書データベース１０１にアクセスして、文書の作成時刻と換算パラメータμとを直接取得（抽出）するようにしてもよい。 The time distance calculation unit 2022 may access the document database 101 and directly acquire (extract) the document creation time and the conversion parameter μ.
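As an illustration of Equation (2), the following is a minimal sketch in Python. The dictionary `UNIT_HOURS`, its keys, and the function signature are assumptions made for this example; only the 24-hour and 720-hour values come from the text above.

```python
from datetime import datetime

# Hypothetical lookup of the conversion parameter mu (in hours) per document type.
# The two values are the examples given in the text; the keys are illustrative.
UNIT_HOURS = {"weekly_report": 24.0, "in_house_report": 720.0}


def timedis(time_d1: datetime, time_d2: datetime, source_type: str) -> float:
    """Equation (2): |time(d1) - time(d2)| / mu(d1), with times measured in hours.

    source_type is the type of the citation source document d1.
    """
    diff_hours = abs((time_d1 - time_d2).total_seconds()) / 3600.0
    return diff_hours / UNIT_HOURS[source_type]
```

With these assumed parameters, two weekly reports created 48 hours apart are 2 unit distances apart, while two in-house reports created 30 days apart are 1 unit distance apart.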

(3-2) Author distance calculation using the organization correlation condition (step S107): In general, authors who are strongly connected within an organization work close to each other and are therefore likely to communicate closely. As a result, they tend to understand each other's thinking and each other's documents well, and are likely to cite documents created by the other. For example, authors who are colleagues in the same department, or who stand in a supervisor and subordinate relationship, are likely to cite each other's documents.

In the present embodiment, the author distance calculation means 2023 calculates the author distance using the organization graph acquired (extracted) from the organization configuration table storage means 102. For example, in the organization graph shown in FIG. 2(a), nodes correspond to employees and edges represent the organizational relationships between employees. The author distance calculation means 2023 calculates the author distance from an organization graph such as that shown in FIG. 2(a) by the following process.

The author distance calculation means 2023 acquires (extracts) the organization graph from the organization configuration table storage means 102, through the citation degree calculation means 202, based on the creation time of the citation destination document d2. That is, the author distance calculation means 2023 obtains the author distance by acquiring and using the organization graph that was in effect immediately before the creation time of document d2. The author distance calculation means 2023 may also acquire (extract) the organization graph information directly from the organization configuration table storage means 102 without going through the citation degree calculation means 202.
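Selecting the organization graph that was in effect immediately before the creation time of d2 can be sketched as a lookup over time-stamped snapshots. The snapshot list layout and the function name are assumptions for illustration; the text does not specify how the organization configuration table storage means 102 stores its history.

```python
from bisect import bisect_left


def snapshot_before(snapshots, created_at):
    """Return the latest graph whose effective time is strictly before created_at.

    snapshots: list of (effective_time, graph) tuples sorted by effective_time
               (a hypothetical storage layout).
    Returns None when no snapshot precedes created_at.
    """
    times = [t for t, _ in snapshots]
    idx = bisect_left(times, created_at)  # index of first snapshot at or after created_at
    return snapshots[idx - 1][1] if idx > 0 else None
```

Any mutually comparable time values work here, for example `datetime` objects or integer timestamps.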

In step S107, the author distance calculation means 2023 obtains the author distance authdis between document d1 and document d2 according to the following Equation (3). That is, the author distance calculation means 2023 calculates it as the length of the shortest path connecting node d1.author and node d2.author in the organization graph. The length of a path is counted as the number of edges on the path. When a document di and a document dj have multiple authors, the author distance calculation means 2023 calculates the author distance for every combination of authors.

authdis(d1, d2) = shortestpath(d1.author, d2.author)   ... Equation (3)

The shorter this author distance, the stronger the connection between the authors of document d1 and document d2, and the more likely they are to work in the same space. The controller 100 can therefore judge that one is likely to cite the other's document.

For example, in the organization graph shown in FIG. 2(a), the distance between "S supervisor" and "H manager" is 1, and the distance between "S supervisor" and "K senior researcher" is 2. The controller 100 therefore judges that the connection between "S supervisor" and "H manager" is stronger, and that "S supervisor" is more likely to cite a document by "H manager" than a document by "K senior researcher".

The author distance calculation means 2023 can find the shortest path connecting two authors by solving the shortest path problem on the graph. For example, it can find the shortest path between authors using Dijkstra's algorithm. A method for finding the shortest path with Dijkstra's algorithm is described in, for example, Document B: Kiyoshi Ishihata, "Algorithms and Data Structures", Iwanami Shoten, pp. 260-270.
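The author distance of Equation (3) can be sketched as follows. Because every edge counts as length 1, this sketch uses breadth-first search, which gives the same result as Dijkstra's algorithm on a unit-weight graph; it is a substitution made for brevity, not the method the text names. The adjacency-dict representation and the toy graph `ORG` are assumptions, chosen only to be consistent with the distances quoted for FIG. 2(a).

```python
from collections import deque
from itertools import product


def shortest_path_length(graph, start, goal):
    """Number of edges on the shortest path from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def author_distances(graph, authors_d1, authors_d2):
    """Author distance for every combination of authors of the two documents."""
    return {(a, b): shortest_path_length(graph, a, b)
            for a, b in product(authors_d1, authors_d2)}


# Hypothetical undirected organization graph (adjacency lists).
ORG = {
    "S supervisor": ["H manager"],
    "H manager": ["S supervisor", "K senior researcher"],
    "K senior researcher": ["H manager"],
}
```

How the per-pair distances are combined into a single value when documents have several authors is not stated in this section, so the sketch simply returns all of them.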

（３−３）類似度計算（ステップＳ１０８）：類似度計算手段２０２１は、例えば、以下に示す式（４）を用いて、類似度を計算することができる。なお、式（４）は、ベクトル空間モデルに基づいてキーワードベクトルの余弦を計算する式である（文献Ａ参照）。 (3-3) Similarity calculation (step S108): The similarity calculation means 2021 can calculate the similarity using, for example, the following equation (4). Expression (4) is an expression for calculating the cosine of the keyword vector based on the vector space model (see Document A).

In Equation (4), the keyword vector for passage p1 is (x1, x2, ..., xn), and the keyword vector for passage p2 is (y1, y2, ..., ym).

類似度計算手段２０２１は、引用度計算手段２０２を通して、文書データベース１０１から、パッセージｐ１とｐ２とのテキストをそれぞれ抽出する。なお、類似度計算手段２０２１は、引用度計算手段２０２を通さず、文書データベース１０１から直接パッセージｐ１とｐ２とのテキストを取得（抽出）してもよい。パッセージ間の類似度が高い場合には、パッセージ間に引用関係のある可能性が高い。なお、類似度計算手段２０２１は、文書以外のコンテンツである場合、相応する類似度の計算式を用意して類似度計算を行う。 The similarity calculation unit 2021 extracts the texts of the passages p1 and p2 from the document database 101 through the citation level calculation unit 202, respectively. The similarity calculation unit 2021 may acquire (extract) the texts of the passages p1 and p2 directly from the document database 101 without passing through the citation level calculation unit 202. If the similarity between the passages is high, there is a high possibility that there is a citation relationship between the passages. In the case of content other than a document, the similarity calculation unit 2021 prepares a corresponding similarity calculation formula and performs similarity calculation.
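The cosine of two keyword vectors under the vector space model can be sketched as below. Representing each keyword vector as a mapping from keyword to weight, and building it by whitespace tokenization with raw counts, are assumptions for this example; the text does not specify keyword extraction or weighting.

```python
import math
from collections import Counter


def keyword_vector(text):
    """Hypothetical keyword extraction: lowercase whitespace tokens with raw counts."""
    return Counter(text.lower().split())


def cosine_similarity(x, y):
    """Cosine of the angle between two sparse keyword vectors (dict: keyword -> weight)."""
    dot = sum(w * y.get(k, 0.0) for k, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # an empty passage is treated as dissimilar to everything
    return dot / (norm_x * norm_y)
```

Using sparse mappings lets the two passages have different numbers of keywords; keywords missing from one passage contribute zero to the dot product.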

As described above, according to the present embodiment, citation relationships between contents are extracted by taking into account both the difference in creation time between the contents and the degree of relationship between the authors who created them, so false detections of citation relationships can be excluded. It is therefore possible both to extract implicit citation relationships in content and to improve the accuracy of that extraction.

That is, the present embodiment provides an order relation estimation means that uses the citation direction constraint, and a citation degree calculation means that uses the similarity, the time distance, and the author distance based on spatiotemporal constraint conditions. Because citation relationships are extracted based on the citation degree, even implicit citation relationships that are not stated explicitly in the content can be extracted. In addition, because the extraction takes the time distance and the author distance into account, it can exclude the false detections that a purely similarity-based method would produce, namely cases judged to be citations even though no citation relationship actually exists. The accuracy of citation relationship detection is thereby improved.

本実施形態に示した手法によって構築された暗黙引用関係抽出システムの仮想引用データベースと文書データベースとを参照することによって、以下のような利用方法が可能となる。例えば、文書データベース内の文書ＩＤやファイルパスを参照して社内の文書を表示し、表示した文書に関連する文書を表示することができる。また、文書中のパッセージ間で自動的にハイパーリンクを生成し、相互に参照を行うことが可能となる。また、引用関係をＷｅｂのリンクと同様に見なせば、Ｗｅｂの検索と同様にリンク関係を用い、重要文書のランキングを行うことができる。 By referring to the virtual citation database and the document database of the implicit citation relationship extraction system constructed by the method shown in the present embodiment, the following usage method becomes possible. For example, an in-house document can be displayed with reference to a document ID or file path in the document database, and a document related to the displayed document can be displayed. In addition, it is possible to automatically generate hyperlinks between passages in a document and refer to each other. If the citation relationship is regarded in the same way as a Web link, the ranking of important documents can be performed using the link relationship in the same manner as the Web search.

Furthermore, performance evaluation becomes possible by obtaining, per passage in the citation relationships, the novelty (degree of newness) and the authority (degree of being cited), and identifying authors who create highly original documents.

なお、ノベルティやオーソリティ、オリジナリティは、例えば以下のような式（５）を用いて求めることができる。 Note that novelty, authority, and originality can be obtained using, for example, the following equation (5).

O(p) = A(p) · N(p)
N(p) = 1 / (refin(p) + 1)
A(p) = refout(p)
... Equation (5)

ここで、式（５）において、Ｏ（ｐ）はパッセージｐのオリジナリティであり、Ｎ（ｐ）はパッセージｐのノベルティであり、Ａ（ｐ）はパッセージｐのオーソリティである。また、ｒｅｆｏｕｔ（ｐ）はパッセージｐを引用するパッセージの数であり、ｒｅｆｉｎ（ｐ）はパッセージｐが引用しているパッセージの数である。例えば、著者毎の文書内のパッセージのオリジナリティを求め、平均値を求めることで、著者の作成する文書の平均のオリジナリティを求めることができる。 Here, in Equation (5), O (p) is the originality of the passage p, N (p) is the novelty of the passage p, and A (p) is the authority of the passage p. Further, refout (p) is the number of passages that quote the passage p, and refin (p) is the number of passages that the passage p cites. For example, by obtaining the originality of the passage in the document for each author and obtaining the average value, the average originality of the document created by the author can be obtained.
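Equation (5) and the per-author averaging described above can be sketched as follows. The input layout, a list of (author, refout, refin) tuples, is an assumption for illustration; in practice these counts would be derived from the virtual citation database.

```python
from collections import defaultdict
from statistics import mean


def originality(refout: int, refin: int) -> float:
    """Equation (5): O(p) = A(p) * N(p), with A(p) = refout(p) and N(p) = 1 / (refin(p) + 1)."""
    novelty = 1.0 / (refin + 1)
    authority = refout
    return authority * novelty


def average_originality_by_author(passages):
    """passages: iterable of (author, refout, refin) tuples, one per passage (hypothetical layout)."""
    scores = defaultdict(list)
    for author, refout, refin in passages:
        scores[author].append(originality(refout, refin))
    return {author: mean(values) for author, values in scores.items()}
```

A passage that is cited by many others and cites few itself scores highest; a passage that is never cited scores 0 regardless of its novelty.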

Embodiment 2.
Next, a second embodiment of the present invention will be described with reference to the drawings. FIG. 6 is a block diagram showing a configuration example of the implicit citation relationship discovery system in the second embodiment. As shown in FIG. 6, the implicit citation relationship discovery system of this embodiment differs from the first embodiment in that it includes a document registration monitoring means 301 in addition to the components shown in the first embodiment. It also differs from the first embodiment in that the order relation estimation means 201 does not include the time order determination means 2012.

That is, in this embodiment, the implicit citation relationship discovery system includes the controller 100, the document database 101, the organization configuration table storage means 102, the access database 103, the virtual citation database 104, the order relation estimation means 201, the citation degree calculation means 202, and the document registration monitoring means 301. In the following, components that are the same as in the first embodiment are given the same reference numerals as in the block diagram shown in FIG. 1, and their detailed description is omitted.

文書登録監視手段３０１は、具体的には、プログラムに従って動作する情報処理装置のＣＰＵによって実現される。文書監視手段３０１は、文書データベース１０１への新規文書の登録をモニタリングする機能を備える。 Specifically, the document registration monitoring unit 301 is realized by a CPU of an information processing apparatus that operates according to a program. The document monitoring unit 301 has a function of monitoring registration of a new document in the document database 101.

In the present embodiment, for each new document detected by the document registration monitoring means 301, the controller 100 determines the citation relationship between the new document and each document stored in the document database 101 that was registered before the new document (a registered document). To do so, the controller 100 uses the order relation estimation means 201 and the citation degree calculation means 202. The controller 100 then registers the determination result in the virtual citation database 104.

The order relation estimation means 201 includes the access right determination means 2011. Based on the citation direction constraint, the order relation estimation means 201 estimates which document of a pair is the citation source and which is the citation destination. Specifically, it estimates the citation order between the new document detected by the document registration monitoring means 301 and a registered document, registered before the new document, that was acquired (extracted) from the document database 101 through the controller 100. In doing so, the order relation estimation means 201 uses the access right determination means 2011 and bases its estimate on the access right condition, namely that citing a document requires the right to access the citation source document. In this way the order relation estimation means 201 decides which document is the citation source and which is the citation destination.

引用度計算手段２０２は、パッセージの引用の可能性を示す引用度を計算する。 The citation degree calculation means 202 calculates a citation degree indicating the possibility of passage citation.

コントローラ１００は、順序関係推定手段２０１と引用度計算手段２０２とを用いて、パッセージ間の引用関係を抽出して、仮想引用データベース１０４に格納させる。 The controller 100 uses the order relation estimation means 201 and the citation degree calculation means 202 to extract the citation relation between passages and store it in the virtual citation database 104.

次に、動作について説明する。図７は、第２の実施形態における暗黙引用関係発見システムがコンテンツ（電子文書中のパッセージ）間に含まれる暗黙引用関係を抽出する処理の一例を示す流れ図である。なお、本実施形態において、第１の実施形態と同様の処理を行うステップについては、詳細な説明を省略する。 Next, the operation will be described. FIG. 7 is a flowchart illustrating an example of a process for extracting an implicit citation relationship included between contents (passages in an electronic document) by the implicit citation relationship discovery system according to the second embodiment. In the present embodiment, detailed description of steps for performing the same processing as in the first embodiment is omitted.

本実施形態では、文書登録監視手段３０１は、文書データベース１０１の新規文書の登録を繰り返しモニタリングしている。例えば、文書登録監視手段３０１は、所定時間毎に、文書データベース１０１に新規文書が登録されたか否かを判断する（ステップＳ２００）。新規文書が登録されたと判断すると、文書登録監視手段３０１は、新規文書が登録された旨を、コントローラ１００に知らせる（例えば、通知情報を出力する）。新規文書の登録がなければ、文書登録監視手段３０１は、ステップＳ２００のモニタリングの処理を継続する。 In this embodiment, the document registration monitoring unit 301 repeatedly monitors registration of new documents in the document database 101. For example, the document registration monitoring unit 301 determines whether or not a new document is registered in the document database 101 every predetermined time (step S200). If it is determined that a new document has been registered, the document registration monitoring unit 301 notifies the controller 100 that the new document has been registered (for example, outputs notification information). If no new document is registered, the document registration monitoring unit 301 continues the monitoring process in step S200.

コントローラ１００は、文書登録監視手段３０１が検出した新規文書ｄに対して、引用関係抽出の処理を開始する（ステップＳ２０１）。コントローラ１００は、文書データベース１０１から、文書登録監視手段３０１が検出した新しい文書の登録の時刻より以前に作成された既登録文書の集合Ｄを取得（抽出）する（ステップＳ２０２）。そして、コントローラ１００は、ステップＳ２００で検出した個々の新規文書ｄに対して、以下の処理を行う。 The controller 100 starts a citation relation extraction process for the new document d detected by the document registration monitoring unit 301 (step S201). The controller 100 acquires (extracts) a set D of registered documents created before the registration time of the new document detected by the document registration monitoring unit 301 from the document database 101 (step S202). Then, the controller 100 performs the following processing for each new document d detected in step S200.

The controller 100 repeatedly executes the processing of steps S204 to S211 described below for each document di (0 < i < D.count) in the extracted document set D (step S203). Here, i is the index of the document and D.count is the total number of documents.

まず、コントローラ１００は、順序関係推定手段２０１に、処理対象となる文書ｄと文書ｄｉとを渡す（出力する）。順序関係推定手段２０１は、アクセス権判断手段２０１１を用いて、文書ｄと文書ｄｉとの引用の順序関係を推定する（ステップＳ２０４）。そして、順序関係の推定結果をコントローラ１００に返す（出力する）。 First, the controller 100 passes (outputs) the document d and the document di to be processed to the order relation estimation unit 201. The order relation estimation unit 201 uses the access right judgment unit 2011 to estimate the order relation of the citation between the document d and the document di (step S204). Then, the estimation result of the order relation is returned (output) to the controller 100.

コントローラ１００は、順序関係推定手段２０１から順序関係の推定結果を受け取る（入力する）。そして、コントローラ１００は、入力した推定結果が文書ｄと文書ｄｉとに引用の順序関係があることを示しているか否かを判断する（ステップＳ２０５）。文書ｄと文書ｄｉとに引用の順序関係がないという推定結果であれば、コントローラ１００は、そのままステップＳ２１２にジャンプ（移行）する。そして、ステップＳ２０４〜Ｓ２１２のループ処理を繰り返す。 The controller 100 receives (inputs) an order relation estimation result from the order relation estimation unit 201. Then, the controller 100 determines whether or not the input estimation result indicates that the document d and the document di have a citation order relationship (step S205). If the estimation result indicates that there is no citation order relationship between the document d and the document di, the controller 100 jumps (transfers) to step S212 as it is. Then, the loop processing of steps S204 to S212 is repeated.

文書ｄと文書ｄｉとに引用の順序関係があるという推定結果であれば、コントローラ１００は、文書ｄ及び文書ｄｉとともに、文書ｄと文書ｄｉとの引用の順序関係を引用度計算手段２０２に渡す（出力する）。 If the estimation result indicates that there is a citation order relationship between the document d and the document di, the controller 100 passes the citation order relationship between the document d and the document di to the citation degree calculation unit 202 together with the document d and the document di. (Output).

次いで、引用度計算手段２０２は、時間距離計算手段２０２２を用いて、文書ｄと文書ｄｉとの時間距離を計算する（ステップＳ２０６）。また、同時に、引用度計算手段２０２は、著者距離計算手段２０２３を用いて、文書ｄの著者と文書ｄｉの著者との著者距離を計算する（ステップＳ２０７）。さらに、同時に、引用度計算手段２０２は、類似度計算手段２０２１を用いて、文書ｄ及び文書ｄｉに含まれるパッセージの類似度を計算する（ステップＳ２０８）。 Next, the citation degree calculation unit 202 calculates the time distance between the document d and the document di using the time distance calculation unit 2022 (step S206). At the same time, the citation level calculation unit 202 calculates the author distance between the author of the document d and the author of the document di using the author distance calculation unit 2023 (step S207). Further, at the same time, the citation level calculation unit 202 uses the similarity level calculation unit 2021 to calculate the similarity level of the passages included in the document d and the document di (step S208).

次いで、引用度計算手段２０２は、ステップＳ２０６で計算した時間距離と、ステップＳ２０７で計算した著者距離と、ステップＳ２０８で計算したパッセージ類似度とを用いて、引用度ｃを求める。この場合、引用度計算手段２０２は、統合計算手段２０２４を利用して、文書ｄ及び文書ｄｉに含まれる２つのパッセージの組み合わせの引用度ｃを計算する（ステップＳ２０９）。そして、引用度計算手段２０２は、引用度ｃの計算結果を、コントローラ１００に渡す（出力する）。 Next, the citation degree calculating means 202 obtains the citation degree c using the time distance calculated in step S206, the author distance calculated in step S207, and the passage similarity calculated in step S208. In this case, the citation level calculation unit 202 uses the integrated calculation unit 2024 to calculate the citation level c of the combination of the two passages included in the document d and the document di (step S209). Then, the citation level calculation means 202 passes (outputs) the calculation result of the citation level c to the controller 100.

次いで、コントローラ１００は、ステップＳ２０９で計算した引用度ｃの値を引用度計算手段２０２から受け取る（入力する）。そして、コントローラ１００は、入力した引用度ｃと予め定義されている閾値との比較を行い、引用度ｃの値が所定の閾値より大きいか否かを判断する（ステップＳ２１０）。 Next, the controller 100 receives (inputs) the value of the citation degree c calculated in step S209 from the citation degree calculation means 202. Then, the controller 100 compares the input citation level c with a predefined threshold value, and determines whether or not the value of the citation level c is greater than a predetermined threshold value (step S210).

コントローラ１００は、引用度ｃが所定の閾値より大きいと判断した場合には、文書ｄと文書ｄｉとに含まれるパッセージの組み合わせに引用関係があると判断する。そして、コントローラ１００は、引用関係があると判断した判定結果を、仮想引用データベース１０４に登録する（ステップＳ２１１）。 If the controller 100 determines that the citation degree c is greater than a predetermined threshold, the controller 100 determines that there is a citation relationship in the combination of passages included in the document d and the document di. Then, the controller 100 registers the determination result determined to have a citation relationship in the virtual citation database 104 (step S211).

次いで、コントローラ１００は、文書の順番を示す係数ｉに１加算（ｉ＝ｉ＋１）して、ステップＳ２０３に移行する。すなわち、次の文書ｄｉについて、ステップＳ２０４〜Ｓ２１１の処理を行う。 Next, the controller 100 adds 1 to the coefficient i indicating the document order (i = i + 1), and proceeds to step S203. That is, the processing of steps S204 to S211 is performed for the next document di.

その後、文書登録監視手段３０１が次の新規文書ｄを検出した場合には（ステップＳ２１３）、コントローラ１００は、次の新規文書ｄを対象にステップＳ２０１〜Ｓ２１２と同様の処理を行う。 Thereafter, when the document registration monitoring unit 301 detects the next new document d (step S213), the controller 100 performs the same processing as steps S201 to S212 on the next new document d.
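The per-document loop of steps S203 to S211 can be sketched as below. The function name, the record layout, and the two injected callables are hypothetical stand-ins for the order relation estimation means 201 and the citation degree calculation means 202; the formula that combines similarity, time distance, and author distance is deliberately left to the callable.

```python
def process_new_document(d, registered_docs, estimate_order, passage_citation_degrees, threshold):
    """Return citation records for one newly registered document d.

    estimate_order(d, di)                  -> order result, or None if no citation order (S204, S205)
    passage_citation_degrees(d, di, order) -> iterable of (passage_of_d, passage_of_di, c) (S206-S209)
    """
    records = []
    for di in registered_docs:                        # S203: loop over the registered set D
        order = estimate_order(d, di)                 # S204: access-right based estimation
        if order is None:                             # S205: no order relation, skip to next di
            continue
        for p, pi, c in passage_citation_degrees(d, di, order):
            if c > threshold:                         # S210: compare with the predefined threshold
                records.append((d, p, di, pi, c))     # S211: record to register in the database
    return records
```

Passing the two means in as functions keeps the control flow testable without committing to any particular scoring formula.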

以下、第１の実施形態とは異なる処理を行うステップＳ２００，Ｓ２０２，Ｓ２０４の処理について説明する。まず、ステップＳ２００において、文書登録監視手段３０１は、文書データベース１０１をモニタリングし、新規文書の登録があるか否かを監視する。新規文書の登録を検出したら、文書登録監視手段３０１は、新規登録された新規文書群をコントローラ１００に知らせる（例えば、通知情報を出力する）。そして、コントローラ１００は、新規文書群と、文書データベース１０１に以前に登録された既登録文書との引用関係の抽出を行う。つまり、本実施形態では、暗黙引用関係発見システムは、新規文書の登録をトリガとして、引用関係の発見の処理を行う。文書登録監視手段３０１による新規文書登録の検出は、このトリガの役割を果たす。 Hereinafter, the processes of steps S200, S202, and S204 that perform processes different from those of the first embodiment will be described. First, in step S200, the document registration monitoring unit 301 monitors the document database 101 to monitor whether a new document is registered. When the registration of the new document is detected, the document registration monitoring unit 301 notifies the controller 100 of the newly registered new document group (for example, outputs notification information). Then, the controller 100 extracts a citation relationship between the new document group and the already registered documents previously registered in the document database 101. In other words, in the present embodiment, the implicit citation relationship discovery system performs citation relationship discovery processing with the registration of a new document as a trigger. Detection of new document registration by the document registration monitoring unit 301 plays a role of this trigger.

ステップＳ２０２において、コントローラ１００は、文書データベース１０１にアクセスして、ステップＳ２００で検出した新規文書より以前に登録された既登録文書の集合を、文書データベース１０１から取得（抽出）する。本実施形態では、文書データベース１０１への登録時刻が文書の作成時刻を示している。そのため、コントローラ１００は、文書データベース１０１が格納する文書の作成時刻をチェックすることによって、新規文書より以前に登録された既登録文書集合を取得（抽出）することができる。 In step S202, the controller 100 accesses the document database 101, and acquires (extracts) from the document database 101 a set of registered documents registered before the new document detected in step S200. In this embodiment, the registration time in the document database 101 indicates the document creation time. Therefore, the controller 100 can acquire (extract) a registered document set registered before the new document by checking the creation time of the document stored in the document database 101.

FIG. 8 is a flowchart showing an example of the order relation estimation processing of step S204 in the second embodiment. In this embodiment, every document di in the document set D is already known to have been created before the new document d, so the order relation estimation means 201 does not need to check the time condition. The order relation estimation means 201 therefore uses the access right determination means 2011 to decide the citation order of document d and document di by the procedure described below, based on the author information acquired (extracted) from the document database 101 and the access right conditions acquired (extracted) from the access database 103. The access right determination means 2011 may also access the document database 101 and the access database 103 directly, without going through the order relation estimation means 201.

The order relation estimation means 201 determines whether the author of document di can access document d, and whether the author of document d can access document di. If the author of document d can access document di but the author of document di cannot access document d (step S451), the order relation estimation means 201 determines that document d is the citation destination (the citing document) and document di is the citation source (step S452).

If the author of document di can access document d but the author of document d cannot access document di (step S453), the order relation estimation means 201 determines that document di is the citation destination and document d is the citation source (step S454).

If the author of document d can access document di and the author of document di can also access document d (Y in step S455), the order relation estimation means 201 determines that document d is the citation destination and document di is the citation source (step S456).

If the author of document d cannot access document di and the author of document di cannot access document d (N in step S455), the order relation estimation means 201 determines that there is no citation relationship between document di and document d (step S457).

そして、ステップＳ２０５に移行し、順序関係推定手段２０１は、コントローラ１００に推定結果を渡す（出力する）。 Then, the process proceeds to step S 205, and the order relation estimation unit 201 passes (outputs) the estimation result to the controller 100.
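The four cases of steps S451 to S457 can be sketched as a small decision function. The function name, the boolean parameters, and the (source, destination) return convention are assumptions for illustration; how access rights are looked up in the access database 103 is outside this sketch.

```python
from typing import Optional, Tuple


def estimate_order(d_author_can_access_di: bool,
                   di_author_can_access_d: bool) -> Optional[Tuple[str, str]]:
    """Return (citation_source, citation_destination), or None when no citation is possible.

    "d" is the new document and "di" is a previously registered document.
    """
    if d_author_can_access_di and not di_author_can_access_d:
        return ("di", "d")   # S451 -> S452: d cites di
    if di_author_can_access_d and not d_author_can_access_di:
        return ("d", "di")   # S453 -> S454: di cites d
    if d_author_can_access_di and di_author_can_access_d:
        return ("di", "d")   # S455 (Y) -> S456: fall back to d citing di
    return None              # S455 (N) -> S457: no citation relationship
```

When both authors have access, the sketch follows the text in treating the new document d as the citing side, which matches the fact that di was registered earlier.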

As described above, according to the present embodiment, the order relation estimation means 201 can estimate the citation relationship between documents by acquiring and processing only the registered documents that were registered before the new document detected by the document registration monitoring means 301. The time condition check can therefore be omitted from the citation order estimation, which reduces the processing load. In addition, since only the registered documents need to be processed, the number of documents to be processed is reduced.

Next, the minimum configuration of the citation relationship extraction system according to the present invention will be described. FIG. 9 is a block diagram showing a minimum configuration example of the citation relationship extraction system. As shown in FIG. 9, the citation relationship extraction system includes, as its minimum components, a document database 101 and citation degree calculation means 202.

The citation degree calculation means 202 has a function of calculating a citation degree, which indicates the likelihood that a citation was made between contents, based on the difference in creation, update, or reference time between the contents and on the degree of relationship between the authors who created, updated, or referenced the contents. The controller 100 has a function of extracting citation relationships between contents based on the citation degree calculated by the citation degree calculation means 202.

According to the minimum-configuration citation relationship extraction system shown in FIG. 9, citation relationships between contents are extracted in consideration of both the difference in creation, update, or reference time between the contents and the degree of relationship between the authors who created, updated, or referenced the contents, so false detections of citation relationships can be excluded. It is therefore possible to extract implicit citation relationships in content and to improve the accuracy of that extraction.
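
The minimum configuration can be pictured with a short sketch that computes a citation degree from a time difference and an author relationship and then keeps the pairs above a threshold. The scoring formula, the half-life parameter, the threshold, and the function names are all assumptions chosen for illustration; this passage does not state how the two factors are combined.

```python
def citation_degree(time_diff_days: float,
                    author_relation: float,
                    half_life_days: float = 30.0) -> float:
    """ASSUMED scoring: the likelihood halves every half_life_days and is
    scaled by the author relationship, given as a value between 0 and 1."""
    if time_diff_days < 0:
        raise ValueError("time_diff_days must be non-negative")
    if not 0.0 <= author_relation <= 1.0:
        raise ValueError("author_relation must be between 0 and 1")
    time_factor = 0.5 ** (time_diff_days / half_life_days)
    return time_factor * author_relation


def extract_citations(pairs, threshold: float = 0.5):
    """Keep the (source, destination) pairs whose citation degree reaches the threshold.

    pairs: iterable of (source_id, destination_id, time_diff_days, author_relation).
    """
    return [
        (source, destination)
        for source, destination, diff, relation in pairs
        if citation_degree(diff, relation) >= threshold
    ]


if __name__ == "__main__":
    observed = [
        ("report-1", "report-2", 0.0, 1.0),    # same day, closely related authors
        ("report-1", "memo-7", 300.0, 0.2),    # long gap, weakly related authors
    ]
    print(extract_citations(observed))
```

Under these assumed parameters, a pair written long apart by weakly related authors is filtered out even if its text happens to match, which is the kind of false detection the passage describes excluding.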

Each of the embodiments described above shows the characteristic configurations of the citation relationship extraction system (implicit citation relationship discovery system) listed in (1) to (8) below.

(1) The citation relationship extraction system is characterized by comprising: citation degree calculation means (for example, realized by the citation degree calculation means 202) for calculating a citation degree indicating the likelihood that a citation was made between contents, based on the difference in creation, update, or reference time between the contents and on the degree of relationship between the authors who created, updated, or referenced the contents; and citation relationship extraction means (for example, realized by the controller 100) for extracting a citation relationship between the contents based on the citation degree calculated by the citation degree calculation means.

(2) The citation relationship extraction system may comprise order relation estimation means (for example, realized by the order relation estimation means 201) for estimating the order relation between content that can be a citation source and content that can be a citation destination, and the citation relationship extraction means may be configured to extract a citation relationship between contents based on the citation degree calculated by the citation degree calculation means and the order relation estimated by the order relation estimation means.

(3) In the citation relationship extraction system, the citation degree calculation means may be configured to include: time distance calculation means (for example, realized by the time distance calculation means 2022) for calculating a time distance indicating the difference in creation, update, or reference time between contents; and calculation means (for example, realized by the total calculation means 2024) for calculating the citation degree based on the time distance calculated by the time distance calculation means.

(4) In the citation relationship extraction system, the citation degree calculation means may be configured to include: author distance calculation means (for example, realized by the author distance calculation means 2023) for calculating an author distance indicating the degree of relationship between the authors who created, updated, or referenced the contents; and calculation means (for example, realized by the total calculation means 2024) for calculating the citation degree based on the author distance calculated by the author distance calculation means.

(5) In the citation relationship extraction system, the citation degree calculation means may include similarity calculation means (for example, realized by the similarity calculation means 2021) for calculating the similarity between contents, and the calculation means may be configured to calculate a citation degree that integrates the time distance calculated by the time distance calculation means, the author distance calculated by the author distance calculation means, and the similarity calculated by the similarity calculation means.

(6) In the citation relationship extraction system, the order relation estimation means may include access right order estimation means (for example, realized by the access right judgment means 2011) for estimating the order relation between content that can be a citation source and content that can be a citation destination, based on the access right level set for the content and the access right level set for the author. The access right order estimation means may be configured to estimate a content to be content that can be a citation source when it determines that the access right level set for the author is equal to or higher than the access right level set for that content.

(7) In the citation relationship extraction system, the order relation estimation means may include time order estimation means (for example, realized by the time order judgment means 2012) for estimating the order relation between content that can be a citation source and content that can be a citation destination, based on the creation, update, or reference time of the content. The time order estimation means may be configured to estimate content with an older creation, update, or reference time to be content that can be a citation source, and content with a newer creation, update, or reference time to be content that can be a citation destination.

(8) In the citation relationship extraction system, the time distance calculation means may be configured to calculate a normalized time distance using a normalization parameter (for example, a parameter for converting to a unit time distance) that normalizes the difference in creation, update, or reference time of content according to the content type.
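
Items (7) and (8) can be illustrated together: the older content is taken as the possible citation source, and the time difference is divided by a unit time that depends on the content type. The content types, the unit values in `UNIT_HOURS`, and the function names are hypothetical; the passage only states that a normalization parameter converts the difference to a unit time distance.

```python
from datetime import datetime

# HYPOTHETICAL normalization parameters: the unit time, in hours, for each
# content type. Neither the types nor the values come from the source.
UNIT_HOURS = {
    "email": 24.0,        # one unit corresponds to one day
    "report": 24.0 * 7,   # one unit corresponds to one week
}


def time_order(a: tuple[str, datetime], b: tuple[str, datetime]):
    """Return (source, destination): the older content is the possible
    citation source and the newer one the possible citation destination."""
    return (a, b) if a[1] <= b[1] else (b, a)


def normalized_time_distance(t_source: datetime,
                             t_destination: datetime,
                             content_type: str) -> float:
    """Time difference expressed in units of the content type's unit time."""
    hours = abs((t_destination - t_source).total_seconds()) / 3600.0
    return hours / UNIT_HOURS[content_type]


if __name__ == "__main__":
    older = ("spec", datetime(2007, 12, 1))
    newer = ("summary", datetime(2007, 12, 8))
    source, destination = time_order(newer, older)
    print(source[0], "->", destination[0])
    print(normalized_time_distance(source[1], destination[1], "report"))
    print(normalized_time_distance(source[1], destination[1], "email"))
```

With these assumed units, the same seven-day gap counts as one unit for a report but seven units for an email, which reflects the idea that different content types are reused on different time scales.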

The present invention can be applied to information processing apparatuses that organize in-house content, in-house content search apparatuses, and apparatuses that support in-house performance evaluation. It can also be applied to apparatuses that discover content reuse relationships, clarify original works and sources, and thereby support copyright protection.

A block diagram showing an example of the configuration of the implicit citation relationship discovery system according to the present invention.
An explanatory diagram showing examples of an organization structure graph, the adjacency matrix for that graph, and author information.
A flowchart showing an example of the process by which the implicit citation relationship discovery system extracts implicit citation relationships between contents (passages in electronic documents).
A flowchart showing an example of the order relation estimation process.
An explanatory diagram showing the relationship between the possibility of citation and time distance.
A block diagram showing a configuration example of the implicit citation relationship discovery system in the second embodiment.
A flowchart showing an example of the process by which the implicit citation relationship discovery system in the second embodiment extracts implicit citation relationships between contents (passages in electronic documents).
A flowchart showing an example of the order relation estimation process in the second embodiment.
A block diagram showing a minimum configuration example of the citation relationship extraction system.

Explanation of symbols

100 Controller
101 Document database
102 Organization structure table storage means
103 Access database
104 Virtual citation database
201 Order relation estimation means
2011 Access right judgment means
2012 Time order judgment means
202 Citation degree calculation means
2021 Similarity calculation means
2022 Time distance calculation means
2023 Author distance calculation means
2024 Integrated calculation means
301 Document registration monitoring means

Claims

A citation relationship extraction system comprising: citation degree calculation means for calculating a citation degree indicating the likelihood that a citation was made between contents, based on the difference in creation, update, or reference time between the contents and on the degree of relationship between the authors who created, updated, or referenced the contents; and
citation relationship extraction means for extracting a citation relationship between the contents based on the citation degree calculated by the citation degree calculation means.

The citation relationship extraction system according to claim 1, further comprising order relation estimation means for estimating an order relation between content that can be a citation source and content that can be a citation destination,
wherein the citation relationship extraction means extracts a citation relationship between contents based on the citation degree calculated by the citation degree calculation means and the order relation estimated by the order relation estimation means.

The citation relationship extraction system according to claim 1, wherein the citation degree calculation means includes:
time distance calculation means for calculating a time distance indicating the difference in creation, update, or reference time between contents; and
calculation means for calculating the citation degree based on the time distance calculated by the time distance calculation means.

The citation relationship extraction system according to any one of claims 1 to 3, wherein the citation degree calculation means includes:
author distance calculation means for calculating an author distance indicating the degree of relationship between the authors who created, updated, or referenced the contents; and
calculation means for calculating the citation degree based on the author distance calculated by the author distance calculation means.

The citation relationship extraction system according to claim 4, wherein the citation degree calculation means includes similarity calculation means for calculating the similarity between contents, and
the calculation means calculates a citation degree that integrates the time distance calculated by the time distance calculation means, the author distance calculated by the author distance calculation means, and the similarity calculated by the similarity calculation means.

The order relation estimation means includes access right order estimation means for estimating the order relation between content that can be a citation source and content that can be a citation destination, based on the access right level set for the content and the access right level set for the author,
3. The access right order estimation means estimates the content to be content that can be a citation source when it determines that the access right level set for the author is equal to or higher than the access right level set for the content. Citation relationship extraction system.

The order relation estimation means includes time order estimation means for estimating an order relation between content that can be a citation source and content that can be a citation destination based on the creation, update, or reference time of the content,
The time order estimation means estimates content with an older creation, update, or reference time to be content that can be a citation source, and estimates content with a newer creation, update, or reference time to be content that can be a citation destination. 6. The citation relationship extraction system according to 6.

The time distance calculation means calculates a normalized time distance using a normalization parameter for normalizing the difference in creation, update, or reference time of content according to the content type. The citation relationship extraction system of any one of them.

A citation relationship extraction method comprising: a citation degree calculation step of calculating a citation degree indicating the likelihood that a citation was made between contents, based on the difference in creation, update, or reference time between the contents and on the degree of relationship between the authors who created, updated, or referenced the contents; and
a citation relationship extraction step of extracting a citation relationship between the contents based on the calculated citation degree.

The citation relationship extraction method according to claim 9, further comprising an order relation estimation step of estimating an order relation between content that can be a citation source and content that can be a citation destination,
wherein, in the citation relationship extraction step, a citation relationship between contents is extracted based on the calculated citation degree and the estimated order relation.

The citation relationship extraction method according to claim 9 or 10, wherein, in the citation degree calculation step,
a time distance indicating the difference in creation, update, or reference time between contents is calculated, and
a citation degree is calculated based on the calculated time distance.

The citation relationship extraction method according to any one of claims 9 to 11, wherein, in the citation degree calculation step,
an author distance indicating the degree of relationship between the authors who created, updated, or referenced the contents is calculated, and
a citation degree is calculated based on the calculated author distance.

The citation relationship extraction method according to claim 11, wherein, in the citation degree calculation step,
the similarity between contents is calculated, and
a citation degree is calculated by integrating the calculated time distance, author distance, and similarity.

The citation relationship extraction method according to claim 10, wherein, in the order relation estimation step, based on the access right level set for the content and the access right level set for the author, the content is estimated to be content that can be a citation source when the access right level set for the author is determined to be equal to or higher than the access right level set for the content, whereby the order relation between content that can be a citation source and content that can be a citation destination is estimated.

The citation relationship extraction method according to claim 10 or 14, wherein, in the order relation estimation step, based on the creation, update, or reference time of the content, content with an older creation, update, or reference time is estimated to be content that can be a citation source and content with a newer creation, update, or reference time is estimated to be content that can be a citation destination, whereby the order relation between content that can be a citation source and content that can be a citation destination is estimated.

The citation relationship extraction method according to any one of the above, wherein, in the citation degree calculation step, a normalized time distance is calculated using a normalization parameter for normalizing the difference in creation, update, or reference time of content according to the content type.

A citation relationship extraction program for causing a computer to execute:
a citation degree calculation process of calculating a citation degree indicating the likelihood that a citation was made between contents, based on the difference in creation, update, or reference time between the contents and on the degree of relationship between the authors who created, updated, or referenced the contents; and
a citation relationship extraction process of extracting a citation relationship between the contents based on the calculated citation degree.

The citation relationship extraction program according to claim 17, which causes the computer to further execute
an order relation estimation process of estimating an order relation between content that can be a citation source and content that can be a citation destination,
wherein, in the citation relationship extraction process, a citation relationship between contents is extracted based on the calculated citation degree and the estimated order relation.