JP3425906B2

JP3425906B2 - Document retrieval apparatus, document retrieval method, and computer-readable recording medium recording a program for causing a computer to execute the method

Info

Publication number: JP3425906B2
Application number: JP29890199A
Authority: JP
Inventors: 敏宏安食
Original assignee: 株式会社ジャストシステム
Priority date: 1999-10-20
Filing date: 1999-10-20
Publication date: 2003-07-14
Anticipated expiration: 2019-10-20
Also published as: JP2001117941A

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a document retrieval apparatus that can retrieve, from a plurality of digitized documents, not only the documents that match an input search request but also documents describing events related to the events described in those matching documents, and to a document retrieval method and a computer-readable recording medium recording a program for causing a computer to execute the method.

[0002]

[Prior Art] Document retrieval apparatuses have long been known that store digitized documents as a database in a storage device such as a hard disk and, when an operator inputs a search request, retrieve the documents matching that request and output them to a display screen or the like.

[0003] Many such apparatuses decide whether a document matches a search request by checking whether the input search terms occur in the document. Although this approach imposes little processing load, noise (documents that contain the search term but do not serve the operator's search purpose) easily gets into the results. A retrieval method known as the vector space method addresses this by considering the overall semantic content of a document rather than the presence or absence of particular vocabulary, and so yields results with less noise.

[0004] In this method, an n-dimensional vector that objectively and numerically expresses semantic content is first created for the input natural-language sentence and for each document in the database. The value of each vector element is determined by a predetermined formula from, for example, the frequency with which the corresponding word appears in the document.

[0005] The similarity between the vector of the input sentence and the vector of each document is then taken as the degree to which that document matches the input sentence. For example, the smaller the angle between the two vectors, the higher the similarity, and the higher the document's degree of match. Documents whose degree of match exceeds a predetermined threshold, or a predetermined number of documents taken in descending order of degree of match, are treated as matching documents and are usually shown on a display or the like in that order.
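
The ranking described in [0004] and [0005] can be sketched as follows. This is only an illustration: the text says "a predetermined formula" without naming it, so the cosine of the angle is assumed here, and the function names `cosine_similarity` and `select_matching` and the sample vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def select_matching(query_vec, doc_vecs, threshold=None, top_n=None):
    """Rank documents by similarity, then keep those above threshold and/or the top_n best."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if threshold is not None:
        scored = [(name, score) for name, score in scored if score > threshold]
    if top_n is not None:
        scored = scored[:top_n]
    return scored

# Hypothetical three-term vectors; d1 and d3 point in the same direction as the query.
docs = {"d1": [1, 0, 2], "d2": [0, 3, 0], "d3": [2, 0, 4]}
ranked = select_matching([1, 0, 2], docs, threshold=0.5)
```

A smaller angle gives a cosine closer to 1, so sorting by cosine in descending order matches "the smaller the angle, the higher the similarity".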

[0006]

[Problems to Be Solved by the Invention] However, the prior art described above gives no weight to when each matching document was created or to the temporal relationships among the documents; it only shows how similar each document is to the input search sentence. Consequently, to obtain knowledge about change over time from several retrieved documents on similar subjects, for example the course of an incident or the evolution of policy on some matter, the operator had to do additional work such as sorting the documents by creation date and time.

[0007] Moreover, the prior art presents the operator with nothing about the relationship between matching and non-matching documents, in particular their temporal relationship. Suppose the operator wants to analyze how an event described in a matching document relates to events described in non-matching documents: for example, whether there was a social factor at the time that triggered the incident, or how a policy adopted at that time changed citizens' lives and attitudes. The operator then had to browse documents created around the same time as the matching document and find, entirely by hand, the trends prominent in those documents and the events described in many of them.

[0008] To solve the above problems of the prior art, an object of the present invention is to provide a document retrieval apparatus that can present to the operator, in an easily understood form, the temporal order and co-occurrence relationships between events described in a plurality of digitized documents, together with a document retrieval method and a computer-readable recording medium recording a program for causing a computer to execute the method.

[0009]

[Means for Solving the Problems] To solve the above problems and achieve the object, the document retrieval apparatus according to the invention of claim 1 comprises: first document extracting means for extracting, from a plurality of digitized documents, documents that match an input search request; second document extracting means for acquiring information on the date and time at which each document extracted by the first document extracting means was created and for extracting, from the plurality of documents, documents created within a predetermined period before and after that date and time by referring to the date and time information of the plurality of documents; second document extracting means for extracting, from the plurality of documents, documents that were not extracted by the first document extracting means and that were created within a predetermined period before and after the date and time at which a document extracted by the first document extracting means was created; document similarity calculating means for calculating the similarity between the documents extracted by the second document extracting means; and document classifying means for classifying the documents extracted by the second document extracting means by grouping together documents whose similarity, as calculated by the document similarity calculating means, exceeds a predetermined threshold.

[0010] According to the invention of claim 1, documents created around the same time as the documents matching the input search request are retrieved together with those matching documents.

[0011] The document retrieval method according to the invention of claim 2 includes: a first document extracting step of extracting, from a plurality of digitized documents, documents that match an input search request; a second document extracting step of acquiring information on the date and time at which each document extracted in the first document extracting step was created and of extracting, from the plurality of documents, documents created within a predetermined period before and after that date and time by referring to the date and time information of the plurality of documents; a second document extracting step of extracting, from the plurality of documents, documents that were not extracted by the first document extracting means and that were created within a predetermined period before and after the date and time at which a document extracted in the first document extracting step was created; a document similarity calculating step of calculating the similarity between the documents extracted in the second document extracting step; and a document classifying step of classifying the documents extracted in the second document extracting step by grouping together documents whose similarity, as calculated in the document similarity calculating step, exceeds a predetermined threshold.

[0012] According to the invention of claim 2, documents created around the same time as the documents matching the input search request are retrieved together with those matching documents.

[0013] The recording medium according to the invention of claim 3 records a program for causing a computer to execute the method of claim 2. The program is thereby machine-readable, so that the operations of claim 2 can be carried out by a computer.

[0014]

[Embodiments of the Invention] Preferred embodiments of the document retrieval apparatus, the document retrieval method, and the computer-readable recording medium recording a program for causing a computer to execute the method according to the present invention are described in detail below with reference to the accompanying drawings.

[0015] (Embodiment) First, the hardware configuration of the document retrieval apparatus according to the embodiment of the present invention is described. FIG. 1 is a block diagram showing that hardware configuration.

[0016] In FIG. 1, 101 denotes a CPU that controls the entire system, 102 a ROM storing a boot program and the like, 103 a RAM used as the work area of the CPU 101, 104 an HDD (hard disk drive) that controls reading and writing of data on an HD (hard disk) 105 under the control of the CPU 101, and 105 the HD, which stores data written under the control of the HDD 104.

[0017] 106 denotes an FDD (floppy disk drive) that controls reading and writing of data on an FD (floppy disk) 107 under the control of the CPU 101, and 107 the removable FD, which stores data written under the control of the FDD 106.

[0018] 108 denotes a display that shows a cursor, icons, and toolboxes as well as windows for data such as documents, images, and function information, and 109 an interface (I/F) that is connected to a network NET via a communication line 110 and serves as the interface between the network NET and the inside of the apparatus.

[0019] 111 denotes a keyboard with a plurality of keys for entering characters, numerical values, various instructions, and the like; 112 a mouse for moving the cursor, selecting ranges, moving and resizing windows, and selecting and moving icons; 113 a scanner that reads images optically; 114 a printer that prints the contents displayed in a window and the like; 115 a CD-ROM, which is a removable recording medium; and 116 a CD-ROM drive that controls reading of data from the CD-ROM 115. 100 denotes a bus that connects the above components.

[0020] Next, the functional configuration of the document retrieval apparatus according to the embodiment is described. FIG. 2 is a block diagram showing that configuration functionally. In FIG. 2, the document retrieval apparatus includes a file storage unit 200, a vector calculation unit 201, an input unit 202, a first document extraction unit 203, a second document extraction unit 204, a document classification unit 205, a document sorting unit 206, and a display unit 207. The vector calculation unit 201 in turn includes a query vector creation unit 201a, a document vector creation unit 201b, and a vector similarity calculation unit 201c.

[0021] The file storage unit 200 is implemented by the hard disk 105 and the hard disk drive 104 that reads it, and stores a plurality of digitized documents. These documents are unified in advance into a predetermined file format, specifically the S-JIS format.

[0022] The S-JIS format resembles structured document formats such as SGML. For example, the body of the document follows the <body> tag, the title follows the <title> tag, the author's name follows the <author> tag, the creation date and time follows the <createddate> tag, and the publication date and time (for example, the date and time it appeared in a publication) follows the <pubdate> tag.
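
The tag layout in [0022] could be read with a sketch like the one below. The exact syntax of the format is not given in this text, so the sketch assumes that each field runs from its tag to the next tag, that there are no closing tags, and that dates are written as YYYY-MM-DD. The function `parse_fields` and all sample values are hypothetical.

```python
import re

# Assumption: a field is "<tag>" followed by text up to the next "<".
FIELD_PATTERN = re.compile(r"<(\w+)>([^<]*)")

def parse_fields(text):
    """Return {tag: text following the tag}, with surrounding whitespace removed."""
    return {tag: value.strip() for tag, value in FIELD_PATTERN.findall(text)}

# Made-up sample document using the tags named in the description.
sample = (
    "<title>Sample title\n"
    "<author>Sample Author\n"
    "<createddate>1999-10-20\n"
    "<pubdate>1999-10-21\n"
    "<body>Body text of the sample document.\n"
)
fields = parse_fields(sample)
```

With this layout, later steps that need the title or the creation date only have to look up `fields["title"]` or `fields["createddate"]`.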

[0023] Before a document in a different format, for example a proprietary format produced by commercial word-processing software, is added to the file storage unit 200, its file format must be converted. The document retrieval apparatus of this embodiment has functions (filters) for converting files of various formats into the S-JIS format, but since they are not central to the present invention, a detailed description is omitted.

[0024] In addition to the document files, the file storage unit 200 holds a plurality of search files that the document vector creation unit 201b, described later, uses when creating the document vectors. For every word or phrase contained in the documents, these search files record information such as its frequency of occurrence and the documents in which it occurs.

[0025] The vector calculation unit 201 functions as the "document similarity calculating means" of claim 1 and comprises the query vector creation unit 201a, the document vector creation unit 201b, and the vector similarity calculation unit 201c.

[0026] The query vector creation unit 201a creates, from the search sentence received from the input unit 202 described later, a query vector that numerically expresses the semantic content of that sentence. The structure of the vector is described below.

[0027] The document vector creation unit 201b creates a document vector for each document stored in the file storage unit 200, based on the search files stored in the file storage unit 200.

[0028] A document vector here is a vector with as many element values as there are words and phrases contained in all the stored documents. The pattern of the element values, each corresponding to one word or phrase, gives a numerical grasp of the semantic content of the document.

[0029] In the simplest case, each element value of a document vector is determined by how often the corresponding word or phrase appears in the document. If a word appears once in the document, the corresponding element value is 1; if it appears ten times, the value is 10; if it does not appear at all, the value is 0.
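
The simple counting scheme of [0029] can be sketched as below. The sketch assumes the document has already been split into words (word segmentation is outside the scope of this passage); `term_frequency_vector` and the three-word vocabulary are hypothetical.

```python
from collections import Counter

def term_frequency_vector(doc_tokens, vocabulary):
    """One element per vocabulary word: how many times it occurs in the document."""
    counts = Counter(doc_tokens)
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and a document with "bank" once and "moon" ten times.
vocabulary = ["bank", "moon", "rocket"]
doc = ["bank"] + ["moon"] * 10
vec = term_frequency_vector(doc, vocabulary)  # [1, 10, 0]
```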

[0030] However, even if a document contains the word "bank" many times, the word is not very characteristic of that document if other documents in the same database also contain it many times. Conversely, even if a document contains "bank" only a few times, the word succinctly expresses what distinguishes that document from the others if it does not appear in any other document.

[0031] For this reason, this embodiment calculates each element value not from the simple frequency of the word or phrase but by taking into account the statistical characteristics of where it occurs, that is, how its occurrences are distributed across documents or within a single document. As noted above, this information is recorded in the search files stored in the file storage unit 200, so the document vector creation unit 201b refers to those files to create the document vector of each document.
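
The description does not state which formula is used for this weighting. As a stand-in only, the sketch below uses the common TF-IDF weighting (term count multiplied by the logarithm of the inverse document frequency), which has the property described in [0030]: a word found in every document gets weight 0. The function `tfidf_vectors` and the sample data are hypothetical and are not taken from the patent.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocabulary):
    """Weight each term count by log(N / number of documents containing the term)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        vectors[name] = [
            counts[t] * math.log(n_docs / doc_freq[t]) if doc_freq[t] else 0.0
            for t in vocabulary
        ]
    return vectors

# "bank" occurs in every document, so it distinguishes none of them.
docs = {
    "a": ["bank", "bank", "bank", "loan"],
    "b": ["bank", "bank", "rate"],
    "c": ["bank", "moon"],
}
vocab = ["bank", "loan", "rate", "moon"]
vecs = tfidf_vectors(docs, vocab)
```

Here document "a" contains "bank" three times, yet its weight for "bank" is 0, while the single occurrence of "moon" in document "c" receives a positive weight.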

[0032] The vector similarity calculation unit 201c successively calculates the similarity between the query vector created by the query vector creation unit 201a and each document vector created by the document vector creation unit 201b, or the similarity between the document vectors created for the documents received from the document classification unit 205 described later. Specifically, the similarity of two vectors can be calculated by a predetermined formula based on their inner product.

[0033] When calculating the similarity of the documents received from the document classification unit 205, the unit does not actually compare every combination of their document vectors. Instead, it treats all of those documents as, so to speak, a single document, creates a document vector for that combined document, and calculates the similarity between it and the document vector of each individual document.

[0034] For example, suppose four documents a1, a2, a3, and a4 are received from the document classification unit 205, their document vectors are v1, v2, v3, and v4, and the document vector of the four documents regarded as one document is V. The similarity is then calculated for the four pairs V and v1, V and v2, V and v3, and V and v4.

[0035] This is done so that the vector calculation unit 201, which is designed primarily for calculating the similarity between a query vector and individual document vectors, can also be used for calculating the similarity among individual document vectors. Calculating the similarities for the above pairs is substantially equivalent to calculating them for the six pairs v1 and v2, v1 and v3, v1 and v4, v2 and v3, v2 and v4, and v3 and v4.
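
The comparison against the combined vector V in [0033] to [0035] can be sketched as follows. Two assumptions are made: the vector of the documents "regarded as one document" is taken to be the element-wise sum of the individual vectors (which holds for plain term counts), and cosine similarity stands in for the unspecified formula. The names and sample vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarities_to_combined(doc_vecs):
    """Compare each document vector v_i with V, the vector of all documents taken as one."""
    combined = [sum(column) for column in zip(*doc_vecs.values())]  # assumed: V = v1 + ... + vn
    return {name: cosine_similarity(combined, vec) for name, vec in doc_vecs.items()}

# Hypothetical vectors for a1..a4; a3 shares no terms with the others.
vectors = {"a1": [2, 0, 1], "a2": [1, 0, 1], "a3": [0, 3, 0], "a4": [2, 0, 2]}
sims = similarities_to_combined(vectors)
```

This needs four comparisons for four documents, where comparing every pair would need six. In the sample data the outlier a3 scores lowest against V.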

[0036] Because each vector is a numerical expression of the semantic content of a search sentence or a document, the similarity between vectors is taken as the similarity in semantic content between the search sentence and the document from which the vectors were derived (in other words, the degree to which the document matches the search sentence), or as the similarity in semantic content between the documents from which the vectors were derived.

[0037] The similarity between a search sentence and a document, or between documents, calculated in this way is output to the functional unit that instructed the vector calculation unit 201 to perform the processing, namely the first document extraction unit 203 or the document classification unit 205, both described later.

[0038] The input unit 202 is implemented by the keyboard 111 and the mouse 112 and receives various commands from the operator. To retrieve documents related to some matter from the documents stored in the file storage unit 200, the operator enters a search sentence in natural language and also specifies whether to retrieve only the documents that match the search sentence or to retrieve, in addition, documents created around the same time as those documents (not necessarily at exactly the same time; documents created close in time are included).

[0039] The processing is described separately below for two cases: (1) as an example of the former, the operator enters the search sentence "Apollo 11 landed on the moon" and instructs the apparatus to retrieve only the documents matching it; (2) as an example of the latter, the operator enters the search sentence "sluggish gross domestic product" and instructs the apparatus to retrieve the matching documents and the documents created around the same time.

[0040] Whether the input command is (1) or (2), the input unit 202 outputs it to the first document extraction unit 203 described later. If the input command is neither (1) nor (2), it is output to another appropriate functional unit (not shown). Determining the type of command and routing it to its destination is done, for example, by a command analysis unit (not shown).

[0041] The first document extraction unit 203 extracts the search sentence from the command received from the input unit 202 and outputs it to the vector calculation unit 201. The vector calculation unit 201 creates the query vector and the document vectors as described above, calculates their similarities, and outputs the result to the first document extraction unit 203.

[0042] Based on the result received from the vector calculation unit 201 and a preset criterion, the first document extraction unit 203 extracts some of the documents stored in the file storage unit 200: for example, the 20 documents with the highest degree of match to the search sentence, or all documents whose degree of match exceeds a predetermined threshold. The documents extracted in this way by the first document extraction unit 203 are referred to below as "matching documents".

[0043] In this embodiment the extraction criterion is held in advance in the first document extraction unit 203, but the operator may instead be allowed to enter a criterion, such as the top 20 documents, together with the search sentence.

[0044] The first document extraction unit 203 further searches the file storage unit 200 to obtain the titles of the matching documents. Because the documents in the file storage unit 200 are unified into the S-JIS format as described above, the title of a document can be obtained by searching for its <title> tag.

[0045] Assume here that the processing for instruction (1) extracted four documents titled "Apollo 11 lunar landing", "Apollo 10 launch", "Apollo 13 failure report", and "Apollo 1 fire", and that the processing for instruction (2) extracted three documents titled "GDP: negative growth of 11.2%", "FY1997 GDP growth rate: D Research Institute forecasts -0.5%", and "'Japan growth rate 1.1%': IMF revises downward".

[0046] If the instruction received from the input unit 202 was (1), the first document extraction unit 203 outputs the titles and file names of the matching documents to the document sorting unit 206 described later. (Here and below, anything that uniquely identifies a file, such as an identification number, may be used in place of the file name.) If the instruction was (2), it outputs the titles and file names of the matching documents to the document sorting unit 206 and outputs the file names alone to the second document extraction unit 204 described later.

[0047] The second document extraction unit 204 searches the file storage unit 200 using the file names received from the first document extraction unit 203 and obtains the creation date and time of each matching document from its <createddate> tag. It then retrieves the documents created within a predetermined period spanning the creation date and time of each matching document, for example the three days consisting of the creation day, the day before, and the day after, by referring to the <createddate> tag of each document stored in the file storage unit 200.
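
The date-window lookup of [0047] can be sketched as follows. The sketch assumes the creation dates have already been read from the <createddate> tags into `datetime.date` values, and, following the wording of claim 1, it leaves matching documents out of the result. The function `documents_near`, the file names, and the dates are all hypothetical.

```python
from datetime import date, timedelta

def documents_near(created, matching, days_before=1, days_after=1):
    """Map each matching document to the non-matching documents created within its window."""
    result = {}
    for m in matching:
        start = created[m] - timedelta(days=days_before)
        end = created[m] + timedelta(days=days_after)
        result[m] = sorted(
            name for name, d in created.items()
            if name not in matching and start <= d <= end
        )
    return result

# Made-up creation dates; "A" and "B" are the matching documents.
created = {
    "A": date(1999, 10, 20),
    "B": date(1999, 11, 5),
    "a1": date(1999, 10, 19),
    "a2": date(1999, 10, 21),
    "b1": date(1999, 11, 6),
    "x": date(1999, 10, 25),
}
nearby = documents_near(created, {"A", "B"})
```

The default window of one day on each side corresponds to the three-day example in the text. Keying the result by matching document keeps track of which matching document each nearby document was extracted for.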

[0048] In this embodiment the period to be searched is held in advance in the second document extraction unit 204, but the operator may instead be allowed to specify it, for example as the creation day of the matching document and one day before and after, together with the search sentence.

【００４９】そして第２の文書抽出部２０４は、上記の
ようにして取得した文書のファイル名を、当該文書がい
ずれの適合文書と同時期に作成された文書であるか（言
い換えれば、いずれの適合文書に対応して抽出された文
書であるか）が区別できるような情報を付して、後述す
る文書分類部２０５に対して出力する。The second document extraction unit 204 then outputs the file names of the documents obtained in this way to the document classification unit 205, described later, attaching information that identifies which matching document each document was created at the same time as (in other words, which matching document it was extracted for).

【００５０】文書分類部２０５は、第２の文書抽出部２
０４から入力したファイル名をベクトル計算部２０１に
対して出力し、それらのファイル名を有する文書の相互
の類似度、すなわちそれらの文書ベクトルの相互の類似
度を算出するように指示する。ベクトル計算部２０１は
上述のようにしてそれらの文書の類似度を算出し、処理
結果を文書分類部２０５に対して出力する。The document classification unit 205 outputs the file names input from the second document extraction unit 204 to the vector calculation unit 201 and instructs it to calculate the mutual similarity of the documents having those file names, that is, the mutual similarity of their document vectors. The vector calculation unit 201 calculates the similarity of those documents as described above and outputs the result to the document classification unit 205.

【００５１】なおこの類似度の算出は、同じ適合文書に
対応する文書間でおこなう。たとえば適合文書Ａと同時
期に作成された文書としてａ１、ａ２、ａ３およびａ４
の４件が、適合文書Ｂと同時期に作成された文書として
ｂ１、ｂ２およびｂ３の３件が、それぞれ第２の文書抽
出部２０４から入力されたとすると、類似度の算出はａ
１、ａ２、ａ３およびａ４の文書群内と、ｂ１、ｂ２お
よびｂ３の文書群内とで独立しておこない、文書ａ１と
ｂ１、ａ２とｂ３等の文書群をまたがる類似度の算出は
おこなわない。The similarity is calculated only between documents that correspond to the same matching document. For example, suppose the second document extraction unit 204 supplies four documents a1, a2, a3, and a4 as documents created at the same time as matching document A, and three documents b1, b2, and b3 as documents created at the same time as matching document B. The similarity is then calculated independently within the group a1, a2, a3, a4 and within the group b1, b2, b3; no similarity is calculated across the groups, such as between a1 and b1 or between a2 and b3.

【００５２】文書分類部２０５は、ベクトル計算部２０
１から入力した処理結果にもとづいて、類似度が所定の
しきい値を超えた文書どうしを一つのグループにまとめ
ることにより、第２の文書抽出部２０４によって検索さ
れた文書を、いくつかの内容的に類似した文書のグルー
プに分類する。Based on the result input from the vector calculation unit 201, the document classification unit 205 puts documents whose similarity exceeds a predetermined threshold into one group, thereby classifying the documents retrieved by the second document extraction unit 204 into several groups of documents with similar content.

【００５３】たとえば、適合文書Ａについて第２の文書
抽出部２０４により検索された文書がａ１、ａ２、ａ３
およびａ４の４件あり、ベクトル計算部２０１によりそ
れぞれの文書ベクトルの類似度がｖ１とｖ２とは５０
０、ｖ１とｖ３とは２０、ｖ１とｖ４とは４００、ｖ２
とｖ３とは５０、ｖ２とｖ４とは３５０、ｖ３とｖ４と
は１００と算出されたとして、文書分類部２０５であら
かじめ保持しているしきい値が３００であったとする
と、これら４つの文書はａ１、ａ２およびａ４からなる
グループと、ａ３からなるグループとの、二つのグルー
プに分類される。For example, suppose the second document extraction unit 204 retrieved four documents a1, a2, a3, and a4 for matching document A, and the vector calculation unit 201 calculated the similarities of their document vectors v1 to v4 as 500 for v1 and v2, 20 for v1 and v3, 400 for v1 and v4, 50 for v2 and v3, 350 for v2 and v4, and 100 for v3 and v4. If the threshold held in advance by the document classification unit 205 is 300, the four documents are classified into two groups: one consisting of a1, a2, and a4, and one consisting of a3.
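
The worked example in paragraph [0053] can be reproduced with a short sketch. The text only says that documents whose similarity exceeds the threshold are put into one group; the union-find (single-link) merging used here is an assumption about how chains of similar documents are handled, and the names `group_by_threshold` and `SIM` are hypothetical.

```python
def group_by_threshold(names, similarity, threshold):
    """Group documents; any pair scoring above the threshold shares a group."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the group representative, shortening the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (x, y), score in similarity.items():
        if score > threshold:
            parent[find(x)] = find(y)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    # Order groups by the position of their first member in the input list.
    return sorted(groups.values(), key=lambda g: names.index(g[0]))


# Pairwise similarities from paragraph [0053].
SIM = {
    ("a1", "a2"): 500, ("a1", "a3"): 20, ("a1", "a4"): 400,
    ("a2", "a3"): 50, ("a2", "a4"): 350, ("a3", "a4"): 100,
}

groups = group_by_threshold(["a1", "a2", "a3", "a4"], SIM, 300)
# groups == [["a1", "a2", "a4"], ["a3"]]
```

With the threshold of 300, the pairs a1/a2, a1/a4, and a2/a4 all qualify, so those three documents end up together and a3 remains alone, matching the two groups described in the text.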

【００５４】そして文書分類部２０５は、上記の処理結
果を第２の文書抽出部２０４に対して出力する。第２の
文書抽出部２０４は、文書分類部２０５から入力したグ
ループの中から、あらかじめ設定された基準にしたがっ
ていずれかのグループを抽出する。たとえば、当該グル
ープに属する文書数の最も多いグループを一つ、あるい
は二つ以上の文書を含むグループすべて、等を抽出す
る。ここでは適合文書Ａについて、文書ａ１、ａ２およ
びａ４を含むグループのみが抽出されたものとする。The document classification unit 205 then outputs this result to the second document extraction unit 204. The second document extraction unit 204 selects one or more of the groups input from the document classification unit 205 according to a preset criterion, for example the single group containing the most documents, or every group containing two or more documents. Here it is assumed that, for matching document A, only the group containing documents a1, a2, and a4 is selected.
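
The two selection criteria mentioned in paragraph [0054] could look like this in code. Both function names are hypothetical, and how ties between equally large groups are resolved is not stated in the text; this sketch simply keeps the first such group.

```python
def largest_group(groups):
    """Criterion 1: the single group with the most documents (first one on ties)."""
    return max(groups, key=len)


def groups_with_at_least(groups, minimum=2):
    """Criterion 2: every group that contains at least `minimum` documents."""
    return [g for g in groups if len(g) >= minimum]


# Groups from the example for matching document A.
example_groups = [["a1", "a2", "a4"], ["a3"]]
selected = largest_group(example_groups)
```

For the example groups, both criteria select only the group containing a1, a2, and a4.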

【００５５】さらに第２の文書抽出部２０４は、抽出し
たグループに属する各文書、上記の例では文書ａ１、ａ
２およびａ４について、ファイル記憶部２００を検索し
て<title>タグに記述されている当該文書の表題を取得
する。The second document extraction unit 204 then searches the file storage unit 200 for each document in the selected group (documents a1, a2, and a4 in the example above) and obtains the title of the document from its <title> tag.

【００５６】ここではたとえば、上述（２）の指示に応
じて第１の文書抽出部２０３により抽出された適合文書
「ＧＤＰ、１１．２％マイナス成長」と、同時期に作成
された文書の表題として、「東証平均株価１万８０００
円割る」「東証１万８０００円割れ難題次々市場に
弱気も」および「東証株価全面安に」が取得されたもの
とする。Here, for example, suppose that for the matching document "GDP: 11.2% negative growth", extracted by the first document extraction unit 203 in response to instruction (2) above, the titles obtained for documents created at the same time are "TSE stock average falls below 18,000 yen", "TSE below 18,000 yen: problems mount, market turns bearish", and "TSE stocks fall across the board".

【００５７】また適合文書「９７年度ＧＤＰ成長率Ｄ
総研は−０．５％予測」についても、当該文書と同時期
に作成された文書の表題として「有事の対米協力拡大
日米、新防衛指針に合意」および「中国外相新防衛指
針でけん制」が、また適合文書「「日本成長率１．１
％」ＩＭＦが下方修正」についても、当該文書と同時期
に作成された文書の表題として「Ｙ証券元専務ら５人
きょう聴取」および「Ｙ証券元専務ら５人逮捕」が、
それぞれ取得されたものとする。Likewise, suppose that for the matching document "FY1997 GDP growth rate: D Research Institute forecasts -0.5%", the titles obtained for documents created at the same time are "Expanded cooperation with the US in emergencies: Japan and US agree on new defense guidelines" and "Chinese foreign minister warns over new defense guidelines", and that for the matching document "'Japan growth rate 1.1%': IMF revises downward", the titles obtained are "Five people including former Y Securities senior managing director to be questioned today" and "Five people including former Y Securities senior managing director arrested".

【００５８】そして第２の文書抽出部２０４は、上記の
ようにして取得した文書の表題とファイル名とを、後述
する文書並び替え部２０６に対して出力する。The second document extraction unit 204 then outputs the titles and file names of the documents obtained in this way to the document sorting unit 206, described later.

【００５９】文書並び替え部２０６は、第１の文書抽出
部２０３および第２の文書抽出部２０４から入力したフ
ァイル名によってファイル記憶部２００を検索し、<cre
ateddate>タグに記載されたそれら文書の作成日時を取
得する。そしてまず第１の文書抽出部２０３から入力し
た適合文書の表題を、その作成日時の古い順に並び替え
る。The document sorting unit 206 searches the file storage unit 200 using the file names input from the first document extraction unit 203 and the second document extraction unit 204, and obtains the creation date and time of each document from its <createddate> tag. It first sorts the titles of the matching documents input from the first document extraction unit 203 from the oldest creation date to the newest.

【００６０】たとえば、上述（１）の指示に応じて第１
の文書抽出部２０３により抽出された上記４件の文書の
作成日時が、それぞれ１９６９年７月２０日、１９６９
年５月１８日、１９７０年４月１３日、１９６７年１月
２７日であったとすると、上記４件の文書の表題を「ア
ポロ１号火災」「アポロ１０号打ち上げ」「アポロ１１
号月面着陸」「アポロ１３号故障報告」の順に並び替え
る。For example, if the creation dates of the four documents extracted by the first document extraction unit 203 in response to instruction (1) above are July 20, 1969, May 18, 1969, April 13, 1970, and January 27, 1967, respectively, their titles are sorted into the order "Apollo 1 fire", "Apollo 10 launch", "Apollo 11 moon landing", "Apollo 13 failure report".

【００６１】同様に、上述（２）の指示に応じて第１の
文書抽出部２０３により取得された上記３件の文書の作
成日時が、それぞれ１９９７年９月１２日、同２５日、
同１７日であったとすると、上記３件の文書の表題を、
「ＧＤＰ、１１．２％マイナス成長」「「日本成長率
１．１％」ＩＭＦが下方修正」「９７年度ＧＤＰ成長率
Ｄ総研は−０．５％予測」の順に並び替える。Similarly, if the creation dates of the three documents obtained by the first document extraction unit 203 in response to instruction (2) above are September 12, 1997, September 25, 1997, and September 17, 1997, respectively, their titles are sorted into the order "GDP: 11.2% negative growth", "'Japan growth rate 1.1%': IMF revises downward", "FY1997 GDP growth rate: D Research Institute forecasts -0.5%".
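
The ordering of matching documents by creation date, using the three titles and dates from paragraph [0061], can be sketched as below. The English titles are translations, and representing each document as a (title, date) tuple is an assumption made for illustration.

```python
from datetime import date

# Matching documents in extraction order, paired with their creation dates.
matching = [
    ("GDP: 11.2% negative growth", date(1997, 9, 12)),
    ("FY1997 GDP growth rate: D Research Institute forecasts -0.5%", date(1997, 9, 25)),
    ("'Japan growth rate 1.1%': IMF revises downward", date(1997, 9, 17)),
]

# Oldest creation date first.
ordered = sorted(matching, key=lambda item: item[1])
titles = [title for title, _ in ordered]
```

The IMF article, dated September 17, moves ahead of the D Research Institute article, dated September 25, which gives the order stated in the text.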

【００６２】つぎに文書並び替え部２０６は、第２の文
書抽出部２０４から入力した文書の表題を並び替える。
これらはまず対応する適合文書の作成日時の古い順に並
び替えられ、同じ適合文書に対応する文書の中では意味
内容の類似するグループごとに並び替えられ、さらに同
じグループに属する文書の中では当該文書自体の作成日
時の古い順に並び替えられる。Next, the document sorting unit 206 sorts the titles of the documents input from the second document extraction unit 204. They are sorted first by the creation date of the corresponding matching document, oldest first; among documents corresponding to the same matching document, by group of similar content; and among documents in the same group, by the creation date of the document itself, oldest first.
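
The three-level ordering in paragraph [0062] maps naturally onto a composite sort key. In this sketch the record type `Related`, its field names, and the integer `group` number are all hypothetical; the text does not say how groups are ordered relative to each other, so a plain group number is assumed.

```python
from datetime import date
from typing import NamedTuple


class Related(NamedTuple):
    title: str
    matching_created: date  # creation date of the matching document it belongs to
    group: int              # hypothetical group number from the classification step
    created: date           # the document's own creation date


def sort_related(docs):
    """Sort by matching document date, then group, then own creation date."""
    return sorted(docs, key=lambda d: (d.matching_created, d.group, d.created))


docs = [
    Related("w", date(1997, 9, 25), 0, date(1997, 9, 25)),
    Related("y", date(1997, 9, 12), 0, date(1997, 9, 13)),
    Related("z", date(1997, 9, 12), 1, date(1997, 9, 11)),
    Related("x", date(1997, 9, 12), 0, date(1997, 9, 11)),
]
order = [d.title for d in sort_related(docs)]
```

Because tuples compare element by element, the later keys only matter when the earlier ones are equal, which is exactly the nesting the paragraph describes.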

【００６３】このようにして第１の文書抽出部２０３お
よび第２の文書抽出部２０４から入力した文書の表題を
並び替えると、文書並び替え部２０６はこれらの文書の
表題をその作成日時と対応づけて、後述する表示部２０
７に対して出力する。Having sorted the titles of the documents input from the first document extraction unit 203 and the second document extraction unit 204 in this way, the document sorting unit 206 associates each title with its creation date and time and outputs them to the display unit 207, described later.

【００６４】表示部２０７は、具体的にはディスプレイ
１０８や表示用のメモリ等により実現され、文書並び替
え部２０６から入力した文書の表題と作成日時とを表示
画面上に表示する。The display unit 207 is realized specifically by the display 108, a display memory, and the like, and displays on the display screen the titles and creation dates of the documents input from the document sorting unit 206.

【００６５】図３は、入力部２０２から上述（１）の指
示が入力された場合に、最終的に表示部２０７によって
表示される画面の一例を示す説明図である。第１の文書
抽出部２０３によって抽出された適合文書の表題と作成
日時とが、文書並び替え部２０６によってその作成日時
の順に並び替えられて表示される。FIG. 3 is an explanatory diagram showing an example of a screen finally displayed by the display unit 207 when the instruction (1) is input from the input unit 202. The titles and the creation dates and times of the matching documents extracted by the first document extraction unit 203 are sorted and displayed by the document sorting unit 206 in the order of the creation dates and times.

【００６６】また図４は、入力部２０２から上述（２）
の指示が入力された場合に、最終的に表示部２０７によ
って表示される画面の一例を示す説明図である。適合文
書の表題と作成日時とが、その作成日時の順に並び替え
られるとともに、第２の文書抽出部２０４により抽出さ
れた、それぞれの適合文書と同時期に作成された文書の
表題と作成日時とが、その作成日時等の順にあわせて表
示される。FIG. 4 is an explanatory diagram showing an example of the screen finally displayed by the display unit 207 when instruction (2) above is input from the input unit 202. The titles and creation dates of the matching documents are sorted by creation date, and the titles and creation dates of the documents extracted by the second document extraction unit 204, that is, the documents created at the same time as each matching document, are displayed alongside them, ordered by creation date and the like.

【００６７】なお本実施の形態では、適合文書と同時期
に作成された文書を意味的に類似するいくつかのグルー
プに分類して、ある程度顕著なグループ（たとえば、そ
れに属する文書数が多い等）に属する文書のみを抽出し
て操作者に提示するようにしたが、適合文書と同時期に
作成された文書をすべて表示するようにしてもよい。In the present embodiment, the documents created at the same time as a matching document are classified into several semantically similar groups, and only documents belonging to a reasonably prominent group (for example, one containing many documents) are extracted and presented to the operator. Alternatively, all documents created at the same time as the matching document may be displayed.

【００６８】また本実施の形態では、適合文書と同時期
に作成された文書を表示するようにしたが、適合文書と
同時期に公表された文書を表示するようにしてもよい。
これを実現するには、第２の文書抽出部２０４におい
て、<createddate>タグの代わりに<pubdate>タグを参照
するようにすればよい。上記の作成日時や公表日時のほ
か、任意のタグに記述された送信日時や受信日時、最終
更新日時等を参照して、たとえば適合文書が送信された
のと同時期に送信された文書を表示するように操作者が
指定できるようにしてもよい。In the present embodiment, documents created at the same time as a matching document are displayed, but documents published at the same time as a matching document may be displayed instead. To do so, the second document extraction unit 204 refers to the <pubdate> tag instead of the <createddate> tag. Besides the creation and publication dates, the operator may also be allowed to specify that a transmission date, reception date, last update date, or similar value written in an arbitrary tag be used, for example to display documents transmitted at the same time as a matching document was transmitted.
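
Switching the reference date from creation to publication, as paragraph [0068] suggests, amounts to making the tag name a parameter. The following is a minimal sketch under assumptions: the `<doc>` root element, the ISO date format, and the function name `read_date` are hypothetical, since the text names only the `<title>`, `<createddate>`, and `<pubdate>` tags.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical document; only the tag names come from the description.
DOC = """<doc>
  <title>Sample</title>
  <createddate>1997-09-12</createddate>
  <pubdate>1997-09-13</pubdate>
</doc>"""


def read_date(xml_text, tag="createddate"):
    """Read the date stored in the given child tag of the document root."""
    root = ET.fromstring(xml_text)
    return date.fromisoformat(root.findtext(tag).strip())


created = read_date(DOC)
published = read_date(DOC, tag="pubdate")
```

The same parameter could carry the name of a transmission, reception, or last-update tag if the operator specifies one.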

【００６９】なおファイル記憶部２００、ベクトル計算
部２０１、入力部２０２、第１の文書抽出部２０３、第
２の文書抽出部２０４、文書分類部２０５、文書並び替
え部２０６および表示部２０７は、それぞれＲＯＭ１０
２、ＲＡＭ１０３またはハードディスク１０５、フロッ
ピーディスク１０７等の記録媒体に記録されたプログラ
ムに記載された命令にしたがってＣＰＵ１０１等が命令
処理を実行することにより、各部の機能を実現するもの
である。The file storage unit 200, vector calculation unit 201, input unit 202, first document extraction unit 203, second document extraction unit 204, document classification unit 205, document sorting unit 206, and display unit 207 each realize their functions when the CPU 101 or the like executes instructions written in a program recorded on a recording medium such as the ROM 102, the RAM 103, the hard disk 105, or the floppy disk 107.

【００７０】つぎに、本実施の形態にかかる文書検索装
置の一連の処理について説明する。図５は、本実施の形
態にかかる文書検索装置の処理の手順を示すフローチャ
ートである。入力部２０２から入力された指示はまず図
示しないコマンド解析部によって解析され、当該指示が
ファイル記憶部２００からの文書の検索であると判定さ
れた場合に、本フローチャートによる処理を開始する。Next, a series of processes of the document search device according to the present embodiment will be described. FIG. 5 is a flowchart showing a procedure of processing of the document search device according to the present embodiment. The instruction input from the input unit 202 is first analyzed by a command analysis unit (not shown), and when it is determined that the instruction is a document search from the file storage unit 200, the process according to this flowchart is started.

【００７１】図５のフローチャートにおいて、まずステ
ップＳ５０１で、第１の文書抽出部２０３および第１の
文書抽出部２０３により呼び出されたベクトル計算部２
０１によって、ファイル記憶部２００に記憶された文書
の中から、入力された検索文に適合する文書を抽出す
る。続くステップＳ５０２において、ステップＳ５０１
で抽出された文書の表題をファイル記憶部２００から取
得し、それらの文書の表題とファイル名とを文書並び替
え部２０６に対して出力する。In the flowchart of FIG. 5, first, in step S501, the first document extraction unit 203 and the vector calculation unit 201 called by it extract, from the documents stored in the file storage unit 200, the documents that match the input search sentence. In the following step S502, the titles of the documents extracted in step S501 are obtained from the file storage unit 200, and the titles and file names of those documents are output to the document sorting unit 206.

【００７２】ステップＳ５０３において、入力された指
示の中で、適合文書のみを検索するか、あるいは適合文
書とともにそれらと同時期に作成された文書も検索する
か、のいずれが指定されているかを判定する。そして適
合文書のみを検索する旨の指示であると判定されたとき
は（ステップＳ５０３肯定）、ステップＳ５０８に移行
する。In step S503, it is determined whether the input instruction specifies searching for the matching documents only, or searching for the matching documents together with documents created at the same time. If it specifies searching for the matching documents only (Yes in step S503), the process proceeds to step S508.

【００７３】また、適合文書とともにそれらと同時期に
作成された文書も検索する旨の指示であると判定された
ときは（ステップＳ５０３否定）、ステップＳ５０４に
おいて、第２の文書抽出部２０４によって、ファイル記
憶部２００に記憶された文書の中からステップＳ５０１
で抽出された適合文書と同時期に作成された文書を検索
し、検索された文書のファイル名を文書分類部２０５に
対して出力する。If it is determined that the instruction specifies searching for the matching documents together with documents created at the same time (No in step S503), then in step S504 the second document extraction unit 204 searches the documents stored in the file storage unit 200 for documents created at the same time as the matching documents extracted in step S501, and outputs the file names of the retrieved documents to the document classification unit 205.

【００７４】ステップＳ５０５において、文書分類部２
０５により呼び出されたベクトル計算部２０１によっ
て、ステップＳ５０４で抽出された文書どうしの類似度
を算出する。そしてステップＳ５０６において、類似度
が所定のしきい値を超えた文書どうしをグループとして
まとめることにより、ステップＳ５０４で抽出された文
書を分類し、処理結果を第２の文書抽出部２０４に対し
て出力する。In step S505, the vector calculation unit 201, called by the document classification unit 205, calculates the similarity between the documents extracted in step S504. In step S506, the documents extracted in step S504 are classified by grouping together documents whose similarity exceeds a predetermined threshold, and the result is output to the second document extraction unit 204.

【００７５】ステップＳ５０７において、第２の文書抽
出部２０４によって、文書分類部２０５から入力したい
ずれか（またはすべて）のグループに属する文書の表題
をファイル記憶部２００から取得する。そしてこれらの
文書の表題とファイル名とを、文書並び替え部２０６に
対して出力する。In step S507, the second document extraction unit 204 obtains from the file storage unit 200 the titles of the documents belonging to one (or all) of the groups input from the document classification unit 205. It then outputs the titles and file names of these documents to the document sorting unit 206.

【００７６】ステップＳ５０８において、文書並び替え
部２０６によって、第１の文書抽出部２０３および第２
の文書抽出部２０４から入力した文書の作成日時をファ
イル記憶部２００を検索して取得する。さらにステップ
Ｓ５０９において、それらの文書の表題および作成日時
をその作成日時等の順に並び替え、処理結果を表示部２
０７に対して出力する。In step S508, the document sorting unit 206 searches the file storage unit 200 to obtain the creation dates of the documents input from the first document extraction unit 203 and the second document extraction unit 204. In step S509, the titles and creation dates of those documents are sorted by creation date and the like, and the result is output to the display unit 207.

【００７７】そしてステップＳ５１０において、文書並
び替え部２０６から入力した処理結果を表示画面に表示
して、本フローチャートによる処理を終了する。Then, in step S510, the result input from the document sorting unit 206 is displayed on the display screen, and the processing of this flowchart ends.
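
The flow of steps S503 to S509 can be condensed into one function for orientation. This is a simplified sketch rather than the described implementation: step S501 (vector-space matching) is assumed to have already produced the list `matching`, the `similar` callback stands in for the thresholded similarity of the vector calculation unit 201, the grouping is a greedy approximation that does not merge groups later bridged by a new document, and all names are hypothetical.

```python
from datetime import date, timedelta


def run_search(matching, created, similar, with_related, days=1):
    """Return file names in display order, following steps S503 to S509."""
    related = {}
    if with_related:  # S503: False means "matching documents only"
        window = timedelta(days=days)
        for name in matching:
            # S504: documents created in the same period as this matching document.
            nearby = [n for n in created
                      if n not in matching
                      and abs(created[n] - created[name]) <= window]
            # S505-S506: greedy grouping of mutually similar documents.
            groups = []
            for n in nearby:
                home = next((g for g in groups
                             if any(similar(n, m) for m in g)), None)
                if home is None:
                    groups.append([n])
                else:
                    home.append(n)
            # S507: keep only the largest group, if there is one.
            related[name] = max(groups, key=len) if groups else []
    # S508-S509: matching documents oldest first, each followed by its group.
    rows = []
    for name in sorted(matching, key=created.get):
        rows.append(name)
        rows.extend(sorted(related.get(name, []), key=created.get))
    return rows  # S510 would hand these rows to the display


created = {
    "m1": date(1997, 9, 12), "m2": date(1997, 9, 25),
    "x": date(1997, 9, 11), "y": date(1997, 9, 13),
    "z": date(1997, 9, 12), "w": date(1997, 9, 25),
}
similar = lambda a, b: {a, b} == {"x", "y"}

rows_all = run_search(["m2", "m1"], created, similar, with_related=True)
rows_only = run_search(["m2", "m1"], created, similar, with_related=False)
```

In this toy data, x and y form the largest group around m1 while z is dropped, and w is the only document near m2.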

【００７８】以上説明したように本実施の形態によれ
ば、ある検索要求に適合する文書、言い換えれば同一ま
たは類似する事象や主題について記述した文書が、その
作成日時順に表示されるので、操作者はある事柄に関す
る経緯等を容易に把握することができる。As described above, according to the present embodiment, the documents that match a given search request, in other words the documents describing the same or a similar event or subject, are displayed in order of creation date, so the operator can easily grasp how a given matter developed.

【００７９】また適合文書と同時期に作成された文書
が、意味内容の類似するグループに分類して表示される
ので、ある事象と前後して起こった事象が何であったか
を容易に知ることができる。Further, since the documents created at the same time as a matching document are classified into groups of similar content and displayed, the operator can easily find out what events occurred around the time of a given event.

【００８０】事象の発生時期が近いことは単なる偶然の
可能性もあるが、すでに発見されている、あるいはまだ
発見されていないなんらかの関連性を示唆している可能
性もあるので、このように適合文書と同時期に作成され
た文書もあわせて検索することにより、事象間の関連性
を調査しようとする操作者に有益な情報を提供すること
ができる。The fact that the time of occurrence of the event is close may be just a coincidence, but it may indicate some relation that has already been discovered or has not yet been discovered. By searching for the documents created at the same time as the documents, it is possible to provide useful information to the operator who is going to investigate the relation between the events.

【００８１】なお、本実施の形態で説明した文書検索方
法は、あらかじめ用意されたプログラムをパーソナルコ
ンピュータやワークステーション等のコンピュータで実
行することにより実現される。このプログラムは、ハー
ドディスク、フロッピーディスク、ＣＤ−ＲＯＭ、Ｍ
Ｏ、ＤＶＤ等のコンピュータで読み取り可能な記録媒体
に記録され、コンピュータによって記録媒体から読み出
されて実行される。またこのプログラムは、上記記録媒
体を介して、インターネット等のネットワークを介して
配布することができる。The document search method described in this embodiment is realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. The program is recorded on a computer-readable recording medium such as a hard disk, floppy disk, CD-ROM, MO, or DVD, and is executed by being read from the recording medium by the computer. The program can also be distributed via such a recording medium or via a network such as the Internet.

【００８２】[0082]

【発明の効果】以上説明したように、請求項１の発明に
よれば、第１の文書抽出手段が電子化された複数の文書
の中から入力された検索要求に適合する文書を抽出し、
第２の文書抽出手段が、前記第１の文書抽出手段により
抽出された文書が作成された日時に関する情報を取得
し、取得された情報にかかる日時の前後の所定期間にお
いて作成された文書を、前記複数の文書の中から、当該
複数の文書の日時に関する情報を参照して抽出し、文書
類似度算出手段が前記第２の文書抽出手段により抽出さ
れた文書どうしの類似度を算出し、文書分類手段が前記
文書類似度算出手段により算出された類似度が所定のし
きい値を超えた文書どうしをまとめることにより前記第
２の文書抽出手段により抽出された文書を分類するた
め、入力された検索要求に適合する文書とともに、当該
適合文書と同時期に作成された文書もあわせて検索さ
れ、これによって、電子化された複数の文書に記載され
た事象間の、時間的な共起関係を操作者に分かりやすく
提示することが可能な文書検索装置が得られるという効
果を奏する。As described above, according to the invention of claim 1, the first document extracting means extracts, from a plurality of digitized documents, the documents that match an input search request; the second document extracting means acquires information on the date and time at which each document extracted by the first document extracting means was created and extracts, from the plurality of documents, the documents created within a predetermined period before and after that date and time by referring to the date-and-time information of the plurality of documents; the document similarity calculating means calculates the similarity between the documents extracted by the second document extracting means; and the document classifying means classifies the documents extracted by the second document extracting means by grouping together documents whose similarity, as calculated by the document similarity calculating means, exceeds a predetermined threshold. As a result, documents created at the same time as the matching documents are retrieved together with the documents that match the input search request, which yields a document retrieval apparatus capable of presenting to the operator, in an easily understood form, the temporal co-occurrence relationships between events described in the plurality of digitized documents.

【００８３】また、請求項２の発明によれば、第１の文
書抽出工程が電子化された複数の文書の中から入力され
た検索要求に適合する文書を抽出し、第２の文書抽出工
程が、前記第１の文書抽出工程により抽出された文書が
作成された日時に関する情報を取得し、取得された情報
にかかる日時の前後の所定期間において作成された文書
を、前記複数の文書の中から、当該複数の文書の日時に
関する情報を参照して抽出し、文書類似度算出工程が前
記第２の文書抽出工程により抽出された文書どうしの類
似度を算出し、文書分類工程が前記文書類似度算出工程
により算出された類似度が所定のしきい値を超えた文書
どうしをまとめることにより前記第２の文書抽出工程に
より抽出された文書を分類するため、入力された検索要
求に適合する文書とともに、当該適合文書と同時期に作
成された文書もあわせて検索され、これによって、電子
化された複数の文書に記載された事象間の、時間的な共
起関係を操作者に分かりやすく提示することが可能な文
書検索方法が得られるという効果を奏する。According to the invention of claim 2, the first document extracting step extracts, from a plurality of digitized documents, the documents that match an input search request; the second document extracting step acquires information on the date and time at which each document extracted in the first document extracting step was created and extracts, from the plurality of documents, the documents created within a predetermined period before and after that date and time by referring to the date-and-time information of the plurality of documents; the document similarity calculating step calculates the similarity between the documents extracted in the second document extracting step; and the document classifying step classifies the documents extracted in the second document extracting step by grouping together documents whose similarity, as calculated in the document similarity calculating step, exceeds a predetermined threshold. As a result, documents created at the same time as the matching documents are retrieved together with the documents that match the input search request, which yields a document retrieval method capable of presenting to the operator, in an easily understood form, the temporal co-occurrence relationships between events described in the plurality of digitized documents.

【００８４】また、請求項３の発明によれば、請求項２
に記載された方法をコンピュータに実行させるプログラ
ムを記録したことで、そのプログラムが機械読み取り可
能となり、これによって、請求項２の動作をコンピュー
タによって実現することが可能な記録媒体が得られると
いう効果を奏する。According to the invention of claim 3, a program for causing a computer to execute the method of claim 2 is recorded, which makes the program machine-readable and yields a recording medium with which the operations of claim 2 can be realized by a computer.

[Brief description of drawings]

【図１】本実施の形態にかかる文書検索装置のハードウ
エア構成を示すブロック図である。FIG. 1 is a block diagram showing a hardware configuration of a document search device according to the present embodiment.

【図２】本実施の形態にかかる文書検索装置の構成を機
能的に示すブロック図である。FIG. 2 is a block diagram functionally showing the configuration of the document search device according to the present embodiment.

【図３】本実施の形態にかかる表示部２０７によって表
示される画面の一例を示す説明図である。FIG. 3 is an explanatory diagram showing an example of a screen displayed by a display unit 207 according to the present embodiment.

【図４】本実施の形態にかかる表示部２０７によって表
示される画面のほかの一例を示す説明図である。FIG. 4 is an explanatory diagram showing another example of a screen displayed by the display unit 207 according to the present embodiment.

【図５】本実施の形態にかかる文書検索装置の処理の手
順を示すフローチャートである。FIG. 5 is a flowchart showing a processing procedure of the document search device according to the present embodiment.

[Explanation of symbols]

100: バス (bus)
101: ＣＰＵ (CPU)
102: ＲＯＭ (ROM)
103: ＲＡＭ (RAM)
104: ＨＤＤ (HDD)
105: ＨＤ (HD)
106: ＦＤＤ (FDD)
107: ＦＤ (FD)
108: ディスプレイ (display)
109: Ｉ／Ｆ (I/F)
110: 通信回線 (communication line)
111: キーボード (keyboard)
112: マウス (mouse)
113: スキャナ (scanner)
114: プリンタ (printer)
115: ＣＤ−ＲＯＭ (CD-ROM)
116: ＣＤ−ＲＯＭドライブ (CD-ROM drive)
200: ファイル記憶部 (file storage unit)
201: ベクトル計算部 (vector calculation unit)
201a: クエリーベクトル作成部 (query vector creation unit)
201b: 文書ベクトル作成部 (document vector creation unit)
201c: ベクトル類似度算出部 (vector similarity calculation unit)
202: 入力部 (input unit)
203: 第１の文書抽出部 (first document extraction unit)
204: 第２の文書抽出部 (second document extraction unit)
205: 文書分類部 (document classification unit)
206: 文書並び替え部 (document sorting unit)
207: 表示部 (display unit)

フロントページの続き (56)参考文献特開平９−101990（ＪＰ，Ａ) 藪暁彦，必要な情報を即座に取り出すデータ検索ソフトＤａｔａｈｕｎｔｅｒＶｅｒ．１．１，ＩＮＴＥＲＮＥＴｍａｇａｚｉｎｅ，日本，株式会社インプレス，1990年４月１日，第51 号，第239頁情報検索・活用ソフト使ってみようＤａｔａｈｕｎｔｅｒ，日本，シャープ株式会社，1999年７月９日，第１− 16頁住田一男、三池誠司，情報フィルタリング技術，東芝レビュー，日本，東芝, 1996年１月１日，Ｖｏｌ．51、Ｎｏ．１，第42〜44頁那須川哲哉、諸橋正幸、長野徹，特集フィールドを広げる自然言語処理２テキストマイニング−膨大な文書データの自動分析による知識発見，情報処理, 日本，社団法人情報処理学会，1999年４月15日，第40巻、第４号，358−364頁 (58)調査した分野(Int.Cl.⁷，ＤＢ名) G06F 17/30 - 17/30 419 ＪＩＣＳＴファイル（ＪＯＩＳ)Continuation of the front page (56) References Japanese Patent Laid-Open No. 9-101990 (JP, A) Akihiko Yabu, Data search software Datahunter Ver. 1.1, INTERNET magazine, Japan, Impress Co., Ltd., April 1, 1990, No. 51, page 239 Information Retrieval and Utilization Software Let's Use Datahunter, Japan, Sharp Inc., July 9, 1999 , Pp. 1-16 Kazuo Sumita, Seiji Miike, Information Filtering Technology, Toshiba Review, Japan, Toshiba, January 1, 1996, Vol. 51, No. 1, pp. 42-44 Tetsuya Nasukawa, Masayuki Morohashi, Tohru Nagano, Special Issue Natural Language Processing to Expand Fields 2 Text Mining-Knowledge Discovery by Automatic Analysis of Enormous Document Data, Information Processing, Japan, Information Processing Society of Japan, April 15, 1999, Vol. 40, No. 4, pp. 358-364 (58) Fields investigated (Int.Cl. ⁷ , DB name) G06F 17/30-17/30 419 JISST file (JOIS)

Claims

(57) [Claims]

1. A document retrieval apparatus comprising: first document extracting means for extracting, from a plurality of digitized documents, documents that match an input search request; second document extracting means for acquiring information on the date and time at which each document extracted by the first document extracting means was created, and for extracting, from the plurality of documents, documents created within a predetermined period before and after the date and time indicated by the acquired information, by referring to the date-and-time information of the plurality of documents; document similarity calculating means for calculating the similarity between the documents extracted by the second document extracting means; and document classifying means for classifying the documents extracted by the second document extracting means by grouping together documents whose similarity, as calculated by the document similarity calculating means, exceeds a predetermined threshold.

2. A document retrieval method comprising: a first document extracting step of extracting, from a plurality of digitized documents, documents that match an input search request; a second document extracting step of acquiring information on the date and time at which each document extracted in the first document extracting step was created, and of extracting, from the plurality of documents, documents created within a predetermined period before and after the date and time indicated by the acquired information, by referring to the date-and-time information of the plurality of documents; a document similarity calculating step of calculating the similarity between the documents extracted in the second document extracting step; and a document classifying step of classifying the documents extracted in the second document extracting step by grouping together documents whose similarity, as calculated in the document similarity calculating step, exceeds a predetermined threshold.

3. A computer-readable recording medium on which a program for causing a computer to execute the method according to claim 2 is recorded.