JP2013196264A

JP2013196264A - Similarity search device and computer program and similarity search method

Info

Publication number: JP2013196264A
Application number: JP2012061609A
Authority: JP
Inventors: Yuichiro Muramatsu; 祐一郎村松
Original assignee: Mitsubishi Electric Information Technology Corp
Current assignee: Mitsubishi Electric Information Technology Corp
Priority date: 2012-03-19
Filing date: 2012-03-19
Publication date: 2013-09-30

Abstract

PROBLEM TO BE SOLVED: To present a user with an appropriate storage location for a file, taking the content of the file into consideration.
SOLUTION: A file storage part 20 stores a plurality of files. An entry word storage part 25 stores the entry words extracted from each file. A data input part 21 inputs data. An entry word extraction part 24 extracts entry words from the data input by the data input part 21. A search part 26 searches the entry words stored in the entry word storage part 25 for entry words matching those extracted by the entry word extraction part 24. A similarity calculation part 27 calculates the similarity between each file and the data input by the data input part 21 on the basis of the entry words found by the search part 26.

Description

The present invention relates to a similarity search device that searches a large number of accumulated files for similar files.

To store an electronic file in an appropriate file storage area, there is a technique that presents file storage areas with a high access frequency and lets the user select one of them.

JP 2010-170472 A
JP 2012-33171 A

A method that presents file storage areas with a high access frequency takes no account of the content of the electronic file, so it cannot always present an appropriate file storage area.
An object of the present invention is, for example, to present the user with an appropriate storage location for a file by taking the content of the file into consideration.

The similarity search device according to the present invention includes
a storage device that stores data, a file storage unit, a headword storage unit, a data input unit, a headword extraction unit, a search unit, and a similarity calculation unit, wherein
the file storage unit stores a plurality of files using the storage device,
the headword storage unit stores, using the storage device, the headwords extracted from each of the files stored by the file storage unit,
the data input unit inputs data,
the headword extraction unit extracts headwords from the data input by the data input unit,
the search unit searches the headwords stored by the headword storage unit for headwords that match the headwords extracted by the headword extraction unit, and
the similarity calculation unit calculates, for each of the files stored by the file storage unit, a similarity to the data input by the data input unit on the basis of the headwords found by the search unit.

With the similarity search device according to the present invention, the similarity between the data input by the data input unit and each file is calculated, so that, for example, an appropriate storage location for saving the data can be presented.

FIG. 1 is a diagram showing an example of the overall configuration of the file storage system 10 in Embodiment 1.
FIG. 2 is a diagram showing an example of the hardware resources of the computer 90 in Embodiment 1.
FIG. 3 is a diagram showing an example of the functional block configuration of the file storage device 12 in Embodiment 1.
FIG. 4 is a diagram showing an example of the file information D20 in Embodiment 1.
FIG. 5 is a diagram showing an example of the headwords D40 in Embodiment 1.
FIG. 6 is a diagram showing an example of the index information D50 in Embodiment 1.
FIG. 7 is a diagram showing an example of the similarity information D70 in Embodiment 1.
FIG. 8 is a diagram showing an example of the similar output information D80 in Embodiment 1.
FIG. 9 is a flowchart showing an example of the processing flow of the file storage device 12 in Embodiment 1.
FIG. 10 is a flowchart showing an example of the flow of the write process S12 in Embodiment 1.
FIG. 11 is a flowchart showing an example of the flow of the similarity search process S14 in Embodiment 1.
FIG. 12 is a diagram showing an example of the functional block configuration of the file storage device 12 in Embodiment 2.
FIG. 13 is a diagram showing an example of the headword count information D60 in Embodiment 2.
FIG. 14 is a diagram showing an example of the similarity information D70 in Embodiment 2.
FIG. 15 is a diagram showing an example of the similar output information D80 in Embodiment 2.
FIG. 16 is a flowchart showing an example of the flow of the write process S12 in Embodiment 2.
FIG. 17 is a flowchart showing an example of the flow of the similarity search process S14 in Embodiment 2.
FIG. 18 is a diagram showing an example of the functional block configuration of the file storage device 12 in Embodiment 3.
FIG. 19 is a diagram showing an example of the fitness information D90 in Embodiment 3.
FIG. 20 is a flowchart showing an example of the processing flow of the file storage device 12 in Embodiment 3.
FIG. 21 is a flowchart showing an example of the flow of the location search process S15 in Embodiment 3.

Embodiment 1.
Embodiment 1 will be described with reference to FIGS. 1 to 11.

FIG. 1 is a diagram showing an example of the overall configuration of the file storage system 10 in this embodiment.

The file storage system 10 includes, for example, a file editing device 11 and a file storage device 12.
The file storage device 12 stores a large number of files. A file is the unit in which an operating system (OS) or the like manages the electronic data stored in the file storage device 12.
The file editing device 11 is a device for browsing and editing the files stored in the file storage device 12 in accordance with user operations. The file editing device 11 also inputs new data in accordance with user operations and causes the file storage device 12 to store the input data as a new file.

FIG. 2 is a diagram showing an example of the hardware resources of the computer 90 in this embodiment.

Each of the file editing device 11 and the file storage device 12 is, for example, a computer 90.
The computer 90 includes, for example, a processing device 91, an input device 92, an output device 93, and a storage device 94.

The storage device 94 stores the computer programs executed by the processing device 91, the data processed by the processing device 91, and the like. The storage device 94 is, for example, an internal storage device such as a semiconductor memory, or an external storage device such as a magnetic disk device or an optical disk device.
The processing device 91 processes data and controls the entire computer 90 by executing the computer programs stored in the storage device 94.
The input device 92 inputs information from outside and converts it into data to be processed by the processing device 91. The data converted by the input device 92 may be processed directly by the processing device 91 or may be stored temporarily in the storage device 94. The input device 92 is, for example, an operation input device such as a keyboard or a mouse, an audio input device such as a microphone, an image input device such as a camera or a scanner, a sensor, an analog-to-digital converter, or a receiving device.
The output device 93 converts the data processed by the processing device 91 or the data stored in the storage device 94 and outputs it to the outside. The output device 93 is, for example, an audio output device such as a speaker, an image display device, a printing device, a digital-to-analog converter, or a transmitting device.

The functional blocks of the file storage device 12 and the like are realized, for example, by the processing device 91 executing a computer program. However, these functional blocks may instead be realized by other electrical or mechanical configurations.
The file storage device 12 and the like may also be configured from a plurality of computers 90 rather than a single computer 90. Conversely, a single computer 90 may constitute both the file editing device 11 and the file storage device 12.

FIG. 3 is a diagram showing an example of the functional block configuration of the file storage device 12 in this embodiment.

The file storage device 12 includes, for example, a file storage unit 20, an instruction input unit 22, a data input unit 21, a file output unit 23, a headword extraction unit 24, a headword storage unit 25, a search unit 26, a similarity calculation unit 27, and a similarity output unit 28.

The file storage unit 20 stores a plurality of files using the storage device 94. Each file stored by the file storage unit 20 is identified by a file identifier such as a document ID. So that the user can easily understand the content of a file, each file stored by the file storage unit 20 is given a file name chosen by the user or the like. In addition, so that the user can classify and organize files, each file stored by the file storage unit 20 is placed in a storage location, such as a folder, designated by the user. A storage location may reflect a physical difference between storage devices 94, or it may be a logical location unrelated to any such physical difference.

The data input unit 21 inputs data from the file editing device 11.

The instruction input unit 22 uses the input device 92 to input user instructions from the file editing device 11. User instructions include, for example, a write instruction, a read instruction, and a similarity search instruction.
A write instruction directs that the data input by the data input unit 21 be stored in the file storage unit 20 as a file. A write instruction includes, for example, the file name and storage location of the file to be stored in the file storage unit 20.
A read instruction directs that the content of a file stored by the file storage unit 20 be output. A read instruction includes, for example, the file name and storage location of the file to be output.
A similarity search instruction directs that the files stored by the file storage unit 20 be searched for files whose content is similar to the data input by the data input unit 21.

The file output unit 23 uses the output device 93 to output the content of a file stored by the file storage unit 20.

The headword extraction unit 24 uses the processing device 91 to extract headwords from the data input by the data input unit 21 and from the content of the files stored by the file storage unit 20. A headword is a keyword that represents the content of the data or file. Headword extraction methods include, for example, the word division method and the N-gram method.
The word division method divides the data into words and uses those words as headwords. For example, the headword extraction unit 24 stores a word dictionary in advance and, when a word registered in the word dictionary appears in the data, extracts that word as a headword.
The N-gram method cuts out character strings of a fixed length regardless of their meaning. For example, the headword extraction unit 24 extracts the 1st through Nth characters of the data as the first headword, the 2nd through (N + 1)th characters as the second headword, and so on, shifting by one character at a time.
The method by which the headword extraction unit 24 extracts headwords is not limited to the word division method or the N-gram method and may be some other method.
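As a rough illustration of the word division method described above, the following Python sketch reports every dictionary word that occurs in the data. The function name `extract_dictionary_headwords` and the sample dictionary are assumptions made for this example; the source does not specify how the dictionary lookup is carried out.

```python
def extract_dictionary_headwords(data: str, dictionary: set[str]) -> set[str]:
    """Return every registered dictionary word that appears somewhere in the data."""
    return {word for word in dictionary if word in data}


# Hypothetical dictionary and input text, for illustration only.
sample_dictionary = {"晴天", "雨天", "曇天"}
headwords = extract_dictionary_headwords("本日晴天なり。明日雨天なり。", sample_dictionary)
```

A plain substring test is the simplest possible lookup; a practical system would more likely use a morphological analyzer or a trie, which the source neither requires nor excludes.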

The headword storage unit 25 uses the storage device 94 to store, for each file, the headwords that the headword extraction unit 24 extracted from the files stored by the file storage unit 20.

The search unit 26 uses the processing device 91 to search the headwords stored by the headword storage unit 25 for those that match the headwords extracted by the headword extraction unit 24 from the data input by the data input unit 21.

The similarity calculation unit 27 uses the processing device 91 to calculate, on the basis of the headwords found by the search unit 26, the similarity between the data input by the data input unit 21 and each of the files stored by the file storage unit 20.
For example, for each file, the similarity calculation unit 27 counts how many of the headwords extracted from that file match headwords extracted from the data input by the data input unit 21, and uses that count as the similarity. In this case, a larger similarity value indicates that the data input by the data input unit 21 and the content of the file are more similar.

The similarity output unit 28 uses the processing device 91 to extract, from the files stored by the file storage unit 20, the files that are similar to the data input by the data input unit 21, on the basis of the similarity calculated by the similarity calculation unit 27. The similarity output unit 28 uses the output device 93 to output information such as the file name and storage location of each extracted file.
For example, the similarity output unit 28 extracts a predetermined number of files in descending order of the similarity calculated by the similarity calculation unit 27.
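Selecting a predetermined number of files in descending order of similarity can be sketched as below. The function name `select_most_similar`, the file identifiers, and the similarity values are hypothetical; the source only says that a fixed number of the most similar files is extracted.

```python
import heapq


def select_most_similar(similarities: dict[str, int], limit: int) -> list[tuple[str, int]]:
    """Return up to `limit` (file_id, similarity) pairs, highest similarity first."""
    return heapq.nlargest(limit, similarities.items(), key=lambda item: item[1])


# Hypothetical similarity values keyed by file identifier.
top_files = select_most_similar({"F001": 3, "F002": 7, "F003": 5}, limit=2)
```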

FIG. 4 is a diagram showing an example of the file information D20 in this embodiment.

The file storage unit 20 stores, for example, file information D20. The file information D20 includes, for example, a plurality of file identifiers D21, a plurality of file names D22, a plurality of storage locations D23, and a plurality of pieces of content data D24. Each file identifier D21 is associated with one file name D22, one storage location D23, and one piece of content data D24. The file identifier D21 is an identifier that uniquely identifies a file. The file name D22 represents the name of the file. The storage location D23 represents the storage location of the file. The content data D24 represents the content of the file.
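One way to picture the file information D20 is as a table of records keyed by file identifier. The sketch below is only an assumed in-memory layout; the class name `FileRecord`, the field names, and the sample values are not from the source.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    file_id: str    # file identifier D21 (unique per file)
    name: str       # file name D22
    location: str   # storage location D23
    content: str    # content data D24


# File information D20, modeled here as a dict keyed by file identifier.
file_info: dict[str, FileRecord] = {}
record = FileRecord("F001", "weather.txt", "/reports", "本日晴天なり。明日雨天なり。")
file_info[record.file_id] = record
```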

FIG. 5 is a diagram showing an example of the headwords D40 in this embodiment.

In this example, the headword extraction unit 24 extracts headwords by the N-gram method (N = 2). For example, from a file whose content data D24 is "本日晴天なり。明日雨天なり。" ("Fine weather today. Rainy weather tomorrow."), the headword extraction unit 24 extracts the ten headwords D40 "本日", "日晴", "晴天", "天な", "なり", "り。", "。明", "明日", "日雨", and "雨天". The three headwords "天な", "なり", and "り。" each appear twice in the content data D24, but in this example the headword extraction unit 24 does not extract the same headword twice and extracts only distinct headwords. The headword extraction unit 24 may instead be configured to extract the same headword multiple times.
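The N = 2 example above can be reproduced with a short sliding-window sketch. The function name `extract_ngram_headwords` is an assumption for illustration; the deduplication follows the behavior described for this example, in which each distinct headword is extracted only once.

```python
def extract_ngram_headwords(data: str, n: int = 2) -> list[str]:
    """Slide an n-character window over the data, keeping each distinct headword once."""
    seen = set()
    headwords = []
    for start in range(len(data) - n + 1):
        gram = data[start:start + n]
        if gram not in seen:  # skip headwords that were already extracted
            seen.add(gram)
            headwords.append(gram)
    return headwords


headwords = extract_ngram_headwords("本日晴天なり。明日雨天なり。", n=2)
```

The 14-character input yields 13 two-character windows, of which 10 are distinct, matching the ten headwords listed in the text.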

FIG. 6 is a diagram showing an example of the index information D50 in this embodiment.

The headword storage unit 25 stores, for example, index information D50. The index information D50 includes, for example, a plurality of headwords D51 and a plurality of file identifiers D52. Each headword D51 is associated with one file identifier D52. The headword D51 is a headword that the headword extraction unit 24 extracted from a file. The file identifier D52 is the file identifier of the file from which the headword extraction unit 24 extracted the headword D51.
The headword extraction unit 24 normally extracts a plurality of headwords from one file, so the file identifiers D52 contain duplicates. The headword extraction unit 24 may also extract the same headword from different files, so the headwords D51 contain duplicates as well.

When the headword extraction unit 24 is configured to extract the same headword multiple times from one file, the index information D50 may additionally include, for example, an extraction count representing the number of times the headword extraction unit 24 extracted the headword D51 from the file identified by the file identifier D52.
The index information D50 is not limited to the configuration shown in this example; any configuration is acceptable as long as the source files from which a headword was extracted can be looked up from that headword. For example, the index information D50 may merge duplicate headwords D51 into a single entry and, instead of a single file identifier D52, hold a list of the file identifiers of the files from which that headword was extracted.
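The alternative layout just described, in which duplicate headwords are merged and each headword maps to a list of file identifiers, resembles an inverted index. The following sketch converts a pair-based index into that form; the function name `group_by_headword` and the sample identifiers are hypothetical.

```python
from collections import defaultdict


def group_by_headword(pairs: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Merge (headword, file_id) pairs into one entry per headword."""
    index = defaultdict(list)
    for headword, file_id in pairs:
        if file_id not in index[headword]:  # list each source file only once
            index[headword].append(file_id)
    return dict(index)


# Hypothetical index information D50 as (headword D51, file identifier D52) pairs.
pairs = [("晴天", "F001"), ("本日", "F001"), ("晴天", "F002")]
index = group_by_headword(pairs)
```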

FIG. 7 is a diagram showing an example of the similarity information D70 in this embodiment.

The similarity calculation unit 27 generates, for example, similarity information D70. The similarity information D70 includes, for example, a plurality of file identifiers D71 and a plurality of hit counts D72. Each file identifier D71 is associated with one hit count D72. The file identifier D71 is the file identifier of a file from which a headword identical to one extracted from the data input by the data input unit 21 was extracted. The hit count D72 represents the number of headwords that match between the headwords extracted from the data input by the data input unit 21 and the headwords extracted from the file identified by the file identifier D71.
Even when the same headword appears multiple times in the data input by the data input unit 21, or multiple times in a single file, the similarity calculation unit 27 counts it as one hit.
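A minimal sketch of how similarity information D70 might be built from the index is shown below, counting each matching headword once per file as described above. The function name `build_similarity_info` and the pair-based representation of the index information D50 are assumptions for illustration.

```python
from collections import Counter


def build_similarity_info(input_headwords, index_pairs):
    """Count, per file, the distinct headwords shared with the input data."""
    wanted = set(input_headwords)  # a headword repeated in the input counts once
    # Using a set of (headword, file_id) pairs makes repeats within a file count once.
    matched = {(headword, file_id) for headword, file_id in index_pairs if headword in wanted}
    return Counter(file_id for _, file_id in matched)


# Hypothetical index information D50 as (headword D51, file identifier D52) pairs.
index_pairs = [("晴天", "F001"), ("雨天", "F001"), ("晴天", "F002"), ("晴天", "F002")]
similarity_info = build_similarity_info(["晴天", "雨天", "晴天"], index_pairs)
```

Files that share no headword with the input never appear in the result, which is consistent with D71 listing only files that have at least one hit.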

When the headword extraction unit 24 is configured to extract the same headword multiple times from one file, and a given headword is extracted a times from the data input by the data input unit 21 and b times from the file identified by the file identifier D71, the similarity calculation unit 27 may be configured to count that single headword as a × b hits (or as b hits). In this way, the similarity calculated by the similarity calculation unit 27 becomes higher when frequently occurring headwords are shared.
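The a × b variant can be illustrated with occurrence counts on both sides. In this sketch the function name `weighted_hit_count` and the sample counts are hypothetical; only the a × b rule itself comes from the text.

```python
from collections import Counter


def weighted_hit_count(input_counts: Counter, file_counts: Counter) -> int:
    """Sum a * b over the headwords that occur in both the input data and the file."""
    return sum(a * file_counts[headword]
               for headword, a in input_counts.items()
               if headword in file_counts)


# Hypothetical occurrence counts: "なり" appears 2 times in the input and 3 times in the file.
input_counts = Counter({"なり": 2, "晴天": 1})
file_counts = Counter({"なり": 3, "雨天": 1})
hits = weighted_hit_count(input_counts, file_counts)
```

Here only "なり" is shared, so the hit count is 2 × 3 = 6, whereas the one-hit-per-headword rule would give 1.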

FIG. 8 is a diagram showing an example of the similar output information D80 in this embodiment.

The similarity output unit 28 outputs, for example, similar output information D80. The similar output information D80 includes, for example, a plurality of file identifiers D81, a plurality of file names D82, a plurality of storage locations D83, and a plurality of hit counts D84. Each file identifier D81 is associated with one file name D82, one storage location D83, and one hit count D84. The file identifier D81 is the file identifier of a file similar to the data input by the data input unit 21. The file name D82 represents the file name of that file. The storage location D83 represents the storage location of that file. The hit count D84 represents the number of headwords that match between the headwords extracted from the data input by the data input unit 21 and the headwords extracted from the file identified by the file identifier D71.

FIG. 9 is a flowchart showing an example of the processing flow of the file storage device 12 in this embodiment.

The file storage device 12 executes, for example, an instruction input step S11, a write process S12, a read process S13, and a similarity search process S14.

In the instruction input step S11, the instruction input unit 22 inputs a user instruction from the file editing device 11.
If the input instruction is a write instruction, the instruction input unit 22 advances the processing to the write process S12.
If the input instruction is a read instruction, the instruction input unit 22 advances the processing to the read process S13.
If the input instruction is a similarity search instruction, the instruction input unit 22 advances the processing to the similarity search process S14.
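The branching in the instruction input step S11 amounts to a dispatch on the instruction type. The sketch below is a hypothetical rendering: the `kind` field, the handler table, and the returned labels are assumptions, and the real processes S12 to S14 are replaced with stand-ins.

```python
def dispatch(instruction: dict, handlers: dict) -> str:
    """Route an instruction to the handler registered for its kind."""
    kind = instruction["kind"]
    if kind not in handlers:
        raise ValueError(f"unknown instruction kind: {kind}")
    return handlers[kind](instruction)


# Stand-in handlers that only report which process would run.
handlers = {
    "write": lambda instruction: "S12",
    "read": lambda instruction: "S13",
    "similarity_search": lambda instruction: "S14",
}
selected = dispatch({"kind": "similarity_search"}, handlers)
```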

In the write process S12, the file storage device 12 stores the data input by the data input unit 21 in the file storage unit 20. The write process S12 is described in detail later. After the write process S12 ends, the instruction input unit 22 returns the processing to the instruction input step S11 and waits for the next instruction.

In the read process S13, the file storage unit 20 obtains the file name and storage location from the read instruction that the instruction input unit 22 input in the instruction input step S11. From the stored file information D20, the file storage unit 20 extracts the file identifier D21 that is associated with a file name D22 matching the obtained file name and with a storage location D23 matching the obtained storage location.
If a file identifier D21 satisfying the condition exists, the file storage unit 20 obtains the content data D24 associated with the extracted file identifier D21. The file output unit 23 outputs the content data D24 obtained by the file storage unit 20 to the file editing device 11.
If no file identifier D21 satisfying the condition exists, the file output unit 23 outputs an error to the file editing device 11.
The instruction input unit 22 returns the processing to the instruction input step S11 and waits for the next instruction.
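The lookup in the read process S13 can be sketched as a search over the file information for a record matching both the file name and the storage location. The function name `read_content`, the dict-based records, and the use of `FileNotFoundError` to stand in for the error output are assumptions for this example.

```python
def read_content(file_info: dict, name: str, location: str) -> str:
    """Return the content of the file with the given name and storage location."""
    for record in file_info.values():
        if record["name"] == name and record["location"] == location:
            return record["content"]
    # No matching file identifier: the source says an error is output.
    raise FileNotFoundError(f"{location}/{name}")


# Hypothetical file information D20 keyed by file identifier.
file_info = {"F001": {"name": "memo.txt", "location": "/notes", "content": "本日晴天なり。"}}
content = read_content(file_info, "memo.txt", "/notes")
```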

In the similarity search process S14, the file storage unit 20 searches the files it stores for files similar to the data input by the data input unit 21. The similarity search process S14 is described in detail later. After the similarity search process S14 ends, the instruction input unit 22 returns the processing to the instruction input step S11 and waits for the next instruction.

FIG. 10 is a flowchart showing an example of the flow of the write process S12 in this embodiment.

The write process S12 includes, for example, an overwrite determination step S21, a headword deletion step S22, a data input step S23, a file storage step S24, a headword extraction step S25, and a headword storage step S26.

In the overwrite determination step S21, the file storage unit 20 obtains the file name and storage location from the write instruction that the instruction input unit 22 input in the instruction input step S11. From the stored file information D20, the file storage unit 20 extracts the file identifier D21 that is associated with a file name D22 matching the file name obtained from the write instruction and with a storage location D23 matching the storage location obtained from the write instruction.
If a file identifier D21 satisfying the condition exists, that file is overwritten. The file storage unit 20 erases the content data D24 associated with the extracted file identifier D21 and advances the processing to the headword deletion step S22.
If no file identifier D21 satisfying the condition exists, a new file is created. The file storage unit 20 generates a new file identifier and stores it as a file identifier D21. The file storage unit 20 stores the file name obtained from the write instruction as the file name D22 associated with that file identifier D21, and stores the storage location obtained from the write instruction as the storage location D23 associated with that file identifier D21. The file storage unit 20 then advances the processing to the data input step S23.
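The overwrite determination can be sketched as "find the existing identifier, otherwise create one". In the sketch below, the function name `resolve_file_identifier`, the dict-based records, and the use of a random UUID as the new identifier are assumptions; the source does not say how new file identifiers are generated.

```python
import uuid


def resolve_file_identifier(file_info: dict, name: str, location: str) -> tuple[str, bool]:
    """Return (file_id, overwritten) for a write to the given name and location."""
    for file_id, record in file_info.items():
        if record["name"] == name and record["location"] == location:
            record["content"] = ""  # existing file: erase the old content data D24
            return file_id, True
    file_id = uuid.uuid4().hex  # hypothetical identifier scheme for a new file
    file_info[file_id] = {"name": name, "location": location, "content": ""}
    return file_id, False


file_info = {}
first_id, first_overwritten = resolve_file_identifier(file_info, "memo.txt", "/notes")
second_id, second_overwritten = resolve_file_identifier(file_info, "memo.txt", "/notes")
```

The returned flag corresponds to the branch taken: `True` leads to the headword deletion step S22, `False` directly to the data input step S23.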

In the headword deletion step S22, the headword storage unit 25 extracts, from the stored index information D50, the file identifiers D52 that match the file identifier D21 extracted by the file storage unit 20 in the overwrite determination step S21.
If a file identifier D52 that satisfies the condition exists, the headword storage unit 25 deletes every extracted file identifier D52 together with the headword D51 associated with it.
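A minimal sketch of step S22 follows, assuming the index information D50 is held as a Python dictionary that maps each headword D51 to the set of file identifiers D52 associated with it. That layout and the function name `delete_headwords` are assumptions of this sketch, not something the description specifies.

```python
def delete_headwords(index: dict[str, set[int]], file_id: int) -> int:
    """Remove every (headword, file_id) association; return how many were removed."""
    removed = 0
    for word in list(index):          # copy the keys so entries can be deleted safely
        if file_id in index[word]:
            index[word].discard(file_id)
            removed += 1
            if not index[word]:       # no other file uses this headword any more
                del index[word]
    return removed
```

Clearing the stale associations before re-indexing keeps headwords that occurred only in the old content from counting as hits for the overwritten file.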

In the data input step S23, the data input unit 21 inputs data from the file editing device 11. The data input unit 21 inputs the data one character at a time, in order.
When the end of the data has been reached and no characters remain to be input, the data input unit 21 ends the writing process S12.
When the end of the data has not yet been reached and one character of data has been input, the data input unit 21 proceeds to the file storage step S24.

In the file storage step S24, the file storage unit 20 appends the one character of data input in the data input step S23 to the end of the content data D24 associated with the file identifier D21 extracted or generated in the overwrite determination step S21, and stores the result.

In the headword extraction step S25, the headword extraction unit 24 goes back over the last N characters input by the data input unit 21 in the data input step S23 and takes those N characters as a headword. Here, N is an integer of 1 or more.
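The sliding-window extraction of step S25 can be sketched as follows. This is a minimal Python sketch rather than the implementation of the embodiment: the function name `headwords` is hypothetical, and it assumes that no headword is produced until N characters have been read, a case the description leaves open.

```python
from collections import deque
from typing import Iterable, Iterator


def headwords(chars: Iterable[str], n: int) -> Iterator[str]:
    """Yield the N-character headword ending at each input character."""
    if n < 1:
        raise ValueError("N must be an integer of 1 or more")
    window: deque[str] = deque(maxlen=n)   # keeps only the last N characters
    for ch in chars:
        window.append(ch)
        if len(window) == n:               # assumption: skip the first N-1 characters
            yield "".join(window)
```

For example, `list(headwords("abcd", 2))` yields the overlapping headwords `"ab"`, `"bc"`, and `"cd"`.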

In the headword storage step S26, the headword storage unit 25 searches the stored index information D50 for a pair in which a headword D51 matching the headword extracted by the headword extraction unit 24 in the headword extraction step S25 is associated with a file identifier D52 matching the file identifier D21 extracted or generated by the file storage unit 20 in the overwrite determination step S21.
If a pair of a headword D51 and a file identifier D52 that satisfies the condition exists, the headword has already been extracted.
If no such pair exists, the headword has not yet been extracted. The headword storage unit 25 stores the headword extracted by the headword extraction unit 24 in the headword extraction step S25 as a headword D51, and stores the file identifier D21 extracted or generated by the file storage unit 20 in the overwrite determination step S21 as the file identifier D52 associated with that headword D51.
The data input unit 21 then returns to the data input step S23 and inputs the next character.

As described above, when storing a new file, the file storage device 12 extracts headwords from the file and builds an index in advance.
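The loop formed by steps S23 to S26 might look like the following Python sketch. The dictionary-based index (headword to set of file identifiers), the `contents` dictionary, and the name `write_file` are all assumptions introduced for illustration.

```python
def write_file(index: dict[str, set[int]], contents: dict[int, str],
               file_id: int, data: str, n: int = 2) -> None:
    """Store `data` character by character and index its N-character headwords."""
    contents[file_id] = ""
    for ch in data:                               # S23: input one character
        contents[file_id] += ch                   # S24: append to the content data
        if len(contents[file_id]) < n:            # assumption: wait for N characters
            continue
        word = contents[file_id][-n:]             # S25: the last N characters
        # S26: adding to a set is a no-op when the pair is already stored.
        index.setdefault(word, set()).add(file_id)
```

Because each headword is associated with a file at most once, repeated occurrences of the same headword in one file do not enlarge the index.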

Note that, instead of building the index while the data is being input, the device may first input and store the data and then build the index.

FIG. 11 is a flowchart showing an example of the flow of the similarity search process S14 in this embodiment.

The similarity search process S14 includes, for example, an initialization step S40, a data input step S41, a headword extraction step S42, a headword search step S43, a file selection step S44, a coincidence counting step S45, a rearrangement step S50, a file selection step S51, and a similar output step S52.

In the initialization step S40, the headword extraction unit 24 initializes the list of already-extracted headwords. For example, the headword extraction unit 24 stores an empty list as the list of already-extracted headwords.
The similarity calculation unit 27 initializes the similarity information D70. For example, the similarity calculation unit 27 deletes the stored similarity information D70.

In the data input step S41, the data input unit 21 inputs data from the file editing device 11. The data input unit 21 inputs the data one character at a time, in order.
When the end of the data has been reached and no characters remain to be input, the data input unit 21 proceeds to the rearrangement step S50.
When the end of the data has not yet been reached and one character of data has been input, the data input unit 21 proceeds to the headword extraction step S42.

In the headword extraction step S42, the headword extraction unit 24 goes back over the last N characters input in the data input step S41 and takes those N characters as a headword. Here, N is an integer of 1 or more.
The headword extraction unit 24 determines whether the current headword is already in the list of already-extracted headwords.
If the current headword is in the list, the instruction input unit 22 returns to the data input step S41 and the next character is input.
If the current headword is not in the list, the headword extraction unit 24 adds the current headword to the list of already-extracted headwords and stores it. The headword extraction unit 24 then proceeds to the headword search step S43.

In the headword search step S43, the search unit 26 extracts, from the index information D50 stored by the headword storage unit 25, the headword D51 that matches the headword extracted by the headword extraction unit 24 in the headword extraction step S42.
If a headword D51 that satisfies the condition exists, the search unit 26 proceeds to the coincidence counting step S45.
If no headword D51 satisfies the condition, the data input unit 21 returns to the data input step S41 and inputs the next character.

In the file selection step S44, the similarity calculation unit 27 selects one not-yet-selected file identifier D52 from among the file identifiers D52 associated with the headword D51 extracted by the search unit 26 in the headword search step S43.
If all file identifiers D52 associated with the headword D51 extracted by the search unit 26 have already been selected and none remains unselected, the data input unit 21 returns to the data input step S41 and inputs the next character.
If an unselected file identifier D52 remains among the file identifiers D52 associated with the headword D51 extracted by the search unit 26, the similarity calculation unit 27 selects one file identifier D52 from among the unselected ones.

In the coincidence counting step S45, the similarity calculation unit 27 extracts, from the stored similarity information D70, the file identifier D71 that matches the file identifier D52 selected in the file selection step S44.
If a file identifier D71 matching the file identifier D52 exists, the similarity calculation unit 27 adds 1 to the hit count D72 associated with that file identifier D71.
If no file identifier D71 matches the file identifier D52, the similarity calculation unit 27 stores the file identifier D52 selected in the file selection step S44 as a file identifier D71, and stores 1 as the hit count D72 associated with that file identifier D71.
The similarity calculation unit 27 then returns to the file selection step S44 and selects the next file identifier D52.

In the rearrangement step S50, the similar output unit 28 sorts the similarity information D70 stored by the similarity calculation unit 27 in descending order of the hit count D72.

In the file selection step S51, the similar output unit 28 selects a not-yet-selected file identifier D71 from the similarity information D70 sorted in the rearrangement step S50.
If all file identifiers D71 have already been selected and none remains unselected, or if the number of selected file identifiers D71 has reached a predetermined number, the similar output unit 28 ends the similarity search process S14.
If an unselected file identifier D71 remains and the number of selected file identifiers D71 has not yet reached the predetermined number, the similar output unit 28 selects, from among the unselected file identifiers D71, the one whose associated hit count D72 is the largest.

In the similar output step S52, the similar output unit 28 acquires, from the file information D20 stored by the file storage unit 20, the file name D22 and the storage location D23 associated with the file identifier D21 that matches the file identifier D71 selected in the file selection step S51. The similar output unit 28 outputs the file identifier D71 selected in the file selection step S51 as a file identifier D81. It outputs the file name D22 acquired from the file information D20 as the file name D82 associated with the file identifier D81, and the storage location D23 acquired from the file information D20 as the storage location D83 associated with the file identifier D81. It also outputs the hit count D72 associated with the file identifier D71 selected in the file selection step S51 as the hit count D84 associated with the file identifier D81.
The similar output unit 28 then returns to the file selection step S51 and selects the next file identifier D71.

In this way, files with a large number of matching headwords are judged to be similar files. Because the index is built in advance, similar files can be found quickly even when the file storage unit 20 stores a large number of files.
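Taken together, steps S40 to S52 amount to counting, per stored file, how many distinct headwords of the input data appear in the index, and then listing the files with the highest counts. The Python sketch below illustrates this under several assumptions: the index is a dictionary from headword to a set of file identifiers, `limit` stands in for the "predetermined number" of results, and the name `similarity_search` is hypothetical. The order of files with equal hit counts is not specified by the description and is arbitrary here.

```python
def similarity_search(index: dict[str, set[int]], data: str,
                      n: int = 2, limit: int = 3) -> list[tuple[int, int]]:
    """Return up to `limit` (file identifier, hit count) pairs, best first."""
    seen: set[str] = set()            # S40: list of already-extracted headwords
    hits: dict[int, int] = {}         # S40: similarity information D70
    for i in range(n, len(data) + 1):
        word = data[i - n:i]          # S42: the last N characters
        if word in seen:              # each distinct headword counts once
            continue
        seen.add(word)
        for file_id in index.get(word, ()):             # S43, S44
            hits[file_id] = hits.get(file_id, 0) + 1    # S45
    ranked = sorted(hits.items(), key=lambda kv: kv[1], reverse=True)  # S50
    return ranked[:limit]             # S51, S52
```

Only files that share at least one headword with the input are ever visited, which is why the search cost does not grow with the number of unrelated files.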

Note that, instead of searching while the data is being input, the device may first input the data, store it temporarily, and then perform the search.

As described above, headwords are extracted when a file is stored. The headwords of the file to be stored are collated against the existing index (index information D50), and the hit counts are totaled for each document ID (file identifier). Information such as the file name and the storage location is added from the existing file information D20, and the result is output. Similar files can thus be obtained and then consulted or reused.

Embodiment 2.
Embodiment 2 will be described with reference to FIGS. 12 to 17.
Parts common to Embodiment 1 are denoted by the same reference numerals, and their description is omitted.

This embodiment describes another example of the similarity calculated by the similarity calculation unit 27.

FIG. 12 is a diagram showing an example of the functional block configuration of the file storage device 12 in this embodiment.

Using the storage device 94, the headword storage unit 25 stores, in addition to the headwords extracted by the headword extraction unit 24, the number of headwords that the headword extraction unit 24 extracted from each file, file by file.

Using the processing device 91, the similarity calculation unit 27 takes as the similarity of each file the proportion of the headwords extracted from that file that match a headword extracted from the data input by the data input unit 21. For example, the similarity calculation unit 27 calculates the similarity described in Embodiment 1, divides it by the number of headwords that the headword storage unit 25 has stored for the file, and uses the quotient as the similarity in this embodiment.

FIG. 13 is a diagram showing an example of the headword count information D60 in this embodiment.

The headword storage unit 25 stores, for example, headword count information D60. The headword count information D60 includes, for example, a plurality of file identifiers D61 and a plurality of total headword counts D62. Each file identifier D61 is associated with one total headword count D62. The file identifier D61 is the file identifier of a file stored by the file storage unit 20. The total headword count D62 represents the total number of headwords that the headword extraction unit 24 extracted from that file.

FIG. 14 is a diagram showing an example of the similarity information D70 in this embodiment.

In addition to the data described in Embodiment 1, the similarity information D70 includes a plurality of hit ratios D73. Each file identifier D71 is associated with one hit ratio D73. The hit ratio D73 represents the proportion of the headwords that the headword extraction unit 24 extracted from the file identified by the associated file identifier D71 that match a headword that the headword extraction unit 24 extracted from the data input by the data input unit 21.

FIG. 15 is a diagram showing an example of the similar output information D80 in this embodiment.

In addition to the data described in Embodiment 1, the similar output information D80 includes a plurality of hit ratios D85. Each file identifier D81 is associated with one hit ratio D85. The hit ratio D85 represents the proportion of the headwords that the headword extraction unit 24 extracted from the file identified by the associated file identifier D71 that match a headword that the headword extraction unit 24 extracted from the data input by the data input unit 21.

FIG. 16 is a flowchart showing an example of the flow of the writing process S12 in this embodiment.

In addition to the steps described in Embodiment 1, the writing process S12 includes a headword counting step S27 and a headword count storage step S28.

In the headword storage step S26, if the headword has already been extracted, the data input unit 21 returns to the data input step S23 and inputs the next character.
If the headword has not yet been extracted, the headword storage unit 25 proceeds to the headword counting step S27.

In the headword counting step S27, the headword storage unit 25 adds 1 to the total headword count.
The data input unit 21 then returns to the data input step S23 and inputs the next character.

In the data input step S23, when the end of the data has been reached, the data input unit 21 proceeds to the headword count storage step S28.

In the headword count storage step S28, the headword storage unit 25 stores the file identifier D21 extracted or generated in the overwrite determination step S21 as a file identifier D61. The headword storage unit 25 stores the calculated total headword count as the total headword count D62 associated with that file identifier D61.

In this way, the total headword count is calculated in advance, when the index is built.
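A sketch of how steps S27 and S28 could be folded into indexing is given below. It is illustrative only: the dictionary-based index, the `totals` dictionary standing in for the headword count information D60, and the function name are assumptions, and the whole text is passed in at once for brevity rather than character by character.

```python
def index_file_with_count(index: dict[str, set[int]], totals: dict[int, int],
                          file_id: int, text: str, n: int = 2) -> int:
    """Index the N-character headwords of `text` and record how many are distinct."""
    total = 0
    for i in range(n, len(text) + 1):
        word = text[i - n:i]
        ids = index.setdefault(word, set())
        if file_id in ids:        # S26: headword already extracted for this file
            continue
        ids.add(file_id)
        total += 1                # S27: count only newly extracted headwords
    totals[file_id] = total       # S28: store as total headword count D62
    return total
```

Since the counter is incremented only for headwords not yet stored for the file, the stored total is the number of distinct headwords in the file, which is the denominator needed for the hit ratio.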

FIG. 17 is a flowchart showing an example of the flow of the similarity search process S14 in this embodiment.

In addition to the steps described in Embodiment 1, the similarity search process S14 includes a file selection step S46 and a ratio calculation step S47.

In the data input step S41, when the end of the data has been reached, the data input unit 21 proceeds to the file selection step S46.

In the file selection step S46, the similarity calculation unit 27 selects one not-yet-selected file identifier D71 from the stored similarity information D70.
If all file identifiers D71 have already been selected and none remains unselected, the similarity calculation unit 27 proceeds to the rearrangement step S50.
If an unselected file identifier D71 remains, the similarity calculation unit 27 selects one file identifier D71 from among the unselected ones and proceeds to the ratio calculation step S47.

In the ratio calculation step S47, the similarity calculation unit 27 extracts, from the headword count information D60 stored by the headword storage unit 25, the file identifier D61 that matches the file identifier D71 selected in the file selection step S46. The similarity calculation unit 27 acquires the total headword count D62 associated with the extracted file identifier D61. The similarity calculation unit 27 divides the hit count D72 associated with the file identifier D71 selected in the file selection step S46 by the total headword count D62 acquired from the headword count information D60. The similarity calculation unit 27 stores the resulting quotient as the hit ratio D73 associated with the file identifier D71 selected in the file selection step S46.
The similarity calculation unit 27 then returns to the file selection step S46 and selects the next file identifier D71.

In this way, instead of using the hit count itself as the similarity, the quotient obtained by dividing the hit count by the total headword count is used as the similarity. The more headwords the headword extraction unit 24 extracts from a file, the larger the hit count for that file tends to be. Using the hit ratio as the similarity makes it possible to find similar files regardless of how many headwords the headword extraction unit 24 extracted from each file.
In addition, because the total headword count is calculated in advance, similar files can be found quickly even when the file storage unit 20 stores a large number of files.
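The effect of the normalization can be seen in a small Python sketch. The function name `hit_ratios` and the figures in the usage note are invented for illustration; skipping files whose stored total is missing or zero is an assumption added to avoid division by zero and should not occur with a consistent index.

```python
def hit_ratios(hits: dict[int, int], totals: dict[int, int]) -> dict[int, float]:
    """Divide each hit count D72 by the file's total headword count D62."""
    return {
        file_id: hits[file_id] / totals[file_id]
        for file_id in hits
        if totals.get(file_id)    # assumption: ignore files without a positive total
    }
```

Suppose a long file 1 has 30 hits out of 300 headwords and a short file 2 has 10 hits out of 20. Ranked by hit count, file 1 comes first; ranked by `hit_ratios({1: 30, 2: 10}, {1: 300, 2: 20})`, file 2 comes first, because half of its headwords match the input whereas only a tenth of file 1's do.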

Embodiment 3.
Embodiment 3 will be described with reference to FIGS. 18 to 21.
Parts common to Embodiment 1 or Embodiment 2 are denoted by the same reference numerals, and their description is omitted.

This embodiment describes a configuration that extracts, on the basis of the similarity, candidate storage locations in which to store data as a file.

FIG. 18 is a diagram showing an example of the functional block configuration of the file storage device 12 in this embodiment.

In addition to the configuration described in Embodiment 2, the file storage device 12 includes a fitness calculation unit 29 and a storage location candidate extraction unit 30.

Using the processing device 91, the fitness calculation unit 29 calculates, for each storage location in which the file storage unit 20 stores files, a fitness that represents how suitable that location is for storing the data input by the data input unit 21. The fitness is calculated on the basis of the similarity calculated by the similarity calculation unit 27.
For example, for each storage location, the fitness calculation unit 29 sums the similarities that the similarity calculation unit 27 calculated for all files stored in that location and uses the sum as the fitness of the location.

Using the processing device 91, the storage location candidate extraction unit 30 extracts, on the basis of the fitness calculated by the fitness calculation unit 29, candidate storage locations for the data input by the data input unit 21 from among the storage locations in which the file storage unit 20 stores files. The storage location candidate extraction unit 30 outputs the extracted candidates using the output device 93.
For example, the storage location candidate extraction unit 30 extracts a predetermined number of storage locations as candidates, in descending order of the fitness calculated by the fitness calculation unit 29.

FIG. 19 is a diagram showing an example of the fitness information D90 in this embodiment.

The fitness calculation unit 29 generates, for example, fitness information D90. The fitness information D90 includes, for example, a plurality of storage locations D91 and a plurality of fitness values D92. Each storage location D91 is associated with one fitness value D92. The storage location D91 represents a storage location in which the file storage unit 20 stores files. The fitness value D92 represents the fitness that the fitness calculation unit 29 calculated for the associated storage location D91.

FIG. 20 is a flowchart showing an example of the processing flow of the file storage device 12 in this embodiment.

In addition to the processing described in Embodiment 1, the file storage device 12 executes a location search process S15.

In the instruction input step S11, if the instruction input by the instruction input unit 22 is a location search instruction, the instruction input unit 22 proceeds to the location search process S15.
A location search instruction is an instruction to search for candidate storage locations in which the data input by the data input unit 21 should be stored.

FIG. 21 is a flowchart showing an example of the flow of the location search process S15 in this embodiment.

The location search process S15 includes, for example, an initialization step S40, a data input step S41, a headword extraction step S42, a headword search step S43, a file selection step S44, a coincidence counting step S45, a file selection step S46, a ratio calculation step S47, a tabulation step S48, a rearrangement step S50, a location selection step S53, and a location output step S54. Of these, the steps bearing the same reference signs as steps of the similarity search process S14 described in Embodiment 2 are the same as the corresponding steps of that process.

After the ratio calculation step S47 is completed, the similarity calculation unit 27 proceeds to the tabulation step S48.

In the tabulation step S48, the fitness calculation unit 29 extracts, from the file information D20 stored by the file storage unit 20, the file identifier D21 that matches the file identifier D71 selected by the similarity calculation unit 27 in the file selection step S46. The fitness calculation unit 29 acquires the storage location D23 associated with the extracted file identifier D21.
The fitness calculation unit 29 extracts, from the stored fitness information D90, the storage location D91 that matches the acquired storage location D23.
If a storage location D91 that satisfies the condition exists, the fitness calculation unit 29 adds the similarity calculated by the similarity calculation unit 27 in the ratio calculation step S47 to the fitness value D92 associated with the extracted storage location D91.
If no storage location D91 satisfies the condition, the fitness calculation unit 29 stores the storage location D23 acquired from the file storage unit 20 as a storage location D91, and stores the similarity calculated by the similarity calculation unit 27 in the ratio calculation step S47 as the fitness value D92 associated with that storage location D91.
The similarity calculation unit 27 then returns to the file selection step S46 and selects the next file identifier D71.
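The accumulation performed in step S48 can be sketched in Python as follows. The two dictionaries (file identifier to similarity, and file identifier to storage location) and the name `aggregate_fitness` are assumptions of this sketch; in the embodiment the storage location is looked up in the file information D20.

```python
def aggregate_fitness(similarity: dict[int, float],
                      locations: dict[int, str]) -> dict[str, float]:
    """Sum the similarity of every scored file into its storage location."""
    fitness: dict[str, float] = {}            # fitness information D90
    for file_id, score in similarity.items():
        location = locations[file_id]         # storage location D23 of the file
        # Create the entry on first sight, otherwise add to the running sum.
        fitness[location] = fitness.get(location, 0.0) + score
    return fitness
```

Only files that received at least one hit appear in `similarity`, so storage locations that hold no similar file never enter the fitness information at all.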

In the rearrangement step S50, the storage location candidate extraction unit 30 sorts the fitness information D90 stored by the fitness calculation unit 29 in descending order of the fitness value D92.

In the location selection step S53, the storage location candidate extraction unit 30 selects a not-yet-selected storage location D91 from the fitness information D90 sorted in the rearrangement step S50.
If all storage locations D91 have already been selected and none remains unselected, or if the number of selected storage locations D91 has reached a predetermined number, the storage location candidate extraction unit 30 ends the location search process S15.
If an unselected storage location D91 remains and the number of selected storage locations D91 has not yet reached the predetermined number, the storage location candidate extraction unit 30 selects, from among the unselected storage locations D91, the one whose associated fitness value D92 is the largest.

In the location output step S54, the storage location candidate extraction unit 30 outputs the storage location D91 selected in the location selection step S53 together with the fitness D92 associated with that storage location D91.
The storage location candidate extraction unit 30 returns the process to the location selection step S53 and selects the next storage location D91.
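Taken together, steps S50, S53, and S54 amount to sorting the storage locations by fitness and outputting at most a predetermined number of them. The following Python sketch shows one way this could look; the function name `storage_candidates` and the parameter `max_candidates` (the "predetermined number") are illustrative assumptions, not names from the specification.

```python
def storage_candidates(fitness, max_candidates):
    """Return up to max_candidates (location, fitness) pairs, best first.

    fitness: storage location (D91) -> fitness (D92), assumed to be a dict.
    """
    # Step S50: sort in descending order of fitness.
    ranked = sorted(fitness.items(), key=lambda item: item[1], reverse=True)
    # Steps S53/S54: output locations one by one until all are output
    # or the predetermined number is reached.
    return ranked[:max_candidates]


fitness = {"/a": 1.5, "/b": 3.0, "/c": 0.5}
for location, value in storage_candidates(fitness, max_candidates=2):
    print(location, value)
```

Slicing the sorted list yields the same result as repeatedly picking the unselected location with the highest fitness, because the list is already in descending order.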

In this way, the fitness of a storage location is calculated from the similarities calculated for the files stored in that location, so a storage location that contains many similar files is extracted as a candidate location for storing the data. Because such locations are extracted as candidates, storage location candidates suited to the content of the data can be presented.

Note that the fitness calculation unit 29 may instead be configured to average, for each storage location, the similarities calculated by the similarity calculation unit 27 for all files stored in that storage location, and to use the average as the fitness of that storage location.
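A minimal Python sketch of the averaging variant follows. It assumes that a file for which no similarity was recorded counts as similarity 0 in the average; the specification does not state how such files are treated, so this is only one possible reading. The function name and the dictionaries are hypothetical.

```python
def average_fitness(similarities, storage_of):
    """Average the similarities of all files in each storage location.

    similarities: file identifier -> similarity (files without an entry
                  are treated as 0.0, which is an assumption of this sketch)
    storage_of:   file identifier -> storage location, covering every stored file
    """
    totals = {}
    counts = {}
    for file_id, location in storage_of.items():
        totals[location] = totals.get(location, 0.0) + similarities.get(file_id, 0.0)
        counts[location] = counts.get(location, 0) + 1
    return {location: totals[location] / counts[location] for location in totals}


similarities = {"f1": 0.5, "f2": 0.25, "f3": 1.0}
storage_of = {
    "f1": "/docs/specs",
    "f2": "/docs/specs",
    "f3": "/docs/minutes",
    "f4": "/docs/minutes",  # no matching headwords, so no similarity entry
}
print(average_fitness(similarities, storage_of))
```

Compared with summing, averaging keeps a location from ranking high merely because it holds a large number of weakly similar files.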

The similarity calculation unit 27 may also be configured to use the hit count, rather than the hit ratio, as the similarity.

Files that a user judges to be closely related are often placed in the same storage location. Therefore, even a file with few matching headwords is likely to be related in some way if many highly similar files are stored in the same storage location. Accordingly, the storage locations that the storage location candidate extraction unit 30 outputs in response to a location search instruction can be used not only to decide where to store the data, but also to look for closely related files.

In response to a similarity search instruction, the similarity output unit 28 may also be configured to output, as a file similar to the data input by the data input unit 21, a file stored in a storage location with a high fitness calculated by the fitness calculation unit 29, even if the similarity calculated by the similarity calculation unit 27 for that file is low. For example, the similarity output unit 28 calculates the sum of the similarity calculated by the similarity calculation unit 27 and the product of the fitness calculated by the fitness calculation unit 29 and a predetermined coefficient (for example, 0.1), and outputs the files in descending order of that sum.
This makes it possible to find closely related files even when their similarity is low.
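The combined ranking can be written as score = similarity + coefficient × fitness of the file's storage location. The Python sketch below illustrates this with the example coefficient 0.1 from the text; the function name, the dictionaries, and the sample values are hypothetical.

```python
def rank_similar_files(similarities, storage_of, fitness, coefficient=0.1):
    """Order files by similarity plus a weighted fitness of their storage location.

    similarities: file identifier -> similarity
    storage_of:   file identifier -> storage location
    fitness:      storage location -> fitness
    coefficient:  the predetermined coefficient (0.1 is the example in the text)
    """
    scores = {}
    for file_id, location in storage_of.items():
        scores[file_id] = (
            similarities.get(file_id, 0.0) + coefficient * fitness.get(location, 0.0)
        )
    # Output the files in descending order of the calculated sum.
    return sorted(scores, key=scores.get, reverse=True)


similarities = {"f1": 0.9, "f2": 0.8, "f3": 0.1, "f4": 0.2}
storage_of = {"f1": "/proj", "f2": "/proj", "f3": "/proj", "f4": "/misc"}
fitness = {"/proj": 1.8, "/misc": 0.2}
print(rank_similar_files(similarities, storage_of, fitness))
```

Here `f3` has a lower similarity than `f4`, but it sits in `/proj` alongside two highly similar files, so the fitness term lifts it above `f4` in the output order.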

The configurations described in the embodiments above are examples, and other configurations may be used. For example, configurations described in different embodiments may be combined, and non-essential parts of a configuration may be replaced with other configurations.

The similarity search device (file storage device 12) described above includes a storage device (94) that stores data, a file storage unit (20), a headword storage unit (25), a data input unit (21), a headword extraction unit (24), a search unit (26), and a similarity calculation unit (27).
The file storage unit stores a plurality of files using the storage device.
The headword storage unit stores, using the storage device, the headwords extracted from each file stored in the file storage unit.
The data input unit inputs data.
The headword extraction unit extracts headwords from the data input by the data input unit.
The search unit searches the headwords stored in the headword storage unit for headwords that match the headwords extracted by the headword extraction unit.
The similarity calculation unit calculates, for each file stored in the file storage unit, the similarity between the file and the data input by the data input unit, based on the headwords found by the search unit.
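The search and hit counting described above can be sketched in Python with an inverted index that maps each headword to the files containing it. The index layout, the function name `count_hits`, and the sample data are assumptions for illustration only; headword extraction itself is outside this sketch, and the headwords are taken as given.

```python
def count_hits(index, query_headwords):
    """Count, per file, how many query headwords match stored headwords.

    index:           headword -> set of file identifiers (assumed layout
                     for the stored headwords)
    query_headwords: headwords extracted from the input data
    """
    hits = {}
    # Use a set so that a headword repeated in the input is counted once.
    for headword in set(query_headwords):
        for file_id in index.get(headword, ()):
            hits[file_id] = hits.get(file_id, 0) + 1
    return hits


index = {
    "invoice": {"f1", "f2"},
    "budget": {"f1"},
    "meeting": {"f3"},
}
print(count_hits(index, ["invoice", "budget", "travel"]))
```

Files that share no headword with the input data never appear in the result, so only files with at least one match need further processing.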

Because the similarity between the data input by the data input unit and each file is calculated, an appropriate storage location for the data can be presented.

The similarity search device (12) further includes a fitness calculation unit (29) and a storage location candidate extraction unit (30).
The file storage unit (20) stores each of the plurality of files in one of a plurality of storage locations.
For each of the plurality of storage locations, the fitness calculation unit calculates the fitness of the storage location for the data input by the data input unit, based on the similarities calculated by the similarity calculation unit (27) for the files that the file storage unit stores in that storage location.
The storage location candidate extraction unit extracts, from the plurality of storage locations, candidate storage locations for the data input by the data input unit (21), based on the fitness calculated by the fitness calculation unit.

Because the fitness is calculated based on the similarity, an appropriate storage location for the data can be presented.

The fitness calculation unit (29) calculates the sum or the average of the similarities calculated by the similarity calculation unit (27) for the files stored in the storage location, and uses the result as the fitness of that storage location.

Because the sum or the average of the similarities is used as the fitness, an appropriate storage location for the data can be presented.

The similarity calculation unit (27) calculates the number or the ratio of headwords, among the headwords extracted from a file stored in the file storage unit (20), that match the headwords extracted by the headword extraction unit (24) from the data input by the data input unit (21), and uses the result as the similarity between the file and the data.
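The two similarity measures can be compared in a short Python sketch. It reads the ratio as the hit count divided by the total number of headwords extracted from the file; the function name, the `use_ratio` flag, and the handling of a file with no headwords are assumptions of this sketch.

```python
def similarity(hit_count, total_headwords, use_ratio=True):
    """Return the similarity as a hit ratio or as a raw hit count.

    hit_count:       number of the file's headwords that match the input data
    total_headwords: total number of headwords extracted from the file
    use_ratio:       True for the hit ratio, False for the hit count
    """
    if not use_ratio:
        return float(hit_count)
    if total_headwords == 0:
        # Assumption: a file with no headwords is given similarity 0.
        return 0.0
    return hit_count / total_headwords


print(similarity(2, 8))                   # hit ratio
print(similarity(2, 8, use_ratio=False))  # hit count
```

The ratio favors short files whose headwords mostly match, while the count favors files with many matches regardless of their length.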

Because the number or the ratio of matching headwords is used as the similarity, an appropriate storage location for the data can be presented.

10 file storage system; 11 file editing device; 12 file storage device; 20 file storage unit; 21 data input unit; 22 instruction input unit; 23 file output unit; 24 headword extraction unit; 25 headword storage unit; 26 search unit; 27 similarity calculation unit; 28 similarity output unit; 29 fitness calculation unit; 30 storage location candidate extraction unit; 90 computer; 91 processing device; 92 input device; 93 output device; 94 storage device; D20 file information; D21, D52, D61, D71, D81 file identifier; D22, D82 file name; D23, D83, D91 storage location; D24 content data; D40, D51 headword; D50 index information; D60 headword count information; D62 total headword count; D70 similarity information; D72, D84 hit count; D73, D85 hit ratio; D80 similarity output information; D90 fitness information; D92 fitness; S11 instruction input step; S12 write process; S13 read process; S14 similarity search process; S15 location search process; S21 overwrite determination step; S22 headword deletion step; S23, S41 data input step; S24 file storage step; S25, S42 headword extraction step; S26 headword storage step; S27 headword counting step; S28 headword count storage step; S40 initialization step; S43 headword search step; S44, S46, S51 file selection step; S45 coincidence counting step; S47 ratio calculation step; S48 tabulation step; S50 rearrangement step; S52 similarity output step; S53 location selection step; S54 location output step.

Claims

1. A similarity search device comprising a storage device for storing data, a file storage unit, a headword storage unit, a data input unit, a headword extraction unit, a search unit, and a similarity calculation unit, wherein:
the file storage unit stores a plurality of files using the storage device;
the headword storage unit stores, using the storage device, headwords extracted from each file stored in the file storage unit;
the data input unit inputs data;
the headword extraction unit extracts headwords from the data input by the data input unit;
the search unit searches the headwords stored in the headword storage unit for headwords that match the headwords extracted by the headword extraction unit; and
the similarity calculation unit calculates, for each file stored in the file storage unit, a similarity between the file and the data input by the data input unit, based on the headwords found by the search unit.

2. The similarity search device according to claim 1, further comprising a fitness calculation unit and a storage location candidate extraction unit, wherein:
the file storage unit stores each of the plurality of files in one of a plurality of storage locations;
for each of the plurality of storage locations, the fitness calculation unit calculates a fitness of the storage location for the data input by the data input unit, based on the similarities calculated by the similarity calculation unit for the files that the file storage unit stores in the storage location; and
the storage location candidate extraction unit extracts, from the plurality of storage locations, candidate storage locations for the data input by the data input unit, based on the fitness calculated by the fitness calculation unit.

3. The similarity search device according to claim 2, wherein the fitness calculation unit calculates the sum or the average of the similarities calculated by the similarity calculation unit for the files stored in the storage location, and uses the result as the fitness of the storage location.

4. The similarity search device according to any one of claims 1 to 3, further comprising a similar file extraction unit, wherein
the similar file extraction unit extracts, from the plurality of files, a file similar to the data input by the data input unit, based on the similarity calculated by the similarity calculation unit.

5. The similarity search device according to claim 1, wherein the similarity calculation unit calculates the number or the ratio of headwords, among the headwords extracted from a file stored in the file storage unit, that match the headwords extracted by the headword extraction unit from the data input by the data input unit, and uses the result as the similarity between the file and the data.

6. A computer program that, when executed by a computer, causes the computer to function as the similarity search device according to any one of claims 1 to 5.

7. A similarity search method, wherein:
a storage device stores a plurality of files;
the storage device stores, for each stored file, headwords extracted from the file;
an input device inputs data;
a processing device extracts headwords from the data input by the input device;
the processing device searches the headwords stored in the storage device for headwords that match the extracted headwords; and
the processing device calculates, for each file stored in the storage device, a similarity between the file and the data input by the input device, based on the headwords found by the search.