JP2003330940A

JP2003330940A - Creation method for retrieval index data, creation device for retrieval index data, and file retrieval device

Info

Publication number: JP2003330940A
Application number: JP2002136320A
Authority: JP
Inventors: Naoya Uematsu (植松 直也)
Original assignee: JustSystems Corp
Current assignee: JustSystems Corp
Priority date: 2002-05-10
Filing date: 2002-05-10
Publication date: 2003-11-21
Anticipated expiration: 2022-05-10
Also published as: JP4974436B2

Abstract

PROBLEM TO BE SOLVED: To eliminate the trouble of managing a large number of files by classifying them according to their contents.

SOLUTION: A file retrieval server 100 is provided with a word extraction part 102 that extracts retrieval words included in a file to be processed, and an item specification part 104 that specifies a retrieval item expressing the use of each extracted retrieval word. An index storage part 44 retains index data created in a form in which each retrieval word is associated with its retrieval item. A comparison processing part 32 acquires target index data representing the retrieval conditions and compares the target index data with the index data of a plurality of files for similarity. A file extraction part 116 extracts, from the plurality of files, a file whose contents conceptually resemble the retrieval conditions. A result provision part 118 presents the extracted file.

COPYRIGHT: (C)2004,JPO

Description

Detailed Description of the Invention

[0001]

[Field of the Invention] The present invention relates to a method and a device for generating search index data, and to a file search device that can make use of them. In particular, the present invention relates to a technique for efficiently extracting a target file from a large number of files.

[0002]

[Description of the Related Art] In recent years, with the spread of PCs (personal computers), documents of every kind are increasingly being digitized. Files are created electronically with document creation software such as word processors and accumulate on hard disks. In corporate environments, computers are commonly networked so that large numbers of document files can be shared among multiple users. Now that the Internet is widespread, the amount of data received from outside the company, such as web pages and e-mail, is also growing. A designated administrator sometimes classifies such files in advance so that anyone can find a desired file among them.

[0003]

[Problems to Be Solved by the Invention] One way to classify a plurality of files is to group them according to their contents and store each group in a separate folder. However, it is difficult to define file groups unambiguously. Moreover, even if a particular administrator classifies a large number of files in advance, the classification criteria depend on that administrator's subjective judgment, which can actually make searching harder. Because neither management nor search is easy, it is by no means unusual for valuable material to lie dormant without being reused.

[0004] Meanwhile, some files shared among multiple users are highly useful to many users, while others are useful to only a small fraction of users in the first place. Some are reused frequently and others are not. Consequently, most of a large collection of files is likely to be unnecessary for any given user. Classifying files by type while such files are mixed together does not necessarily make searching easier. If only the truly useful files could be found easily, many users would work more efficiently.

[0005] The present inventor made the present invention on the basis of the above recognition. An object of the invention is to provide a technique for efficiently extracting a target file from a large number of files by a highly convenient method. Another object of the invention is to provide a technique for accurately generating index data that indicates the concept of a file to be processed.

[0006]

[Means for Solving the Problems] One aspect of the present invention relates to a method of generating search index data. In this method, when a phrase contained in a file to be searched is referred to as a search term, a search item expressing the use of that search term is specified prior to comparison with a search condition, so that the similarity of phrases can be judged according to their use, and the search term is held in association with the search item. Index data may be generated in a form in which search terms are associated with search items.

[0007] Here, a file may be of various formats, including a document file, a file generated using a description language such as HTML or XML, and an image file. A file as referred to here may also be a record containing a plurality of items, or a group of data consisting of the attributes of some entity. An attribute may be a concrete property of the entity. A search item may be an item in a record or an attribute of an entity. A file may consist of natural-language text, or may be in a format in which search terms are entered in association with preset search items. A file may also combine natural-language text with such a format.

[0008] Another aspect of the present invention relates to a method of generating index data for a file to be searched by statistical processing based on the appearance frequency of the character strings contained in that file. In this method, character strings that represent the same concept but are used for different purposes are treated as different character strings in the statistical processing.

[0009] Another aspect of the present invention relates to a device for generating search index data. This device comprises: a phrase extraction unit that extracts phrases contained in a file to be processed; an item specification unit that, so that the similarity of phrases can be judged according to their use when the extracted phrases are referred to as search terms, specifies a search item expressing the use of each search term; and an index data holding unit that holds each search term in association with its search item.

[0010] Yet another aspect of the present invention also relates to a device for generating search index data. This device comprises: a phrase acquisition unit that has a search term to be referred to at search time entered at a predetermined location associated with a search item expressing the use of that search term, so that the similarity of phrases can be judged according to their use, and that acquires the entered search term in association with that search item; and an index data holding unit that holds the search term in association with the search item.

[0011] Here, the phrase acquisition unit may acquire search terms in a fill-in-the-blank format, such as a questionnaire, pamphlet, catalog, or specification table, in which search terms are entered in association with preset search items. When natural-language text is entered in a field for which no search item has been set, the phrase acquisition unit may extract search terms from the text and specify a search item expressing the use of each extracted search term. In doing so, the phrase acquisition unit may extract the search terms by statistical processing based on the appearance frequency of the phrases contained in the text. The search item may also be specified by referring to search items preset in other fields of the same file or in other files.

[0012] This device may further comprise a statistical processing unit that performs statistical processing based on the appearance frequency of each combination of a search term and its associated search item, and the index data holding unit may hold the result of the statistical processing. The statistical processing unit may treat search terms that have the same wording but are associated with different search items as different terms. The statistical processing unit may also treat search terms that have different wording but are substantially synonymous in view of their associated search items as the same term.
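The counting per combination of search item and search term described above can be sketched as follows. This is a minimal illustration in Python, not the implementation of the specification; the function name `count_item_term_pairs` and the sample items and terms are hypothetical.

```python
from collections import Counter


def count_item_term_pairs(pairs):
    """Count appearances per (search item, search term) combination.

    Hypothetical helper: keying on the pair rather than on the term
    alone keeps identical wording under different items apart.
    """
    return Counter(pairs)


# Illustrative data only: "glass" appears under two different search items.
observed = [
    ("material", "glass"),
    ("material", "glass"),
    ("product", "glass"),
]
pair_counts = count_item_term_pairs(observed)
```

Because the key is the pair, "glass" as a material and "glass" as a product are counted separately even though the wording is identical.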

[0013] This device may further comprise a similar-term adjustment unit that treats as identical those search terms which, although different from each other, are associated with the same search item and are substantially synonymous in view of that search item. The device may further comprise an item-by-item similar-term storage unit that holds, in association with each search item, the search terms expected to be entered for that item. When one of the search terms associated with the same search item is an abstract term, the similar-term adjustment unit refers to the similar-term storage unit and associates the search term with more concrete terms.
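One possible reading of the item-by-item similar-term handling is sketched below in Python. The table layouts, the function names `unify_term` and `concretize`, and the sample vocabulary are all assumptions made for illustration; the specification does not define a data format.

```python
def unify_term(item, term, similar_terms_by_item):
    """Map a term to its representative form under the given search item.

    Terms with no entry for that item are returned unchanged.
    """
    return similar_terms_by_item.get(item, {}).get(term, term)


def concretize(item, term, concrete_terms_by_item):
    """Map an abstract term to the concrete terms registered for the item."""
    return concrete_terms_by_item.get(item, {}).get(term, [term])


# Hypothetical tables standing in for the similar-term storage unit.
SIMILAR = {"price": {"cheap": "low", "inexpensive": "low"}}
CONCRETE = {"color": {"warm color": ["red", "orange"]}}

unified = [unify_term("price", t, SIMILAR) for t in ("cheap", "inexpensive")]
concrete = concretize("color", "warm color", CONCRETE)
```

Looking terms up per item means that "cheap" is unified with "inexpensive" only under the "price" item; under any other item it is left as it is.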

[0014] This device may further comprise a setting unit that sets the entity that is the subject of the search condition, and the statistical processing unit may apply a higher weight in the statistical processing to a search term whose associated search item relates to that entity. A search item relates to the entity when, for example, the search item contains the entity or the search item is an attribute of the entity.
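The higher weighting of search terms whose search item relates to the entity can be sketched as follows. The multiplier `boost`, the function name `weight_pair_counts`, and the sample data are hypothetical; the specification does not state how much the weight is raised.

```python
def weight_pair_counts(pair_counts, entity_items, boost=2.0):
    """Scale counts of (item, term) pairs whose item relates to the entity.

    `entity_items` is the set of search items regarded as related to the
    entity, for example its attributes. `boost` is an assumed multiplier.
    """
    return {
        (item, term): count * (boost if item in entity_items else 1.0)
        for (item, term), count in pair_counts.items()
    }


# Illustrative data: the entity is a car, and displacement is one of its attributes.
counts = {("displacement", "2000cc"): 3, ("dealer", "Tokyo"): 3}
weights = weight_pair_counts(counts, entity_items={"displacement"})
```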

[0015] This device may further comprise a setting unit that sets the entity that is the subject of the file to be processed, and the item specification unit may specify search items in consideration of the attributes of that entity. An attribute may be a concrete property of the entity. The setting unit may identify the concept of the file by statistical processing based on the appearance frequency of the phrases contained in the file, and set the entity on the basis of that concept. A search item may also be specified by referring to other preset search items in the same file, or to search items contained in other files for which the same entity has been set.

[0016] Another aspect of the present invention relates to a file search device. This device comprises: a target acquisition unit that, for each of a plurality of files to be searched, acquires as index data a search item expressing the use of each search term, associated with that search term, so that the similarity of phrases can be judged according to their use when phrases contained in the file are referred to as search terms; a condition acquisition unit that acquires target index data representing a search condition; a file extraction unit that compares the target index data with the index data of the plurality of files and, based on the similarity between the index data, extracts from the plurality of files a file whose contents are conceptually similar to the search condition; and a result presentation unit that presents the extracted file. The index data may be generated prior to the search, or may be generated substantially in real time at search time. The target index data may be generated by the same technique as the index data to be searched.

[0017] This device may be implemented as a single PC, or as a system combining a server and user terminals connected to one another via a network. In the latter case, each functional block to be included in the device may reside in either the server or the user terminals that make up the system. For example, the target acquisition unit, the condition acquisition unit, the file extraction unit, and the result presentation unit may each be included in the server, the user terminal, or both, and may be referred to by the same names in every case. When these functions are provided as software modules, they may be executed on either the server or the user terminal.

[0018] Any combination of the above components, and any conversion of the components or expressions of the present invention between a method, a device, a system, a computer program, a recording medium storing a computer program, and the like, is also valid as an aspect of the present invention.

[0019]

[Embodiments of the Invention] (Base technology) The file search device of this base technology searches a plurality of files for files similar to a passage of text that the user specifies as the search condition. This makes files easy to find even if they have not been classified by content in advance, and reduces the burden of managing a large number of files.

[0020] FIG. 1 is a functional block diagram showing the configuration of the file search device in the base technology. The file search device 10 has: a processing unit 20 that performs the processing needed to generate the index data referred to when a desired file is searched for among a plurality of files; a search unit 30 that performs search processing based on conditions specified by the user; a holding unit 40 that holds the plurality of files to be searched (hereinafter, "stored files") and the data needed for search processing; and an input/output processing unit 50 that handles data input and output between the device and the outside.

[0021] In hardware terms, the file search device 10 can be realized with components such as the CPU and memory of a computer; in software terms, it can be realized by, for example, a program with file management and file search functions. The figure, however, depicts functional blocks realized by their cooperation. These functional blocks can therefore be realized in various forms by combinations of hardware and software.

[0022] The processing unit 20 processes the plurality of stored files held by the holding unit 40 and extracts a plurality of characteristic character strings from each. These characteristic character strings are taken to form a concept that succinctly represents the contents of the stored file, and this concept is recorded as index data. The processing unit 20 includes an analysis processing unit 22 that performs language analysis on the character strings contained in a stored file, and a generation processing unit 24 that generates index data based on the analysis result.

[0023] The analysis processing unit 22 includes a preprocessing unit 26 and a character string extraction unit 27. The preprocessing unit 26 performs preprocessing prior to language analysis. For example, it may detect the file format or document format of the stored file to be processed and, based on this, convert the stored file into an unstructured format such as plain text so that it is easy to analyze. It may also divide a single stored file into a plurality of blocks to put it in a state suitable for analysis. Techniques such as morphological analysis, syntactic analysis, and semantic analysis may be used for this.

[0024] The character string extraction unit 27 extracts a plurality of character strings from the stored file being processed. It may extract words that appear in a word dictionary, described later, or it may recognize character strings delimited by spaces or blanks as words.
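The two extraction styles mentioned above, dictionary lookup and splitting on spaces, can be sketched as follows. The greedy longest-match scan is an assumption chosen for illustration; the specification does not say how dictionary words are located, and both function names are hypothetical.

```python
def extract_by_dictionary(text, dictionary):
    """Scan the text left to right, taking the longest dictionary word at each position.

    Characters that start no dictionary word are skipped.
    """
    words = sorted((w for w in dictionary if w), key=len, reverse=True)
    found = []
    i = 0
    while i < len(text):
        for word in words:
            if text.startswith(word, i):
                found.append(word)
                i += len(word)
                break
        else:
            i += 1  # no dictionary word starts here
    return found


def extract_by_whitespace(text):
    """Treat every run of characters delimited by whitespace as a word."""
    return text.split()


from_dictionary = extract_by_dictionary("検索索引データ", {"検索", "索引", "データ", "索引データ"})
from_spaces = extract_by_whitespace("file search device")
```

Dictionary lookup suits text written without spaces, such as Japanese, while whitespace splitting suits languages that already delimit words.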

[0025] The generation processing unit 24 includes a statistical processing unit 28 and an index generation unit 29. The statistical processing unit 28 counts the appearance frequency of each extracted character string in the stored file, and also counts the appearance frequency of that character string across all of the stored files held by the file storage unit 42. Similarity between character strings is taken into account in this counting. For example, the appearance frequency is counted after absorbing differences among words that are defined as near-synonyms, synonyms, or controlled terms and are close to one another in meaning.

[0026] The index generation unit 29 generates index data based on the appearance frequencies counted by the statistical processing unit 28. The index data is structured as a list of the extracted character strings, each with a weight according to its appearance frequency. A character string is weighted more heavily the more often it appears in the stored file being processed, and less heavily the more often it appears across all of the stored files held by the file storage unit 42. As a result, character strings peculiar to that stored file are brought out by statistical means. When a stored file has been divided into a plurality of blocks by the preprocessing unit 26, index data is generated for each block.
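The specification gives no formula for this weighting. One standard scheme with the stated property is TF-IDF, used here purely as an assumed stand-in: the count within the file raises the weight, and the number of files containing the string lowers it. The function name `build_index_data` and the sample files are hypothetical.

```python
import math
from collections import Counter


def build_index_data(stored_files):
    """Return {file name: {character string: weight}} using a TF-IDF style weight.

    `stored_files` maps each file name to the list of character strings
    already extracted from it. weight = count in file * log(N / files containing it).
    """
    n_files = len(stored_files)
    file_freq = Counter()
    for strings in stored_files.values():
        file_freq.update(set(strings))  # count each string once per file

    index_data = {}
    for name, strings in stored_files.items():
        counts = Counter(strings)
        index_data[name] = {
            s: c * math.log(n_files / file_freq[s]) for s, c in counts.items()
        }
    return index_data


# Illustrative data: "report" appears in every file, "engine" only in one.
index_data = build_index_data({
    "a": ["engine", "engine", "report"],
    "b": ["budget", "report"],
})
```

In this sketch a string that occurs in every stored file receives a weight of zero, so only strings peculiar to a file contribute to its index data.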

[0027] The holding unit 40 includes a file storage unit 42, an index storage unit 44, a dictionary storage unit 46, and a related data storage unit 48. The file storage unit 42 holds the plurality of stored files. These include files of various formats, such as document files generated by document creation software such as word processors and files generated using description languages such as HTML (HyperText Markup Language) and XML (eXtensible Markup Language), and their contents need not be prose. The stored files themselves need not be classified or put into a standard form in advance for the purpose of searching.

[0028] The index storage unit 44 holds the index data generated by the processing unit 20 in association with the corresponding stored files. The dictionary storage unit 46 holds data referred to in the language analysis and statistical processing by the processing unit 20, such as a word dictionary, a near-synonym dictionary, a synonym dictionary, and a controlled-vocabulary dictionary. The related data storage unit 48 holds data used optionally in the processing by the search unit 30. For example, it holds a related-term dictionary that is referred to in order to replace a word specified in a search condition with a broader term, a narrower term, or a related term. The processing unit 20 may generate such data by extracting it from the stored files.

[0029] The search unit 30 receives a search condition from the user and extracts stored files that match it from the file storage unit 42. The search unit 30 includes a comparison processing unit 32 that compares the search condition with the index data, and a result processing unit 34 that presents the stored files matching the search condition to the user based on the comparison result.

[0030] The comparison processing unit 32 includes a condition setting unit 36 and a similarity determination unit 37. The condition setting unit 36 acquires a search condition from the user. The search condition may take the form of text written in natural language, or the form of a file containing some character strings. The search condition is sent to the processing unit 20, where it undergoes the same processing as in the index data generation described above, and a concept of the search condition is generated.

[0031] The similarity determination unit 37 detects the similarity between the search condition and each stored file by comparing the concept of the search condition with the concepts recorded as index data. In the comparison, the search condition may be supplemented by adding other character strings related to the character strings it contains, based on the various dictionaries held by the dictionary storage unit 46 and the related data storage unit 48.
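Supplementing a search condition with related character strings can be sketched as follows. The dictionary layout (a mapping from a string to a list of related strings), the function name `supplement_condition`, and the sample entries are assumptions for illustration.

```python
def supplement_condition(strings, related_dictionary):
    """Append related strings to the strings of a search condition.

    The original order is kept, and a related string is added only once.
    """
    supplemented = list(strings)
    for s in strings:
        for related in related_dictionary.get(s, []):
            if related not in supplemented:
                supplemented.append(related)
    return supplemented


# Hypothetical related-term dictionary: a broader and a narrower term for "car".
RELATED = {"car": ["vehicle", "sedan"]}
expanded = supplement_condition(["car", "price"], RELATED)
```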

[0032] A vector space model is used to compare the search condition with the index data. That is, the concept of the search condition and the concept of each index data entry are each expressed as a vector in a multidimensional space, and the vectors are compared. When a concept contains n character strings, an n-dimensional vector space is formed, and each component is given a weight according to the appearance frequency of the corresponding character string. The closeness of the vectors formed in this way is the similarity between the search condition and the stored file.
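The specification does not name the measure of closeness between vectors. Cosine similarity is a common choice in vector space models and is used here only as an assumed example; the function name `cosine_similarity` and the sample weights are hypothetical. Each concept is held as a mapping from character string to weight, so strings absent from a concept count as zero components.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse weight vectors.

    `a` and `b` map character strings to weights. Returns 0.0 when
    either vector has zero length.
    """
    dot = sum(weight * b.get(s, 0.0) for s, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


condition = {"pump": 1.0, "valve": 1.0}
same_direction = cosine_similarity(condition, {"pump": 2.0, "valve": 2.0})
unrelated = cosine_similarity(condition, {"budget": 3.0})
```

Because the cosine depends only on direction, a long file and a short search condition with the same proportions of weights are judged highly similar.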

[0033] The result processing unit 34 includes a list generation unit 38 and a display processing unit 39. The list generation unit 38 generates a list of stored files in descending order of similarity. The number of stored files included in the list may be limited to a suitable number.
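Generating the list in descending order of similarity, with an optional cap on its length, can be sketched as follows. The function name `ranked_list`, the `limit` parameter, and the sample scores are hypothetical.

```python
def ranked_list(similarities, limit=None):
    """Return (file name, similarity) pairs sorted from most to least similar.

    `limit`, if given, caps the number of entries in the list.
    """
    ranked = sorted(similarities.items(), key=lambda entry: entry[1], reverse=True)
    return ranked if limit is None else ranked[:limit]


top_two = ranked_list({"a.txt": 0.2, "b.txt": 0.9, "c.txt": 0.5}, limit=2)
```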

[0034] The display processing unit 39 displays the list of stored files on the screen as the search result. The list may consist of file names and summaries of their contents.

[0035] The input/output processing unit 50 is the interface through which data passes between the file search device 10 and the outside, such as instructions for various kinds of processing, input of search conditions, and output of search results. When the file search device 10 is realized as a stand-alone device, it is the interface between the user and the device; when the file search device 10 is realized as a network server, it is the communication interface that connects the device to client terminals via a network.

[0036] FIG. 2 is a flowchart showing the index data generation process in the base technology. First, a stored file to be processed is selected from the plurality of files (S10), preprocessing is applied to it (S12), and character strings are extracted from it by processing such as morphological analysis (S14). Statistical data such as the appearance frequency is calculated for each extracted character string (S16), and index data is generated from it (S18). If stored files for which index data has not yet been generated remain in the file storage unit 42 (Y in S20), steps S10 to S18 are applied to a remaining file, and this is repeated until all stored files have been processed (S20).

[0037] FIG. 3 is a flowchart showing the search process in the base technology. First, when the user specifies text as the search condition in natural language (S30), the processing unit 20 extracts character strings from the search condition and generates index data (S32). This index data is compared with the plurality of index data entries held by the index storage unit 44 to determine the similarity of each (S34), a list of stored files is generated in order of similarity (S36), and the list is displayed on the screen as the search result (S38).
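The search flow of FIG. 3 can be traced end to end with the following sketch. It is a simplification under stated assumptions: whitespace splitting and raw counts stand in for the language analysis and weighting, cosine similarity stands in for the unspecified closeness measure, and all function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def make_index(text):
    """S32: turn text into index data (here, lower-cased word counts)."""
    return Counter(text.lower().split())


def similarity(a, b):
    """S34: cosine similarity between two count vectors."""
    dot = sum(a[s] * b[s] for s in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(condition_text, index_store):
    """S30 to S36: index the condition the same way as the stored files, then rank."""
    target = make_index(condition_text)
    scores = {name: similarity(target, idx) for name, idx in index_store.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Stand-in for the index storage unit 44.
index_store = {
    "pump.txt": make_index("pump maintenance manual"),
    "budget.txt": make_index("annual budget report"),
}
results = search("manual for pump maintenance", index_store)
```

The point of the sketch is that the search condition passes through the same index generation as the stored files, so the comparison in S34 is between two pieces of index data of the same kind.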

[0038] The embodiment is described below in contrast to the base technology above. Functional blocks that serve the same function as those in the base technology are given the same names and reference numerals, and their description is omitted where appropriate.

[0039] In the embodiment below, as with the character strings described in the base technology, search terms are extracted from the stored files to be searched and from the file representing the search condition, and index data is generated. In doing so, each extracted search term is associated with a search item expressing its use, and the index data is generated in consideration of the combination of search term and search item. Handling search terms in association with search items makes it possible, for example, to distinguish identical terms that are used for different purposes and to treat different terms that are substantially synonymous as the same, so that the target file can be extracted accurately.

[0040] (Embodiment) In this embodiment, processing is performed on files in a format in which search terms are entered at predetermined locations associated with preset search items.

[0041] FIG. 4 is a functional block diagram showing the overall configuration of the search system in this embodiment. In the search system 80, the file search server 100 is connected to a plurality of user terminals 92 via the network 90. The file search server 100 holds a plurality of stored files to be searched. Each user terminal 92 is an information processing device such as a PC. The network 90 is, for example, the Internet.

[0042] FIG. 5 is a functional block diagram showing the configuration of the file search server. Like the search device 10 described in the base technology, the file search server 100 has a processing unit 20, a search unit 30, a holding unit 40, and an input/output processing unit 50. The input/output processing unit 50 sends data to and receives data from the user terminals 92 via the network 90.

[0043] The processing unit 20 includes an analysis processing unit 22 and a generation processing unit 24. In this embodiment, the analysis processing unit 22 includes a phrase extraction unit 102, an item specifying unit 104, and a correspondence list generation unit 106 in place of the character string extraction unit 27 of the base technology.

[0044] The phrase extraction unit 102 extracts the search terms contained in each file. The item specifying unit 104 identifies, for each extracted search term, the search item that expresses its use. The item specifying unit 104 also adjusts the wording of search items, for example by unifying the labels of search items that are worded differently but are substantially synonymous. In doing so, it may take the subject of the file being processed into account. The correspondence list generation unit 106 generates a correspondence list in which each search term is associated with its search item.

[0045] In addition to the statistical processing unit 28 and the index generation unit 29 of the base technology, the generation processing unit 24 includes a similar phrase adjustment unit 108. The similar phrase adjustment unit 108 treats as identical a plurality of search terms that are worded differently but are associated with the same search item and are substantially synonymous when that search item is taken into account. For example, when one of the search terms associated with the same search item is an abstract term, the similar phrase adjustment unit 108 associates the search terms with each other by considering the full range of concrete terms that could be entered for that search item.

[0046] In this embodiment, the statistical processing unit 28 counts the appearance frequency of each combination of an extracted search term and the search item associated with it. The index generation unit 29 generates index data in which each search term is associated with its search item. The index data is structured as a list of the extracted combinations of search term and search item, each given a weight according to its appearance frequency. The weighting may also take the subject of the file and the search items into account. For example, search terms associated with search items that relate to the subject of the file may be given a higher weight. Weighting in this way makes it possible to give a high weight to important occurrences of a term and a low weight to unimportant occurrences of the same term, so that index data accurately representing the concept of the file can be generated.
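The weighting described in this paragraph can be illustrated with a minimal Python sketch. It is only one possible reading: the function name `build_index`, the `subject_items` set, and the multiplicative `boost` factor are assumptions introduced for illustration and do not appear in the specification, which does not fix a concrete weighting formula.

```python
from collections import Counter

def build_index(pairs, subject_items=frozenset(), boost=2.0):
    """Weight each (search term, search item) pair by its appearance frequency.

    Pairs whose search item is in `subject_items` are multiplied by `boost`.
    Both the set and the factor are illustrative assumptions.
    """
    counts = Counter(pairs)
    return {
        (term, item): n * (boost if item in subject_items else 1.0)
        for (term, item), n in counts.items()
    }

# "red" occurs under two different search items and is counted separately.
pairs = [("red", "body color"), ("red", "logo mark"), ("red", "body color")]
index = build_index(pairs, subject_items={"body color"})
```

Here the same term receives a higher weight under the subject-related item "body color" than under "logo mark", which is the effect the paragraph describes.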

[0047] In addition to the file storage unit 42, the index storage unit 44, the dictionary storage unit 46, and the related data storage unit 48 of the base technology, the holding unit 40 includes a correspondence list storage unit 110, an item candidate storage unit 112, and a similar phrase storage unit 114.

[0048] The correspondence list storage unit 110 holds each correspondence list generated by the correspondence list generation unit 106 in association with its stored file. The item candidate storage unit 112 holds candidates for search items in association with the entity that is the subject of a stored file. A search item may be an attribute of the entity of the stored file. The item candidate storage unit 112 also holds, in association with one another, a plurality of search items that are worded differently but are substantially synonymous.

[0049] The item specifying unit 104 may identify a search item by referring to the item candidate storage unit 112. When the item specifying unit 104 identifies a search item that is not yet held in the item candidate storage unit 112, it may have that search item stored there in association with the entity. Accumulating new search items in association with the entity in this way makes subsequent identification of search items easier.

[0050] The similar phrase storage unit 114 holds, in association with a search item, a plurality of search terms that are worded differently but become substantially synonymous when considered in combination with that search item. For example, when the search item is "age", the search term may be entered as a specific number or as an abstract term such as "young", "middle-aged", or "old". The similar phrase storage unit 114 holds abstract terms such as "young" and concrete numerical terms such as "15 years old" in association with each other. When the concrete terms are numerical values, as in this example, the similar phrase storage unit 114 may hold an intermediate numerical value as a reference, and the similar phrase adjustment unit 108 may judge whether search terms are similar by comparing them with the reference value.
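The "age" example can be sketched as follows. This is a hedged illustration only: the category boundaries in `AGE_RANGES` and the helper names are hypothetical, and the specification speaks only of holding a reference value and comparing search terms against it.

```python
from typing import Optional

# Hypothetical reference ranges (inclusive lower bound, exclusive upper bound).
AGE_RANGES = {"young": (0, 30), "middle-aged": (30, 65), "old": (65, 200)}

def age_category(term: str) -> Optional[str]:
    """Map a search term entered for the item "age" to an abstract category."""
    if term in AGE_RANGES:
        return term
    digits = "".join(ch for ch in term if ch.isdigit())
    if not digits:
        return None  # neither an abstract term nor a numerical one
    age = int(digits)
    for label, (low, high) in AGE_RANGES.items():
        if low <= age < high:
            return label
    return None

def similar_age_terms(a: str, b: str) -> bool:
    """Treat two terms as synonymous when they fall in the same category."""
    ca, cb = age_category(a), age_category(b)
    return ca is not None and ca == cb
```

With these assumed ranges, `similar_age_terms("young", "15 years old")` holds, while "old" and "15 years old" are kept apart.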

[0051] In addition to the comparison processing unit 32 and the result processing unit 34 of the base technology, the search unit 30 includes a file extraction unit 116 and a result presentation unit 118. In this embodiment, the condition setting unit 36 of the comparison processing unit 32 acquires the search condition in a format, such as a questionnaire, in which search terms are entered at predetermined locations associated with preset search items. The file representing the search condition is sent to the processing unit 20, where index data for the search condition is generated by the same processing as for the stored files described above.

[0052] The file extraction unit 116 compares the index data of the file representing the search condition with the index data of the stored files and, on the basis of the similarity between the index data, extracts from the plurality of files those whose content is conceptually similar to the search condition. The result presentation unit 118 presents the extracted files.
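The comparison step can be sketched in Python as below. The specification does not name a similarity measure here, so cosine similarity over weighted (search term, search item) pairs is an assumption chosen for illustration, as are the function names and the sample data.

```python
import math

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two index dicts keyed by (term, item)."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_files(query: dict, stored: dict) -> list:
    """Return (file name, similarity) pairs in descending order of similarity."""
    scored = [(name, cosine(query, idx)) for name, idx in stored.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)

query = {("red", "body color"): 1.0, ("sedan", "body type"): 1.0}
stored = {
    "car_a": {("red", "body color"): 1.0, ("sedan", "body type"): 1.0},
    "car_b": {("red", "logo mark"): 1.0, ("sedan", "body type"): 1.0},
}
ranking = rank_files(query, stored)
```

Because the keys include the search item, "red" as a logo mark color in `car_b` does not contribute to the match against "red" as a body color.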

[0053] The stored files and the file representing the search condition may include fields, such as a remarks field or an "other" field, for entering natural-language text that is not associated with any predetermined search item. In this case, the phrase extraction unit 102 extracts search terms from the natural-language text. The search terms may be extracted in the same way as the character strings described in the base technology. The item specifying unit 104 identifies the search item that indicates the use of each extracted search term. It may do so by referring to the item candidate storage unit 112, or by using techniques such as morphological analysis, syntactic analysis, and semantic analysis.

[0054] FIG. 6 shows an example of the internal configuration of the item candidate storage unit 112. The subject of this file is a product guide for cars, and the entity is "car". The item candidate storage unit 112 holds "product name", "manufacturer", "country of production", "car color", and so on as search item candidates in association with the entity. In addition, "vehicle type" is held as a search item synonymous with "product name", and "body color" as a search item synonymous with "car color", for example. Holding differently worded but substantially synonymous search items in association with one another in this way makes it easy to compare files that do not share a fixed format.
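One way to picture the item candidate storage unit 112 is as a small lookup structure. The class below is a sketch under assumptions: the class name, method names, and the choice of one canonical label per synonym group are all hypothetical and are not taken from the specification.

```python
class ItemCandidateStore:
    """Holds candidate search items per entity, plus synonym links."""

    def __init__(self):
        self._items = {}     # entity -> set of canonical search item labels
        self._synonyms = {}  # variant label -> canonical label

    def register(self, entity, item, synonyms=()):
        """Store a search item for an entity, with optional synonymous labels."""
        self._items.setdefault(entity, set()).add(item)
        for variant in synonyms:
            self._synonyms[variant] = item

    def canonical(self, label):
        """Return the unified label for a search item (unchanged if unknown)."""
        return self._synonyms.get(label, label)

    def is_candidate(self, entity, label):
        return self.canonical(label) in self._items.get(entity, set())

store = ItemCandidateStore()
store.register("car", "product name", synonyms=["vehicle type"])
store.register("car", "car color", synonyms=["body color"])
store.register("car", "manufacturer")
```

Calling `register` again for a newly identified search item corresponds to the accumulation of new items per entity mentioned earlier in the description.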

[0055] FIG. 7 shows an example of a stored file to be searched. Stored files to be searched are, for example, product catalogs and pamphlets. Here, the stored file is displayed on the user terminal 92 as a car product guide screen 130. The screen 130 consists of a vehicle type field 132, a country of production field 134, a manufacturer field 136, a body type field 138, a capacity field 140, a body color field 142, a price field 144, and a remarks field 146. The phrase extraction unit 102 extracts "B234", "Germany", "Company B", "sedan", "5 people", "red", "3 million yen", and so on as search terms.

[0056] The phrase extraction unit 102 also extracts "sporty", "red", "logo mark", and so on as search terms from the natural-language text entered in the remarks field 146. The item specifying unit 104 identifies "vehicle type" as the search item for "B234", "country of production" for "Germany", "manufacturer" for "Company B", "body type" for "sedan", "capacity" for "5 people", "body color" for "red", and "price" for "3 million yen". Because no search item is set for the remarks field 146, the item specifying unit 104 identifies the search items for its terms by, for example, referring to the item candidate storage unit 112: "car shape" for "sporty", "logo mark color" for "red", and "car pattern" for "logo mark".

[0057] FIG. 8 shows the correspondence list generated from the target file shown in FIG. 7. In the correspondence list 150, each search term is associated with its search item. Here, for example, the search term "red" is associated with both "body color" and "logo mark" as search items. Because the two occurrences of "red" are associated with different search items, they are handled as distinct even though the term is the same.
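A correspondence list of this kind can be sketched as a list of (search term, search item) pairs. The function name and the abbreviated sample form below are illustrative assumptions; the pairs for the remarks text are passed in ready-made because the specification leaves the item identification method open.

```python
def correspondence_list(fields, remarks_pairs=()):
    """Build (search term, search item) pairs from a form-style file.

    `fields` maps each search item to the term entered for it;
    `remarks_pairs` carries pairs already identified from free text.
    """
    pairs = [(term, item) for item, term in fields.items() if term]
    pairs.extend(remarks_pairs)
    return pairs

fields = {
    "vehicle type": "B234",
    "country of production": "Germany",
    "body color": "red",
}
remarks = [("sporty", "car shape"), ("red", "logo mark")]
pairs = correspondence_list(fields, remarks)

# The same term appears under two different search items.
red_items = sorted(item for term, item in pairs if term == "red")
```

Since each entry carries its search item, the two "red" entries remain distinguishable in any later counting or comparison.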

[0058] FIG. 9 shows a screen for entering search conditions when searching the stored files. This search screen 160 lets the user enter the conditions for the desired car. The search screen 160 consists of a subject field 162, a country of production field 164, a manufacturer field 166, a body type field 168, a body color field 170, a capacity field 172, and an "other" field 174.

[0059] Here, the user has left the country of production field 164 and the manufacturer field 166 blank, and has entered "sedan or light" in the body type field 168, "red" in the body color field 170, and "normal" in the capacity field 172. The phrase extraction unit 102 refers, for example, to the word dictionary in the dictionary storage unit 46 and extracts "sedan", "light", "red", "normal", "sports type", and so on as search terms. The item specifying unit 104 identifies "body type" as the search item for each of "sedan" and "light", "body color" for "red", and "capacity" for "normal". The search term "normal" is thus associated with the search item "capacity". If, for example, "5 people" or "6 people" is a normal capacity for a car, the similar phrase adjustment unit 108 may treat the search term "normal" as equivalent to the search terms "5 people" and "6 people" associated with the same search item "capacity".

[0060] Because no search item is set for the "other" field 174, the item specifying unit 104 identifies "car shape" as the search item for "sports type" by, for example, referring to the item candidate storage unit 112.

[0061] FIG. 10 shows the correspondence list generated from the search condition file shown in FIG. 9. In the correspondence list 180, each search term is associated with its search item. Here, "all" is associated as the search term with the search items "country of production" and "manufacturer". The similarity determination unit 37 of the comparison processing unit 32 may judge that the search condition is satisfied whatever search terms are associated with the search items "country of production" and "manufacturer" in the index data of a stored file.
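The treatment of blank fields can be sketched as a wildcard check. This is a simplification: the embodiment judges similarity rather than exact equality, so the strict comparison below, along with the function names and sample data, is an assumption made to keep the example short.

```python
WILDCARD = "all"  # term recorded for fields the user left blank

def item_matches(condition_term, stored_term):
    """A wildcard condition matches any stored term for that search item."""
    return condition_term == WILDCARD or condition_term == stored_term

def satisfies(condition, stored):
    """`condition` and `stored` each map search item -> search term."""
    return all(
        item_matches(term, stored.get(item)) for item, term in condition.items()
    )

condition = {
    "country of production": "all",
    "manufacturer": "all",
    "body color": "red",
}
car = {
    "country of production": "Germany",
    "manufacturer": "Company B",
    "body color": "red",
}
```

Under this sketch `satisfies(condition, car)` holds regardless of the stored country or manufacturer, while a car with a different body color would be rejected.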

[0062] FIG. 11 is a flowchart showing the process of generating index data from a file to be processed. When a file to be processed is selected from the plurality of files (S110), the phrase extraction unit 102 extracts search terms from that file (S112). The item specifying unit 104 identifies a search item for each extracted search term (S114) and adjusts the wording of the search items by referring to the item candidate storage unit 112 (S116). The correspondence list generation unit 106 generates the correspondence list (S118).

[0063] The similar phrase adjustment unit 108 reconciles search terms that are substantially synonymous (S120), and the statistical processing unit 28 counts the appearance frequency of each search term while also taking into account the search item associated with it (S122). The index generation unit 29 generates index data on the basis of the appearance frequencies of the search terms (S124).
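The flow of steps S112 to S124 can be summarized in a compact Python sketch for a form-style file. All names and the two synonym tables are hypothetical, and dividing counts by the total to obtain weights is an assumption; the specification says only that the index data is generated on the basis of appearance frequency.

```python
from collections import Counter

def generate_index(fields, item_synonyms, term_synonyms):
    """Sketch of steps S112-S124 for one form-style file.

    fields: list of (search item, entered term) tuples
    item_synonyms: variant item label -> unified item label
    term_synonyms: (term, item) -> representative term
    """
    # S112/S114: extract each term together with its search item
    pairs = [(term, item) for item, term in fields if term]
    # S116: unify the wording of search items
    pairs = [(t, item_synonyms.get(i, i)) for t, i in pairs]
    # S118: at this point `pairs` plays the role of the correspondence list
    # S120: merge terms that are synonymous under the same search item
    pairs = [(term_synonyms.get((t, i), t), i) for t, i in pairs]
    # S122: count occurrences of each (term, item) combination
    counts = Counter(pairs)
    # S124: turn counts into weights (normalisation is an assumption)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

fields = [
    ("capacity", "5 people"),
    ("capacity", "normal"),
    ("body color", "red"),
    ("vehicle type", "B234"),
]
index = generate_index(
    fields,
    item_synonyms={"body color": "car color"},
    term_synonyms={("normal", "capacity"): "5 people"},
)
```

In this sample, "normal" is merged into "5 people" under "capacity" before counting, so that combination receives twice the weight of the others.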

[0064] The present invention has been described above on the basis of an embodiment. The embodiment is an example, and those skilled in the art will understand that various modifications are possible in the combination of its constituent elements and processes, and that such modifications also fall within the scope of the present invention. Modifications are described below.

[0065] The embodiment described processing for files in a format in which search terms are entered at predetermined locations associated with preset search items, but index data may also be generated and searched for files consisting of text written in natural language. In this case, the file search server 100 may have a setting unit that sets the entity that is the subject of the file to be processed. The setting unit may, for example, have the user set the entity. Alternatively, the setting unit may generate index data representing the concept of the natural-language text using the base technology, which extracts character strings from natural-language text, and identify the entity of the file on the basis of that index data. The item specifying unit 104 may then identify search items by referring to the item candidate storage unit 112. The item specifying unit 104 may also identify a search item by referring to other preset search items in the same file or to search items contained in other files for which the same entity is set.

[0066] As another example, the file extraction unit 116 may extract files similar to the search condition on the basis of both the index data generated by statistical processing based on the appearance frequency of each combination with a search item, as described in the embodiment, and the index data generated by statistical processing based on the appearance frequency of character strings alone, without regard to search items, as described in the base technology.

[0067] For example, in the embodiment, search items were identified and associated even for search terms extracted from natural-language text, but search terms extracted from natural-language text may instead be processed on their own, without identifying search items. When a single file contains both a part in which search terms are entered at predetermined positions associated with preset search items and a part in which text is entered as natural language, the former part may be processed with each search term associated with its search item, and the latter part may be processed considering the search terms alone, as described in the base technology. In that case, the index data generated by the item-associated processing is compared with index data generated in the same way for other files, the index data generated by the term-only processing is likewise compared with index data generated in the same way for other files, and the file extraction unit 116 may extract files similar to the search condition taking both comparison results into account.
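How the two comparison results might be taken into account together can be sketched as follows. The specification does not say how they are combined, so the weighted average with mixing factor `alpha`, the `threshold`, and the function names are all assumptions made for illustration.

```python
def combined_score(item_score, text_score, alpha=0.5):
    """Blend the item-aware similarity with the term-only similarity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * item_score + (1.0 - alpha) * text_score

def extract(scores, threshold=0.6, alpha=0.5):
    """Return file names whose combined score reaches the threshold.

    scores: file name -> (item-aware similarity, term-only similarity)
    """
    ranked = sorted(
        ((name, combined_score(a, b, alpha)) for name, (a, b) in scores.items()),
        key=lambda x: x[1],
        reverse=True,
    )
    return [name for name, score in ranked if score >= threshold]

result = extract({"car_a": (0.9, 0.7), "car_b": (0.4, 0.5), "car_c": (0.6, 0.8)})
```

Setting `alpha` closer to 1 would favor the form fields, and closer to 0 the free-text part; which balance is appropriate would depend on the files being searched.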

[0068]

[Effects of the Invention] According to the present invention, files matching a search condition can be extracted from a large number of files by a highly convenient method.

[Brief description of drawings]

FIG. 1 is a functional block diagram showing the configuration of the file search device in the base technology.

FIG. 2 is a flowchart showing the process of generating index data in the base technology.

FIG. 3 is a flowchart showing the search process in the base technology.

FIG. 4 is a functional block diagram showing the overall configuration of the search system in this embodiment.

FIG. 5 is a functional block diagram showing the configuration of the file search server in this embodiment.

FIG. 6 is a diagram showing the internal configuration of the item candidate storage unit.

FIG. 7 is a diagram showing an example of a stored file to be searched.

FIG. 8 is a diagram showing the correspondence list generated from the target file shown in FIG. 7.

FIG. 9 is a diagram showing a screen for entering search conditions when searching the stored files.

FIG. 10 is a diagram showing the correspondence list generated from the search condition file shown in FIG. 9.

FIG. 11 is a flowchart showing the process of generating index data from a file to be processed.

[Explanation of Reference Numerals] 20: processing unit, 22: analysis processing unit, 24: generation processing unit, 26: preprocessing unit, 28: statistical processing unit, 29: index generation unit, 30: search unit, 32: comparison processing unit, 34: result processing unit, 40: holding unit, 42: file storage unit, 44: index storage unit, 46: dictionary storage unit, 48: related data storage unit, 50: input/output processing unit, 80: search system, 90: network, 92: user terminal, 100: file search server, 102: phrase extraction unit, 104: item specifying unit, 106: correspondence list generation unit, 108: similar phrase adjustment unit, 110: correspondence list storage unit, 112: item candidate storage unit, 114: similar phrase storage unit, 116: file extraction unit, 118: result presentation unit.

Claims

1. A method for generating search index data, characterized in that, when a phrase contained in a file to be searched is to be referenced as a search term, a search item expressing the use of the search term is identified and the search term is held in association with the search item prior to comparison with a search condition, so that the similarity of phrases can be judged according to their use.

2. A method for generating index data for a file to be searched by statistical processing based on the appearance frequency of character strings contained in the file, characterized in that character strings representing the same concept are treated as different character strings in the statistical processing when their uses differ.

3. An apparatus for generating search index data, comprising: a phrase extraction unit that extracts phrases contained in a file to be processed; an item specifying unit that, when an extracted phrase is to be referenced as a search term, identifies a search item expressing the use of the search term so that the similarity of phrases can be judged according to their use; and an index data holding unit that holds the search term in association with the search item.

4. An apparatus for generating search index data, comprising: an acquisition unit that acquires a search term to be referenced at search time, the search term having been entered at a predetermined location associated with a search item expressing the use of the search term so that the similarity of phrases can be judged according to their use, in association with that search item; and an index data holding unit that holds the search term in association with the search item.

5. The apparatus according to claim 3 or 4, further comprising a statistical processing unit that performs statistical processing based on the appearance frequency of each combination of a search term and the search item associated with it, wherein the index data holding unit holds the result of the statistical processing.

6. The apparatus according to claim 5, wherein the statistical processing unit performs the statistical processing after associating with one another search terms that are worded differently but are associated with the same search item and are substantially synonymous when that search item is taken into account.

7. The apparatus according to claim 5, further comprising a setting unit that sets an entity that is the subject of a search condition, wherein the statistical processing unit performs the statistical processing with a higher weight for a search term when the search item associated with that search term relates to the entity.

8. The apparatus according to claim 3, further comprising a setting unit that sets the entity that is the subject of the file to be processed, wherein the item specifying unit identifies the search item taking the attributes of the entity into account.

9. A file search device comprising: a target acquisition unit that acquires, as index data for each of a plurality of files to be searched, search terms associated with search items expressing the use of those terms, so that when phrases contained in a file are referenced as search terms the similarity of phrases can be judged according to their use; a condition acquisition unit that acquires target index data representing a search condition; a file extraction unit that compares the target index data with the index data of the plurality of files and, on the basis of the similarity between the index data, extracts from the plurality of files those whose content is conceptually similar to the search condition; and a result presentation unit that presents the extracted files.

10. A computer program characterized by causing a computer to execute: a process of extracting phrases contained in a file to be processed; a process of identifying, when an extracted phrase is to be referenced as a search term, a search item expressing the use of the search term so that the similarity of phrases can be judged according to their use; and a process of holding the search term in association with the search item.