JPH08287081A

JPH08287081A - Retrieving device for data with similarity

Info

Publication number: JPH08287081A
Application number: JP7116599A
Authority: JP
Inventors: Hiroshi Masuichi; 博増市
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1995-04-19
Filing date: 1995-04-19
Publication date: 1996-11-01

Abstract

PURPOSE: To determine, easily and accurately, the data that is highly similar to all of a plurality of data items, in a device for retrieving data with similarity. CONSTITUTION: The device comprises a storage means (a) for data with similarity, which holds a set of data together with similarity data for each pair of data items; a display/input means (b) for a similarity retrieval expression, which displays and accepts input of a similarity retrieval expression including a condition data set that serves as the condition of the similarity retrieval; a similarity retrieval means (c), which determines a retrieved data set by similarity retrieval using the condition data set; and a retrieval result holding means (d), which holds the retrieved data set obtained by the similarity retrieval means (c). The similarity retrieval means (c) determines the retrieved data set for which a value exceeds a prescribed threshold that defines the limit of similarity, the value being obtained from the pairwise similarities held in the storage means (a), the condition data set input through the display/input means (b), the retrieved data set held by the retrieval result holding means (d), and a prescribed expression for calculating the similarity between the condition data set and the data with similarity.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a device for retrieving data to which similarities have been assigned.

[0002] (Prior Art 1) To find the data a user wants in a large body of data, a search expression is written and used to narrow down the data. Each data item carries search keys such as the following:
・ attributes such as creation date and author
・ keywords
・ words contained in the text portion of the data (full-text search)
・ the name of the category that contains the data
A search expression is built by combining search-key specifications with logical operators such as AND, OR, and NOT. For example, the search expression that specifies the keywords "light" and "fiber" and restricts the creation date to the range from January 1993 to August 1994 is:
(FK = (light OR fiber)) AND (PD = 19930101:19940831)
Each search-key specification yields a set of data, and the logical operations named by the operators are applied to those sets to narrow down the data.
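As an illustration only (it is not part of the original disclosure), the narrowing described for Prior Art 1 can be sketched with ordinary set operations. The `Record` type, the sample records, and the helper names `by_keyword` and `by_created` are hypothetical; only the search expression itself comes from the text above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """Hypothetical data item with two of the search keys named in the text."""
    doc_id: str
    keywords: frozenset
    created: date


def by_keyword(records, word):
    """IDs of records whose keyword list contains `word` (the FK key)."""
    return {r.doc_id for r in records if word in r.keywords}


def by_created(records, start, end):
    """IDs of records created within [start, end] (the PD key)."""
    return {r.doc_id for r in records if start <= r.created <= end}


# Made-up sample data, chosen only to exercise each operator.
records = [
    Record("doc1", frozenset({"light", "laser"}), date(1993, 5, 1)),
    Record("doc2", frozenset({"fiber"}), date(1995, 2, 1)),
    Record("doc3", frozenset({"fiber", "cable"}), date(1994, 8, 31)),
    Record("doc4", frozenset({"motor"}), date(1993, 7, 7)),
]

# (FK = (light OR fiber)) AND (PD = 19930101:19940831)
hits = (by_keyword(records, "light") | by_keyword(records, "fiber")) & by_created(
    records, date(1993, 1, 1), date(1994, 8, 31)
)
```

OR maps to set union and AND to set intersection, so each search key only has to produce a set of matching IDs.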

[0003] (Prior Art 2) When a similarity is assigned to each pair of data items, "the set of data with high (or low) similarity to a given data item" can be obtained. Several such sets can be combined with logical operators such as AND, OR, and NOT, in the same way as in Prior Art 1, to narrow down the data. For example, the data can be narrowed down by (the set of data with high similarity to dataA) AND (the set of data with high similarity to dataB).
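The Prior Art 2 approach can be sketched as follows. This is a minimal illustration under stated assumptions: the similarity table, the item `dataC` and `dataD`, the threshold of 70, and the function name `high_similarity_set` are all hypothetical, and "high similarity" is taken to mean "at least the threshold".

```python
def high_similarity_set(sim, item, threshold):
    """Items other than `item` whose similarity to `item` is at least `threshold`."""
    return {
        other
        for other, value in sim[item].items()
        if other != item and value >= threshold
    }


# Hypothetical symmetric similarity table in percent.
sim = {
    "dataA": {"dataA": 100, "dataB": 30, "dataC": 85, "dataD": 90},
    "dataB": {"dataA": 30, "dataB": 100, "dataC": 40, "dataD": 80},
    "dataC": {"dataA": 85, "dataB": 40, "dataC": 100, "dataD": 55},
    "dataD": {"dataA": 90, "dataB": 80, "dataC": 55, "dataD": 100},
}

# (set of data highly similar to dataA) AND (set of data highly similar to dataB)
narrowed = high_similarity_set(sim, "dataA", 70) & high_similarity_set(sim, "dataB", 70)
```

Each per-item set is fixed before the AND is applied, which is the property the following paragraphs identify as the source of the problem.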

[0004]

[Problems to be Solved by the Invention] Consider the case where the pairwise similarities among five data items (data1 to data5) are set as shown in FIG. 3. Similarity values are expressed in percent (%), and a larger value indicates a higher similarity. Suppose that, following Prior Art 2, we want "the set of data with high similarity to both data1 and data2". With a threshold of 70 (%), "the set of data with high similarity to data1" is {data3, data4}, and likewise "the set of data with high similarity to data2" is {data3, data4}. "The set of data with high similarity to both data1 and data2" is therefore the logical product (intersection) of these two sets, which is again {data3, data4}. In fact, however, data4 is far more similar to both data1 and data2 than data3 is. Moreover, the similarity between data4 and data5 is high. It can therefore be said that data5, rather than data3, is "data with high similarity to both data1 and data2". This example shows that the following problem remains in Prior Art 2: when "sets of data similar to a given data item" are determined in advance and then narrowed down with logical operations, "data with high similarity to all of a plurality of data items" cannot be obtained accurately.

[0005] This problem arises because "the set of data similar to a given data item" is fixed in advance. In the example above, only once it has been decided to look for "the set of data with high similarity to both data1 and data2" does it become possible to know that data4 belongs to that set and that data5 is strongly similar to data4. It is difficult to know this before that decision is made. In other words, computing in advance, for every possible combination of data items, the set of data with high similarity to that combination requires a large amount of computation and is not practical.
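The data1 to data5 example can be made concrete with a small sketch. The numbers below are hypothetical: FIG. 3 is not reproduced in this text, so the table was merely chosen to be consistent with the description (both per-item sets equal {data3, data4} at a threshold of 70, data4 far closer to data1 and data2 than data3, and data4 strongly similar to data5).

```python
NAMES = ["data1", "data2", "data3", "data4", "data5"]

# Hypothetical symmetric similarities in percent; NOT the values of FIG. 3.
S = [
    [100, 20, 72, 95, 68],
    [20, 100, 71, 96, 68],
    [72, 71, 100, 50, 30],
    [95, 96, 50, 100, 98],
    [68, 68, 30, 98, 100],
]


def high_set(x, threshold):
    """Names of the items (other than item x) with similarity >= threshold to x."""
    return {NAMES[y] for y in range(len(NAMES)) if y != x and S[x][y] >= threshold}


# Prior Art 2: intersect the two precomputed sets.
both = high_set(0, 70) & high_set(1, 70)

# What the intersection hides: how strongly each candidate matches both conditions.
mean_to_both = {NAMES[y]: (S[0][y] + S[1][y]) / 2 for y in (2, 3, 4)}

# The link that is only discoverable once data4 is known to be a strong match.
data4_to_data5 = S[3][4]
```

With these assumed values the intersection treats data3 (mean 71.5) and data4 (mean 95.5) as equals, and it has no way to take the 98% link between data4 and data5 into account.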

[0006] The object of the present invention is to solve the above problem. That is, the object is to obtain, easily and accurately, data with high similarity to all of a plurality of data items in a device for retrieving data to which similarities have been assigned.

[0007]

[Means for Solving the Problems] As shown in FIG. 1, the data retrieval device with similarity of the present invention has: a similarity-annotated data storage means a, which holds a set of data together with similarity data for each pair of data items; a similarity search expression display/input means b, which displays and accepts input of a similarity search expression including a condition data set that serves as the condition of the similarity search; a similarity search means c, which performs a similarity search using the condition data set to obtain a search data set; and a search result holding means d, which holds the search data set obtained by the similarity search means c. The similarity search means c obtains the search data set for which a value exceeds a predetermined threshold defining the limit of similarity, the value being obtained from the pairwise similarities held in the similarity-annotated data storage means a, the condition data set input through the similarity search expression display/input means b, the search data set held in the search result holding means d, and a predetermined formula for calculating the similarity between the condition data set and the data with similarity.

[0008] According to one aspect of the present invention, in the above configuration, the condition data set input through the similarity search expression display/input means includes a first condition data set for retrieving data with high similarity and a second condition data set for retrieving data with low similarity.

[0009]

[Operation] The present invention does not use logical operators when narrowing down by similarity. Instead, it performs a similarity calculation that takes similarities into account over a wide range, and can thereby obtain more accurately "the set of data with high (or low) similarity to all of a plurality of specified data items".

[0010]

[Embodiment] As shown in FIG. 2, the embodiment comprises a search expression display/input means 1, a similarity search expression display/input means 2, a similarity-annotated data storage means 3, a logical operation search means 4, a similarity search means 5, a logical operation search result display means 6, a similarity search result display means 7, and a search result identification storage means 8. Each component is described below.

[0011] The search expression display/input means 1 is a program module with a user interface that receives a search expression using logical operators from an input means (not shown) such as a keyboard or mouse, and displays what it has received on an output means such as a display.

[0012] The similarity search expression display/input means 2 is a program module with a user interface that receives a search expression for a similarity search from an input means such as a keyboard or mouse, and displays what it has received on an output means such as a display.

[0013] The similarity-annotated data storage means 3 is a program module that holds a group of data items, each of which has search keys and for which a similarity has been assigned to every pair of data items.

[0014] The logical operation search means 4 is a program module that receives a search expression from the search expression display/input means 1, receives already-retrieved data sets from the search result identification storage means 8 where necessary, searches the similarity-annotated data storage means 3 according to the search expression, and obtains a data set as the search result. It notifies the search result identification storage means 8 of the search result and receives an identifier for that result.

[0015] The similarity search means 5 is a program module that receives a similarity search expression from the similarity search expression display/input means 2, receives already-retrieved data sets from the search result identification storage means 8, performs a similarity calculation on the similarity data in the similarity-annotated data storage means 3 according to the similarity search expression, and obtains a data set as the search (calculation) result. It notifies the search result identification storage means 8 of the search result and receives an identifier for that result.

[0016] The logical operation search result display means 6 is a program module with a user interface that receives the data set obtained as the search result, together with its identifier, from the logical operation search means 4 and displays them on an output means such as a display.

[0017] The similarity search result display means 7 is a program module with a user interface that receives the data set obtained as the search result, together with its identifier, from the similarity search means 5 and displays them on an output means such as a display.

[0018] The search result identification storage means 8 is a program module that receives the data set obtained as a search result from the logical operation search means 4 or the similarity search means 5, assigns an identifier to that data set, and returns the identifier to whichever means sent the search result. It stores each received search result (data set) together with its identifier, and when it receives an identifier from the logical operation search means 4 or the similarity search means 5, it returns the corresponding search result (data set).
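The behaviour described for the search result identification storage means 8 can be sketched as a small store keyed by identifier. This is an assumption-laden illustration: the class name `SearchResultStore`, its method names, and the way identifiers are generated ("k" plus a running number, mirroring the identifiers k1 to k4 that appear in the screen examples) are hypothetical.

```python
class SearchResultStore:
    """Keeps each search result (a set of data IDs) under a generated identifier."""

    def __init__(self):
        self._results = {}

    def register(self, data_set):
        """Store a search result and return the identifier assigned to it."""
        identifier = f"k{len(self._results) + 1}"  # assumed numbering scheme
        self._results[identifier] = frozenset(data_set)
        return identifier

    def lookup(self, identifier):
        """Return the data set previously stored under `identifier`."""
        return self._results[identifier]


store = SearchResultStore()
k1 = store.register({"doc1", "doc2"})  # e.g. sent by the logical operation search
k2 = store.register({"doc3"})          # e.g. sent by the similarity search
```

Storing results as immutable sets means a later search that reuses an identifier always sees the result exactly as it was registered.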

[0019] FIG. 4 shows the processing flow of the logical search and the similarity search in the embodiment configured as above. For a logical search, the search expression display/input means 1 first receives a search expression from the input means. For a similarity search, the similarity search expression display/input means 2 receives a similarity search expression (step S1).

[0020] It is checked whether the received search expression or similarity search expression contains an identifier corresponding to an already-retrieved data set (step S2).

[0021] If step S2 finds that an identifier is contained, the logical operation search means 4 or the similarity search means 5 receives the data set corresponding to the identifier from the search result identification storage means 8 and expands it into its elements (step S3).

[0022] It is checked whether the received search expression is a similarity search expression (step S4).

[0023] If the received search expression is a similarity search expression, the similarity search means 5 performs the similarity calculation according to the expression and obtains a set of data as the search result (step S5). The details are described later with reference to FIG. 5.

[0024] If step S4 determines that the received search expression is not a similarity search expression, that is, that it is a logical search expression, the logical operation search means 4 performs the logical operations according to the expression and obtains a set of data as the search result (step S6).

[0025] The logical operation search means 4 or the similarity search means 5 sends the set of data obtained as the search result to the search result identification storage means 8 and receives the identifier corresponding to that data set (step S7).

[0026] The logical operation search result display means 6 or the similarity search result display means 7 displays the set of data obtained as the search result together with the corresponding identifier (step S8).
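The flow of steps S2 to S8 can be sketched as a single dispatch function. This is only one possible reading: the representation of an expression as a list of tokens, the `is_similarity` flag, the dictionary used as the result store, and the two stand-in search functions are all hypothetical and exist only so that the example runs.

```python
def run_search(tokens, is_similarity, results, logical_search, similarity_search):
    """Follow steps S2 to S7 and return (identifier, result set) for display (S8)."""
    # S2/S3: replace every token that names a stored result by that result's elements.
    expanded = [results.get(token, token) for token in tokens]
    # S4: choose the search means; S5 (similarity) or S6 (logical) runs the search.
    search = similarity_search if is_similarity else logical_search
    found = frozenset(search(expanded))
    # S7: register the result and obtain its identifier (assumed numbering scheme).
    identifier = f"k{len(results) + 1}"
    results[identifier] = found
    return identifier, found


def fake_logical(expanded):
    """Stand-in for the logical operation search means 4."""
    return {"doc9"}


def fake_similarity(expanded):
    """Stand-in for the similarity search means 5: echoes the expanded sets."""
    out = set()
    for part in expanded:
        if isinstance(part, frozenset):
            out |= part
    return out


results = {"k1": frozenset({"doc1", "doc2"})}
identifier, found = run_search(["k1"], True, results, fake_logical, fake_similarity)
```

Both kinds of search share the identifier expansion and the registration step, and differ only in which search means is invoked in between.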

[0027] Next, the similarity search of this embodiment is described in more detail.
[Information stored in the similarity-annotated data storage means 3] Assume that n data items D = {d1, d2, ..., dn} are stored in the similarity-annotated data storage means 3. Assume further that, for any data item dx, the data representing its similarity to every data item d1, d2, ..., dn, namely S(dx, d1), S(dx, d2), ..., S(dx, dn) (x = 1, 2, ..., n), is also stored. Here it is assumed that:
0 ≦ S(dx, di) ≦ 100 (x, i = 1, 2, ..., n)
S(dx, dx) = 100 (x = 1, 2, ..., n)
S(di, dj) = S(dj, di) (i, j = 1, 2, ..., n)
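The three conditions on S (range 0 to 100, self-similarity of 100, and symmetry) can be checked mechanically. The sketch below assumes the similarities are held as a square list of lists indexed from 0; that layout and the function name `is_valid_similarity_table` are assumptions, not something the text specifies.

```python
def is_valid_similarity_table(S):
    """True if S is square, 0 <= S[x][i] <= 100, S[x][x] == 100, and S is symmetric."""
    n = len(S)
    if any(len(row) != n for row in S):
        return False
    for x in range(n):
        if S[x][x] != 100:
            return False
        for i in range(n):
            if not 0 <= S[x][i] <= 100:
                return False
            if S[x][i] != S[i][x]:
                return False
    return True


# Made-up tables used only to exercise the check.
good = [[100, 40, 75], [40, 100, 10], [75, 10, 100]]
asymmetric = [[100, 40], [41, 100]]
bad_diagonal = [[90, 40], [40, 100]]
```

Because S is symmetric with a fixed diagonal, an implementation would only need to store the n(n-1)/2 values above the diagonal; the full matrix is used here for readability.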

[0028] [Notation of the similarity search expression] A similarity search expression is written H(A)・L(B). Here A and B are each a concatenation of identifiers already assigned by the search result identification storage means 8. The similarity search means 5 can therefore expand them as follows:
A → {a1, a2, ..., ai}, B → {b1, b2, ..., bj}
ax ∈ D (x = 1, 2, ..., i), by ∈ D (y = 1, 2, ..., j)
H(A)・L(B) specifies a search for data that has high similarity to the data a1, a2, ..., ai and low similarity to the data b1, b2, ..., bj.
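The expansion of A and B into their elements can be sketched as below. The text does not say how concatenated identifiers are written or parsed, so the sketch assumes they have already been split into a list; the stored results, the item names, and the function name `expand` are likewise hypothetical.

```python
def expand(identifiers, results):
    """Merge the stored result sets named by `identifiers`, dropping duplicates."""
    items = []
    for identifier in identifiers:
        for item in results[identifier]:
            if item not in items:  # an element appears once within A (or within B)
                items.append(item)
    return items


# Hypothetical stored results; lists keep the order deterministic for the example.
results = {
    "k1": ["d1", "d2"],
    "k2": ["d2", "d3"],
    "k3": ["d5"],
}

A = expand(["k1", "k2"], results)  # condition set for H(A): high similarity wanted
B = expand(["k3"], results)        # condition set for L(B): low similarity wanted
```

Removing duplicates inside each set keeps an item that appears in two stored results from being counted twice in the later averaging.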

[0029] [Search algorithm of the logical operation search means 4] The search algorithm used by the logical operation search means 4 is the same as a general data search algorithm that performs logical operations.

[0030] [Search (similarity calculation) algorithm of the similarity search means 5] FIG. 5 shows the search (similarity calculation) algorithm used by the similarity search means 5. D = {d1, d2, ..., dn} at the start of the figure is the whole of the data stored in the similarity-annotated data storage means 3. The data sets {a1, a2, ..., ai} and {b1, b2, ..., bj} are each assumed to contain no duplicate data. The same data item may, however, appear in both sets.

[0031] As shown in FIG. 5, in the search algorithm of the similarity search means 5, the similarity search means 5 first receives the similarity search expression H(A)・L(B) from the similarity search expression display/input means 2 (step S51).

[0032] The similarity search means 5 receives the data sets corresponding to the identifiers from the search result identification storage means 8 and expands them as follows (step S52):
A → {a1, a2, ..., ai}, B → {b1, b2, ..., bj}
ax ∈ D (x = 1, 2, ..., i), by ∈ D (y = 1, 2, ..., j)

[0033] The variables R and P, which hold search results, are set to the empty set (step S53).

[0034] For each dx ∈ D, the following are computed:
\[ M(d_x) = i + j \]
\[ V(d_x) = \frac{S(d_x, a_1) + S(d_x, a_2) + \cdots + S(d_x, a_i) + (100 - S(d_x, b_1)) + (100 - S(d_x, b_2)) + \cdots + (100 - S(d_x, b_j))}{M(d_x)} \]
If V(dx) is greater than the threshold T, dx is added to the sets R and P. This operation is performed for every element of D (step S54).
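The step S54 calculation can be sketched for a single data item as follows. The function name `initial_score`, the 0-based indexing, and the small table are hypothetical; the guard against empty condition sets is an addition, since the text does not discuss the case i + j = 0.

```python
def initial_score(S, x, A, B):
    """Return (V, M) for item x: the mean of S[x][a] over A and 100 - S[x][b] over B."""
    m = len(A) + len(B)  # M(dx) = i + j
    if m == 0:
        raise ValueError("the condition sets A and B are both empty")
    total = sum(S[x][a] for a in A) + sum(100 - S[x][b] for b in B)
    return total / m, m


# Made-up symmetric similarity table in percent.
S = [
    [100, 80, 90, 10],
    [80, 100, 60, 20],
    [90, 60, 100, 40],
    [10, 20, 40, 100],
]

# Item 2 scored against H({0, 1})・L({3}): (90 + 60 + (100 - 40)) / 3
v, m = initial_score(S, 2, [0, 1], [3])
```

A low similarity to an element of B contributes a large term (100 minus the similarity), so items that are unlike B and like A score highest.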

[0035] The elements of the logical product (intersection) of R and D are removed from the data set D, and the resulting set becomes the new D (step S55).

[0036] For each dx ∈ D, with P = {p1, p2, ..., pk}, the following value is computed:
\[ \frac{V(d_x) \times M(d_x) + \{ V(p_1) \times S(d_x, p_1) + V(p_2) \times S(d_x, p_2) + \cdots + V(p_k) \times S(d_x, p_k) \} / 100}{M(d_x) + k} \]
This value becomes the new V(dx), and M(dx) + k becomes the new M(dx). If V(dx) is greater than the threshold T, dx is added to the sets R and P. This operation is performed for every element of D (step S56).

[0037] It is determined whether P is the empty set (step S57). If P is not empty, P is set to the empty set (step S58) and processing returns to step S55. If P is empty, the data items that are elements of R are output as the search result (step S59).
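Steps S53 to S59 can be put together in the following sketch. It rests on one interpretive assumption that should be stated plainly: the text clears P in step S58 before returning to step S55, and the sketch takes the intent to be that the items newly admitted in one pass of step S56 become the P used by the next pass. The similarity table is hypothetical (FIG. 3 is not reproduced in this text), as are the function name and the 0-based indices, where index 0 stands for data1.

```python
def similarity_search(S, A, B, T):
    """Indices of the result set R for H(A)・L(B) with threshold T."""
    m0 = len(A) + len(B)
    if m0 == 0:
        raise ValueError("A and B must not both be empty")
    D = set(range(len(S)))
    V, M = {}, {}
    R, P = set(), []                                   # S53
    for x in sorted(D):                                # S54: initial scores
        V[x] = (sum(S[x][a] for a in A) + sum(100 - S[x][b] for b in B)) / m0
        M[x] = m0
        if V[x] > T:
            R.add(x)
            P.append(x)
    while P:                                           # S57: stop when P is empty
        D -= R                                         # S55: drop admitted items
        k = len(P)
        newly_admitted = []
        for x in sorted(D):                            # S56: fold in the items of P
            weighted = sum(V[p] * S[x][p] for p in P) / 100
            V[x] = (V[x] * M[x] + weighted) / (M[x] + k)
            M[x] += k
            if V[x] > T:
                R.add(x)
                newly_admitted.append(x)
        P = newly_admitted                             # assumed reading of S58
    return R


# Hypothetical symmetric similarities in percent for data1..data5.
S = [
    [100, 20, 72, 95, 68],
    [20, 100, 71, 96, 68],
    [72, 71, 100, 50, 30],
    [95, 96, 50, 100, 98],
    [68, 68, 30, 98, 100],
]

# H({data1, data2}) with no L part and an illustrative threshold of 75.
result = similarity_search(S, [0, 1], [], 75)
```

With these assumed values and threshold, data4 is admitted in step S54 and data5 is admitted in the first pass of step S56 through its strong similarity to data4, while data3 is not admitted. The outcome depends on the threshold chosen; the figures here are for illustration only.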

[0038] This algorithm obtains the search result set R by considering not only "data that has high pairwise similarity to the data a1, a2, ..., ai and low pairwise similarity to the data b1, b2, ..., bj" but also, recursively, data that has high pairwise similarity to such data. This makes it possible to obtain accurately the data with high similarity to all of a plurality of data items, which was the problem with the prior art.

[0039] FIGS. 6 to 10 show screen images for a concrete data example of the embodiment. In FIG. 6, the text input section 62 at the top of the logical search window 61 corresponds to the search expression display/input means 1 of FIG. 2, and the text input section 65 at the top of the similarity search window 64 corresponds to the similarity search expression display/input means 2. The two list items 63 in the logical search window 61 correspond to the logical operation search result display means 6, and the two list items 66 in the similarity search window 64 correspond to the similarity search result display means 7. FIG. 6 shows the screen in its initial state, before any search expression has been entered.

[0040] The operations shown in FIGS. 7 to 10 are described below. In FIG. 7, a search expression is entered in the text input section 62 at the top of the logical search window 61 and the search button 71 is clicked. The logical operation search means 4 then performs a search based on the specified search expression, and the result is displayed together with its identifier. In this example, the search is for data whose author is "Hiroshi Yamada" (OT = Hiroshi Yamada) and that was registered in 1993 (PD = 19930101:19931231). The data set obtained as the search result is given the identifier "k1".

[0041] FIG. 8, like FIG. 7, shows an example of a logical search. The search condition specifies data whose author is "Makoto Takahashi" and that was registered in 1993. The data set obtained as the search result is given the identifier "k2".

[0042] FIG. 9, like FIG. 7, shows an example of a logical search. The search condition specifies data whose author is "Yoshihiko Mori" and that was registered in 1993. The data set obtained as the search result is given the identifier "k3".

[0043] In FIG. 10, a similarity search condition is entered in the text input section 65 at the top of the similarity search window 64 and the search button 68 is clicked. The similarity search means 5 then performs a search based on the specified similarity search expression, and the result is displayed together with its identifier. In this example, the search is for data that has high similarity to the data belonging to the data sets designated by the identifiers k1 and k2 and low similarity to the data belonging to the data set designated by the identifier k3 (that is, documents that are highly similar to the documents written by "Hiroshi Yamada" and "Makoto Takahashi" and have low similarity to the documents written by "Yoshihiko Mori"). The data set obtained as the search result is given the identifier k4.

[0044] In both the logical search window 61 and the similarity search window 64, the identifier k4 can be used to narrow down the data further.

[0045] As described above, instead of using logical operators when narrowing down by similarity, this embodiment performs a similarity calculation that takes similarities into account over a wide range, and can thereby obtain more accurately "the set of data with high (or low) similarity to all of a plurality of specified data items". Furthermore, by offering the user both a means of narrowing down with ordinary logical operators and a means of performing the above similarity calculation, it realizes a device that allows fine-grained search specification and makes it possible to improve the efficiency of searching a large body of data.

[0046]

[Effect of the Invention] Because the present invention performs a similarity calculation that takes similarities into account over a wide range, instead of using logical operators when narrowing down by similarity, it can obtain more accurately "the set of data with high (or low) similarity to all of a plurality of specified data items".

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of the present invention.

FIG. 2 is a block diagram showing the configuration of an embodiment of the present invention.

FIG. 3 is a diagram for explaining the similarity search.

FIG. 4 is a diagram showing the processing flow of the logical search and the similarity search.

FIG. 5 is a diagram showing the search algorithm of the similarity search means 5.

FIG. 6 is a diagram (part 1) showing a screen image for a concrete data example of the embodiment.

FIG. 7 is a diagram (part 2) showing a screen image for a concrete data example of the embodiment.

FIG. 8 is a diagram (part 3) showing a screen image for a concrete data example of the embodiment.

FIG. 9 is a diagram (part 4) showing a screen image for a concrete data example of the embodiment.

FIG. 10 is a diagram (part 5) showing a screen image for a concrete data example of the embodiment.

[Explanation of symbols]

a: similarity-annotated data storage means; b: similarity search expression display/input means; c: similarity search means; d: search result holding means.

Claims

[Claims]

1. A data retrieval device with similarity, comprising: a similarity-annotated data storage means that holds a set of data together with similarity data for each pair of data items; a similarity search expression display/input means that displays and accepts input of a similarity search expression including a condition data set serving as the condition of a similarity search; a similarity search means that performs a similarity search using the condition data set to obtain a search data set; and a search result holding means that holds the search data set obtained by the similarity search means, wherein the similarity search means obtains the search data set for which a value exceeds a predetermined threshold defining the limit of similarity, the value being obtained from the pairwise similarities held in the similarity-annotated data storage means, the condition data set input through the similarity search expression display/input means, the search data set held in the search result holding means, and a predetermined formula for calculating the similarity between the condition data set and the data with similarity.

2. The data retrieval device with similarity according to claim 1, wherein the condition data set input through the similarity search expression display/input means includes a first condition data set for retrieving data with high similarity and a second condition data set for retrieving data with low similarity.