JPH10105559A

JPH10105559A - Information processor

Info

Publication number: JPH10105559A
Application number: JP8255806A
Authority: JP
Inventors: Yoshinori Sato; 嘉則佐藤; Akira Maeda; 章前田; Hideyuki Maki; 牧　　秀行; Masafumi Okada; 政文岡田; Katsumi Omori; 勝美大森
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1996-09-27
Filing date: 1996-09-27
Publication date: 1998-04-24
Anticipated expiration: 2016-09-27
Also published as: JP3449129B2

Abstract

PROBLEM TO BE SOLVED: In an information processor that retrieves similar data and makes predictions from similar data, to improve retrieval and prediction accuracy by evaluating, storing and reusing retrieval and prediction results, and to let plural users share retrieval skill by sharing the stored evaluation results. SOLUTION: A similar data retriever 105 retrieves similar data; a similarity degree evaluator 107 evaluates the similarity degree of the retrieved data and stores it in a similarity degree evaluation value storage part 112. Using the retrieved result and the past evaluation values stored in the similarity degree evaluation value storage part 112, an executed result output device 108 outputs the certainty of the latest retrieved result.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a method of retrieving and processing arbitrary similar data from a collection of data, expressed as numerical values or symbols and stored in an information storage device such as a database device, and providing the result to a user. It concerns general data management and prediction for natural phenomena, social phenomena such as demographics, economic phenomena such as stock price fluctuations, and chemical and physical phenomena such as those in industrial plants.

[0002]

[Prior art] With recent advances in computer technology and the build-out of network infrastructure such as LANs and WANs, efforts to accumulate and use larger and more varied data than before have intensified. Databases are now used not only to store data for routine operations but also for non-routine work such as analysis.

[0003] Conventionally, data has been aggregated and processed using the search functions provided by a DBMS (Database Management System). At present the most widespread DBMSs are those based on an RDB (Relational Database), in which functions are provided on the basis of a language called SQL. SQL provides various aggregation functions as operations on multiple tables of data. Details can be found in "Introduction to Database Systems" by C. J. Date, translated by Yuzuru Fujiwara, Maruzen Co., Ltd. Hereinafter, the technique described for this RDB is referred to as prior art 1.

[0004] One purpose of analyzing aggregated data is the prediction of physical, social and other phenomena. Prediction methods include multiple regression analysis and autoregressive analysis, as well as a method called MBR (Memory-Based Reasoning), which finds similar data in a large accumulated body of data and computes and outputs a prediction from it. MBR is outlined in Craig Stanfill and David Waltz, "Toward Memory-Based Reasoning", Communications of the ACM, Dec 1986, Vol. 29, Number 12, pp. 1213-1228. Hereinafter, this MBR technique is referred to as prior art 2.

[0005] Prior art 2 finds, among known data, data similar to the unknown data and uses it directly for prediction. Compared with classical prediction methods or neural networks, the prediction model is easier to operate; and because tabular data can be used directly for prediction, the method has high affinity with RDBs and can readily use data accumulated in existing database environments. An example applying prior art 2 to weather forecasting is Takao Mohri and Hidehiko Tanaka, "Weather Forecasting by Memory-Based Reasoning", Journal of the Japanese Society for Artificial Intelligence, 1995, Vol.19, No.5, pp.798-805.

[0006]

[Problems to be solved by the invention] Prior art 1 provides only functions for manipulating multiple tables and for simple processing and retrieval of the data in them. For routine work with a clear search key, such as totaling monthly sales per branch or per-product sales by region, a database administrator or analysis specialist can prepare a program in advance. However, when the user wants data similar to data already at hand but the search key is unclear, the user must repeat the search while narrowing down the search key by hand, and efficient retrieval requires experience in techniques such as how to narrow the key.
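As an illustration of the routine, clear-key aggregation that prior art 1 handles well, the following is a minimal sketch using Python's standard `sqlite3` module. The table name `sales`, its columns and the sample rows are hypothetical and do not come from the patent.

```python
import sqlite3

# In-memory database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, branch TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("1996-08", "Tokyo", 120),
        ("1996-08", "Tokyo", 80),
        ("1996-08", "Osaka", 50),
        ("1996-09", "Tokyo", 70),
    ],
)

# Monthly sales per branch: the search key (month, branch) is known in advance,
# so the query can be prepared once and reused.
monthly_by_branch = conn.execute(
    "SELECT month, branch, SUM(amount) FROM sales "
    "GROUP BY month, branch ORDER BY month, branch"
).fetchall()
print(monthly_by_branch)
```

A query of this kind has no notion of "similar" rows, which is the gap the paragraph above describes.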

[0007] Prior art 2 provides a function for automatically finding similar data and a function for predicting unknown data from it, but highly accurate prediction requires a very large amount of data. In real problems, however, the underlying structure of the prediction target may change over time, so that data which has become old can no longer be used.

[0008]

[Means for solving the problems] The first invention of the present application is characterized by comprising: means for inputting one or more records each consisting of a plurality of fields; means for storing the input data; means for designating one or more similarity determination fields; means for designating the inter-data distance definition used to calculate similarity; means for retrieving, from the stored data, data similar to the input data with respect to the similarity determination fields and in accordance with the designated distance definition; means for outputting the retrieved similar data; means for designating a similarity evaluation field; means for evaluating with respect to the similarity evaluation field; means for storing the evaluation results; means for controlling the output means using the stored evaluation results; and means for displaying the stored similarity evaluation results together with the retrieved similar data.

[0009] With this configuration, evaluating the similarity of the data found and storing the results makes it possible to judge how far the data can be trusted, so retrieval accuracy improves with each use. In addition, sharing the accumulated similarity information makes user experience in searching unnecessary.

[0010] The second invention of the present application is characterized by comprising: means for inputting one or more records each consisting of a plurality of fields; means for storing the input data; means for designating one or more similarity determination fields; means for designating the inter-data distance definition used to calculate similarity; means for retrieving, from the stored data, data similar to the input data with respect to the similarity determination fields and in accordance with the designated distance definition; means for outputting the retrieved similar data; means for designating a similarity evaluation field; means for evaluating with respect to the similarity evaluation field; means for storing the evaluation results; the above output means using the stored evaluation results; and means for inferring, when the value of the similarity evaluation field of an input record is missing, the missing field value using the retrieved similar data and the stored similarity evaluation results.

[0011] In this second invention, when unknown data is predicted from similar data as in prior art 2, the prediction result can be corrected using the similarity of the data obtained on an ongoing basis, so highly accurate prediction is possible even when old data must be used in order to secure a large amount of data.

[0012]

[Embodiments of the invention] A first embodiment of the present invention is described below with reference to the drawings. FIG. 1 shows the configuration and data flow of a case-based information processing apparatus 100. FIG. 2 shows a case-based information processing apparatus 200 composed of a server device and client devices connected via a network. Reference numerals 201 to 203 denote client processing devices, and 204 denotes a server processing device. There may be two or more server processing devices, but one is shown here for convenience of explanation.

[0013] FIG. 3 shows the details of the case-based information processing apparatus 200. Reference numeral 307 denotes the part constituting a client processing device, and 308 the part constituting the server processing device. Since every client processing device contains the same parts, the details of only one are shown for convenience of explanation.

[0014] Here, the data input device 102, similarity determination field input device 103, similarity evaluation field input device 104, similar data search device 105, similarity determination device 106, similarity evaluation device 107, execution result output device 108, search data output device 109 and similarity output device 110 are the same as the corresponding devices in FIG. 1. Reference numeral 301 denotes a connection identifier input device, 302 a client-side transmitting device, 303 a server-side receiving device, 305 a server-side transmitting device, and 306 a client-side receiving device.

[0015] Since the apparatus 100 and the apparatus 300 retrieve similar data on the same principle, the following description follows the apparatus 300.

[0016] FIG. 4 shows an example of the data stored in the device 308. Data 401 is an example of data from a beverage manufacturing plant; each row gives the date of production, the day of the week, the forecast minimum temperature, the forecast maximum temperature, the minimum humidity, the maximum humidity, the forecast atmospheric pressure, the previous day's production result and that day's production result.

[0017] FIG. 5 shows an example of the data given to the device 307 of the apparatus 300. Data 501 is the input data for which similar cases are to be retrieved, data 502 designates the similarity determination fields, and data 503 designates the similarity evaluation field. The similarity determination fields are the fields used to find similar data, and the similarity evaluation field is used to evaluate how good the similar data found is. In other words, data 400 is the specification used when the user wants to find data whose actual supply quantity is nearly the same and whose other items are similar.

[0018] FIG. 6 shows an example of the distance definitions and the distance definition management data stored in the device 308. Data 601 is the distance definition management table stored in the device 309, and data 602 represents the various distance definitions stored in the device 111. Data 601 describes which user uses which distance definition. Data 602 gives the importance of each item when the distance between data is measured. The importance is a numerical value between 0 and 1; the larger the value, the more important the item, and an importance of 0 means that the item is not considered at all in the similar-case search.
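The relationship between the management table (data 601) and the distance definitions (data 602) can be sketched as two lookups. This is only an illustrative sketch: the user identifiers, definition names, field names and weight values below are hypothetical and are not taken from FIG. 6.

```python
# Hypothetical stand-in for data 601: user identifier -> name of a distance definition.
DISTANCE_DEFINITION_TABLE = {
    "user_a": "def_1",
    "user_b": "def_1",  # two users sharing one definition
    "user_c": "def_2",
}

# Hypothetical stand-in for data 602: per-field importance between 0 and 1.
DISTANCE_DEFINITIONS = {
    "def_1": {"min_temp": 0.8, "max_temp": 1.0, "humidity": 0.3, "weekday": 0.0},
    "def_2": {"min_temp": 0.2, "max_temp": 0.2, "humidity": 1.0, "weekday": 0.5},
}


def distance_definition_for(user_id):
    """Return the importance weights that apply to the given user."""
    weights = DISTANCE_DEFINITIONS[DISTANCE_DEFINITION_TABLE[user_id]]
    for field, importance in weights.items():
        if not 0.0 <= importance <= 1.0:
            raise ValueError(f"importance of {field!r} must lie between 0 and 1")
    return weights


def active_fields(weights):
    """Fields with importance 0 are ignored in the similar-case search."""
    return [field for field, importance in weights.items() if importance > 0.0]


print(active_fields(distance_definition_for("user_a")))
```

Because the table maps several users to the same definition object, any refinement of that definition is visible to all of them, which is one way to read the sharing described for this embodiment.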

[0019] Data 700 in FIG. 7 shows the similarity evaluation results stored in the device 112. It shows, for each field value of each record stored in the device 113, how similar that value proved to be in past searches; the smaller the field value, the higher the similarity. For example, field value 702 is the similarity of record data 701 as evaluated with respect to the production result, and comparison with record data 703 shows that record data 701 was the more similar in past searches.

[0020] Data 800 shown in FIG. 8 represents the data that the device 108 outputs or displays. Field value 801 is the retrieved similar data itself, field value 802 is the similarity calculated from the similarity determination fields, and field value 803 is the similarity evaluation value calculated from the similarity evaluation field, which indicates how well past similarity determinations worked. That is, field value 802 expresses the similarity to the input data, and field value 803 expresses the reliability of that similarity.

[0021] An outline of the present embodiment is given below with reference to FIG. 3. First, the administrator of the device 308 stores data in the storage unit 113 in advance. The administrator may be a database administrator or the user. At search time, the input data 501, the similarity determination field designation data 502 and the similarity evaluation field designation data 503 are given to the devices 102, 103 and 104, respectively. The device 301 is given an identifier such as that of the user, of the client device itself, or of the application. For convenience of explanation, a user identifier is assumed here.

[0022] The device 302 transmits the data 501, 502 and 503 and the user identifier to the device 303. The device 303 passes the user identifier from the received data to the device 304 so that the distance definition actually to be used is determined. Next, the device 105 searches for similar data on the basis of data 501, data 502, data 503 and the distance definition.

[0023] The actual search is performed by the device 106. In the similar data search, similarity is calculated from how much the similarity determination field values of the record under consideration differ from those of the input data. The importance of each field is given by the distance definition. The search results are evaluated by the device 107, and data 700 stored in the storage unit 113 is updated on the basis of the evaluation.

[0024] The search results of the device 105, namely data 801, data 802 and data 803, are sent by the device 305 to the device 306. Note that data 803 is the past evaluation value from before the update. The device 306 passes the received data to the device 108, which outputs data 801 through its internal device 109 and data 802 and 803 through its internal device 110.

[0025] The embodiment is thus characterized in that providing the similarity evaluation device 117 and the similarity evaluation value storage unit 112 and accumulating past search results improves retrieval accuracy with each use.

[0026] Furthermore, because the devices 301, 304 and 309 manage a distance definition for each user, distance definitions can be shared among multiple users. A further feature of this embodiment is that the gains in retrieval accuracy can therefore be shared in the same way.

[0027] Next, the processing performed when data 500 is input is described in detail with reference to the drawings. FIG. 9 shows the processing flow of this embodiment. Data 501 is input in process 901, and data 502 and data 503 are input in process 902. Processes 901 and 902 are performed by the client processing device 307. Process 907 transmits the data received by the device 307, together with the identifier held by the client processing device, to the server processing device 308.

[0028] Processes 903, 904 and 905 are performed by the device 308. Process 903 refers to data 601 and determines which distance definition to use. Process 904 then carries out the actual search. FIG. 10 shows process 904 in detail.

[0029] First, process 1001 clears the temporary area in which similar data is held. The size of the temporary area depends on how many similar records are to be found. The maximum number of similar records may be given by the user or fixed by the administrator of the server processing device. For convenience of explanation, it is assumed here that the administrator of the server processing device fixes the maximum number in advance.

[0030] Process 1002 reads one record (one row) from data 401 stored in the device 113. Process 1003 determines the similarity between the input data and the record read in process 1002. For example, let X and Y be the similarity determination field values of the input data and of the record being searched, respectively, and let w be the importance of that similarity determination field; the distance d for a single similarity determination field is then d = w · (X - Y). The distance for the data as a whole is the sum D of the distances of the individual fields, the fields used being those designated as similarity determination fields in data 502. The similarity determination value between two records is W = 1/(D · D + 1). That is, the larger the similarity determination value, the more alike the two records being compared, and two identical records have a similarity determination value of 1.
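The distance and similarity determination value defined in this paragraph can be written out directly. The following is a minimal sketch that follows the formulas as stated, d = w · (X - Y) per field, D as the sum over the designated fields, and W = 1/(D · D + 1); the field names and numbers in the example are hypothetical.

```python
def total_distance(query, record, weights, fields):
    """D: sum over the designated fields of w * (X - Y), as the text states it."""
    return sum(weights[f] * (query[f] - record[f]) for f in fields)


def similarity(query, record, weights, fields):
    """W = 1 / (D * D + 1); equals 1.0 when the two records are identical."""
    d = total_distance(query, record, weights, fields)
    return 1.0 / (d * d + 1.0)


# Hypothetical example: two similarity determination fields.
weights = {"max_temp": 1.0, "humidity": 0.5}
fields = ["max_temp", "humidity"]
query = {"max_temp": 30, "humidity": 60}
stored = {"max_temp": 28, "humidity": 60}

print(similarity(query, query, weights, fields))
print(similarity(query, stored, weights, fields))
```

Note that the per-field differences are signed in the formula as written, so differences of opposite sign in different fields can cancel in D. Taking an absolute value or square per field would avoid this, but that would be a departure from the text and is not assumed here.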

[0031] Next, process 1004 compares the similarity of the record currently being examined with that of the similar data already retrieved and held in the similar data storage area. If the record currently being examined is closer to the input data than the least similar record among those already retrieved, process 1005 stores the record being examined in the storage area and deletes the previously stored record with which it was compared.

[0032] Process 1006 repeats the above until every record in data 401 has been compared.
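Processes 1001 to 1006 amount to a single scan that keeps a bounded set of best candidates. The following is a minimal sketch under that reading; the function name `find_similar`, the parameter `max_results` (the maximum number of similar records) and the sample records are hypothetical.

```python
def find_similar(query, records, weights, fields, max_results):
    """Scan every stored record and keep the max_results most similar ones."""
    if max_results <= 0:
        return []
    kept = []  # temporary area (process 1001): list of (similarity, record)
    for record in records:  # process 1002: read one record at a time
        # Process 1003: similarity determination value W = 1 / (D * D + 1).
        d = sum(weights[f] * (query[f] - record[f]) for f in fields)
        w = 1.0 / (d * d + 1.0)
        if len(kept) < max_results:
            kept.append((w, record))
            continue
        # Process 1004: compare with the least similar record kept so far.
        worst = min(range(len(kept)), key=lambda k: kept[k][0])
        if w > kept[worst][0]:
            kept[worst] = (w, record)  # process 1005: replace it
    # Process 1006 corresponds to the loop running over all records.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept


records = [{"temp": t} for t in (10, 20, 29, 31, 40)]
result = find_similar({"temp": 30}, records, {"temp": 1.0}, ["temp"], max_results=2)
print([(round(w, 3), r["temp"]) for w, r in result])
```

The sketch fills the temporary area before it starts replacing entries, a detail the text leaves open. For large maximum counts a heap would avoid the linear search for the least similar entry.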

[0033] Process 1007 evaluates the similarity (similarity determination value) of the similar data found, using the similarity evaluation field designated by data 503. Data judged in this evaluation to have low similarity can be handled in several ways: (1) it is not adopted as similar data; (2) the similarity evaluation value is sent to the client device 307, and the device 307 controls the output processing; or (3) the device 308 displays all the similar data together with the evaluation values, and the user decides what to adopt or reject. For convenience of explanation, it is assumed here that all the similar data and evaluation values are output.

[0034] For example, when similarity is evaluated in accordance with data 503, let Z be the production result value of the input data, Zi the production result value of similar data i, Wi its similarity, and Ei the evaluation value of the data. First, the following (Equation 1) is assumed to hold.

[0035]

Z = Σi (Zi · Wi) / Σi Wi + Σi Wi · Ei ... (Equation 1)

where i is the identification number of a similar record. As for the terms, the first, standing for the production result value of the input data, is the weighted average of the similar data; the second is a correction term; and the third denotes an error term. Process 1007 refers to data 700 and calculates the past evaluation value Ei of the similar data retrieved. For example, when the similarity evaluation field is the production result and the record numbered 2 is retrieved as similar data, process 1007 and data 700 yield a similarity evaluation of 10 × Wi. The similar data, similarity determination values and similarity evaluation values are transmitted to the client processing device 307 by process 908.
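The right-hand side of (Equation 1) can be evaluated as follows. This is a minimal sketch covering only the two terms that appear in the equation, the similarity-weighted average and the correction term; the function name, the tuple layout of `neighbours` and all numbers are hypothetical.

```python
def equation_1(neighbours, evaluations):
    """Right-hand side of Equation 1.

    neighbours:  list of (record_id, Zi, Wi) for the retrieved similar records
    evaluations: mapping record_id -> Ei (records without an entry count as 0)
    """
    w_sum = sum(w for _, _, w in neighbours)
    weighted_average = sum(z * w for _, z, w in neighbours) / w_sum
    correction = sum(w * evaluations.get(i, 0.0) for i, _, w in neighbours)
    return weighted_average + correction


# Made-up example: two equally similar records.
neighbours = [(1, 3600.0, 0.5), (2, 3800.0, 0.5)]
print(equation_1(neighbours, {}))            # weighted average only
print(equation_1(neighbours, {1: 10.0}))     # with a correction from record 1
```

With no stored evaluation values the result is the plain weighted average; each stored Ei then shifts the result in proportion to the similarity Wi of its record.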

[0036] Before the data is transmitted, process 1008 updates the similarity evaluation values Ei on the basis of the current search result. In actual search results, the left and right sides of (Equation 1) do not always agree. Therefore, the similarity determination values Wij of the most recent n searches are retained (j = 0, 1, ..., n; i is the identification number of a similar record found in the j-th search), and, to reduce the difference between the two sides, Ei is updated so as to reduce the squared error ε between the left and right sides of (Equation 1), defined by the following (Equation 2).

[0037]

ε = {Σj (Z - Σi (Zi · Wij) / Σi Wij - Σi Wij · Ei)^2} / 2 ... (Equation 2)

where a^b denotes a raised to the power b. From (Equation 2), a local minimum of ε can be obtained by following (Equation 3) below.

[0038]

∂ε/∂Ei = (Z - Zi · Wij - Wij · Ei) · Wi ... (Equation 3)

When this embodiment is used for the first time, the initial values of Ei are set to small random values.
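One way to read the update of Ei is as a gradient-descent step on the squared error of (Equation 2). The sketch below makes several assumptions that are not in the text: it differentiates (Equation 2) as written, which gives a gradient of -Σj rj · Wij with rj the bracketed residual of search j, rather than using the abbreviated form printed as (Equation 3); it lets each retained search carry its own true value Z; and the learning rate is a hypothetical parameter.

```python
def residual(z, neighbours, evaluations):
    """Bracketed term of Equation 2 for one search.

    neighbours: list of (record_id, Zi, Wij) found in that search.
    """
    w_sum = sum(w for _, _, w in neighbours)
    weighted_average = sum(zi * w for _, zi, w in neighbours) / w_sum
    correction = sum(w * evaluations[i] for i, _, w in neighbours)
    return z - weighted_average - correction


def squared_error(history, evaluations):
    """Equation 2: half the sum of squared residuals over the retained searches."""
    return 0.5 * sum(residual(z, n, evaluations) ** 2 for z, n in history)


def update_evaluations(history, evaluations, learning_rate=0.1):
    """One gradient-descent step on Equation 2 (assumed update rule)."""
    gradient = {}
    for z, neighbours in history:
        r = residual(z, neighbours, evaluations)
        for i, _, w in neighbours:
            gradient[i] = gradient.get(i, 0.0) - r * w
    for i, g in gradient.items():
        evaluations[i] -= learning_rate * g
    return evaluations


# Made-up history of one search: true value 100, two similar records.
history = [(100.0, [(1, 90.0, 1.0), (2, 110.0, 0.5)])]
evaluations = {1: 0.0, 2: 0.0}
before = squared_error(history, evaluations)
update_evaluations(history, evaluations)
after = squared_error(history, evaluations)
print(before, after)
```

In this reading, a record with a larger Wij receives a larger share of the correction, which matches the later remark that more similar records have their evaluation values revised by a larger amount.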

[0039] When the correction term is revised in this way, the evaluation value can be revised by a large amount for records that are similar with respect to the determination field values and by a small amount for those that are not, so (Equation 1) can be corrected properly even when a given record is referenced as similar data for several different input records.

[0040] The quantity Wi · Ei in the correction term represents the evaluation value of similar data i: the smaller Wi · Ei is, the higher the similarity between the data with respect to the similarity evaluation field, and the larger it is, the lower the similarity.

[0041] In process 908, the similar data, similarity determination values and similarity evaluation values transmitted by the device 307 are passed to the device 308. Processes 906 and 908 are processes in the device 308. Process 906 outputs the information shown in data 800.

[0042] Thus, in this embodiment, using the results of past searches improves the accuracy of the similarity search each time a search is performed.

[0043] Next, a second embodiment of the present invention is described. When the apparatus 1100 receives input data containing an unknown field value, it predicts the unknown field value at the same time as it retrieves similar data. FIG. 11 shows the configuration and data flow of the case-based information processing apparatus 1100. Here, 1105 denotes a client processing device, 1106 a server processing device, 1103 a prediction result output device, and 1104 a similarity evaluation field prediction device. Reference numeral 1101 is the same as the devices 101, 102, 103 and 104 in FIG. 3, and 302, 303, 305 and 306 are the same as in FIG. 3. Reference numeral 1103 is the same as the devices 304, 309, 105, 106, 107, 111, 112 and 113 in FIG. 3.

[0044] FIG. 12 is an example of input data containing a missing value, as given to the device 1101; the production result value is missing. Data 502 and 503 are also given to the device 1101 as input. These data may all be given at the same time. Data 502 designates the fields used to determine the similarity of data, and data 503 is the field of the data that contains the missing value.

[0045] FIG. 13 is an example of the output of the apparatus 1100, in which the missing production result value is predicted to be 3700. Data 1400 in FIG. 14 is the data input, once the true value has been obtained, in order to evaluate the prediction made by the apparatus 1100; it is given to the device 1101. In the second embodiment, the use of the device 1103 improves prediction accuracy with each use.

[0046] The processing of apparatus 1100 is described below, following process 1500 in FIG. 15. In process 1501, apparatus 1100 obtains data 1200 through device 1101. Next, in process 1502, data 502 and 503 are obtained. In process 1503, data 1200, 502, and 503 are transmitted to device 1106. Processes 1504, 1505, and 1506 are the same as processes 903 and 904 and obtain the similar data, similarity determination values, and similarity evaluation values for data 1200. However, because the similarity evaluation field is unknown, data 700 stored in device 112 is not updated. Process 1507 predicts the missing field value from the similar data using (Equation 1). In process 1508, the predicted value is transmitted to device 1105, where the transmitted predicted value, similar data, similarity determination values, and similarity evaluation values are received. Process 1509 outputs data 1300 to the user. Data 1301 represents the similar data, and data 1302 represents the predicted value for the missing value of the input data.
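The retrieval and prediction steps of process 1500 can be sketched as follows. This is an illustration under stated assumptions, not the method of the specification: (Equation 1) is not reproduced in this passage, so the sketch substitutes an evaluation-weighted average of the retrieved records; the Euclidean distance, the parameter `k`, and the function names are likewise hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Record = Dict[str, Optional[float]]


def distance(a: Record, b: Record, fields: Sequence[str]) -> float:
    """Euclidean distance over the similarity fields (an assumed distance definition)."""
    return math.sqrt(sum((a[f] - b[f]) ** 2 for f in fields))


def predict_missing(
    record: Record,
    stored: List[Record],
    similarity_fields: Sequence[str],
    missing_field: str,
    evaluations: Dict[int, float],
    k: int = 3,
) -> Tuple[float, List[int]]:
    """Return (predicted value, indices of the k most similar stored records)."""
    # Rank stored records by distance to the input record.
    ranked = sorted(
        range(len(stored)),
        key=lambda i: distance(record, stored[i], similarity_fields),
    )
    nearest = ranked[:k]
    # Weight each retrieved record by its stored evaluation value.
    # Records with no evaluation yet get a neutral weight of 1.0 (assumption).
    weights = [evaluations.get(i, 1.0) for i in nearest]
    total = sum(weights)
    if total == 0:
        raise ValueError("all retrieved records have zero weight")
    prediction = sum(
        w * stored[i][missing_field] for w, i in zip(weights, nearest)
    ) / total
    return prediction, nearest
```

Note that the function only reads `evaluations`. This matches the text: during prediction the similarity evaluation field is unknown, so the stored evaluation data is not updated.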

[0047] Next, process 1600, which is performed when the true value for a prediction is obtained later, is described with reference to FIG. 16. Apparatus 1100 receives data 1400, 502, and 503 in processes 1601 and 1602. The fields specified by data 502 and 503 are the same as those used at prediction time. Processes 1603, 1604, 1605, and 1606 are the same as processes 1503, 1504, 1505, and 1506 and obtain the similar data, similarity determination values, and similarity evaluation values for data 1400. Unlike process 1500, however, there is no missing-value field, so data 700 stored in device 112 is updated by processes 1606 and 1607 according to the specified similarity evaluation field. Processes 1606 and 1607 are the same as processes 1007 and 1008.
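The feedback step of process 1600 can be sketched as an update of per-record evaluation values once the true value is known. The update rule used in processes 1007 and 1008 is not given in this passage, so everything below is an assumption for illustration: the score `1 / (1 + |stored value - true value|)`, the running-mean update, and all names are hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]


def update_evaluations(
    evaluations: Dict[int, float],
    counts: Dict[int, int],
    retrieved: List[int],
    stored: List[Record],
    evaluation_field: str,
    true_value: float,
) -> None:
    """Update the evaluation value of each retrieved record in place."""
    for i in retrieved:
        # Score is 1.0 when the stored record agrees exactly with the true
        # value and falls toward 0 as the disagreement grows (assumed rule).
        score = 1.0 / (1.0 + abs(stored[i][evaluation_field] - true_value))
        n = counts.get(i, 0)
        previous = evaluations.get(i, 0.0)
        # Running mean over every time this record has been evaluated.
        evaluations[i] = (previous * n + score) / (n + 1)
        counts[i] = n + 1
```

If `evaluations` and `counts` are held on the server side, every client that later retrieves the same records sees the updated values, which is the sharing effect the text describes.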

[0048] In this way, through process 1600, subsequent predictions that use the same similar data yield predicted values closer to the true value. Moreover, because multiple users share the distance definitions and similar-data evaluation values through the server processing device, the improvement in prediction accuracy is shared as well.

[0049]

[Effects of the Invention] As described above, by providing a data similarity evaluation device and an evaluation value storage device, the present invention improves retrieval accuracy each time it is used and allows that improvement to be shared among multiple users. In addition, providing a similarity evaluation field prediction device makes it possible to predict a missing similarity evaluation field from similar data. When the true value is obtained later, prediction accuracy is improved, and the improvement can be shared among multiple users.

[Brief description of the drawings]

FIG. 1 is an overall view of a case-based information processing apparatus.

FIG. 2 is a hardware configuration diagram of a client/server processing device using the present invention.

FIG. 3 is an overall view of an embodiment using a client/server processing device.

FIG. 4 is a diagram showing an example of input data.

FIG. 5 is a diagram showing the format in which a distance definition is stored.

FIG. 6 is a diagram showing a distance definition management table.

FIG. 7 is a diagram showing an example of the format in which similarity determination results are accumulated.

FIG. 8 is a diagram showing an example of the output data (execution result) of a similar-case search.

FIG. 9 is a flowchart showing processing in apparatus 300.

FIG. 10 is a flowchart showing processing in device 105.

FIG. 11 is an overall view of an embodiment of the present invention.

FIG. 12 is a diagram showing an example of input data given to apparatus 1100.

FIG. 13 is a diagram showing an example of output data of apparatus 1100.

FIG. 14 is a diagram showing an example of true-value input data.

FIG. 15 is a flowchart showing the prediction method of the present invention.

FIG. 16 is a flowchart showing the method of searching for similar data in the present invention.

Continuation of front page
(72) Inventor: Masafumi Okada, Omika Plant, Hitachi, Ltd., 5-2-1 Omika-cho, Hitachi City, Ibaraki Prefecture
(72) Inventor: Katsumi Omori, Omika Plant, Hitachi, Ltd., 5-2-1 Omika-cho, Hitachi City, Ibaraki Prefecture

Claims

1. An information processing apparatus for searching stored data for desired data, comprising: input means for inputting at least one record consisting of a plurality of fields; determination field specifying means for specifying at least one similarity determination field; distance definition specifying means for specifying an inter-data distance definition for calculating similarity; search means for searching the stored data, with respect to the similarity determination field, for data similar to the input data according to the specified distance definition; evaluation field designating means for designating a similarity evaluation field; evaluation means for evaluating the similarity evaluation field; and means for controlling the search means using the evaluation result of the evaluation means.

2. The information processing apparatus according to claim 1, further comprising means for inferring the missing field value, when the value of the similarity evaluation field of the input record is missing, using the retrieved similar data and the evaluation result.

3. The information processing apparatus according to claim 1, further comprising storage means for storing the evaluation result of the evaluation means, wherein the evaluation result stored in the storage means is changed in response to a change in the evaluation result.

4. The information processing apparatus according to claim 1, further comprising means for displaying the accumulated evaluation results together with the searched similar data.

5. An information processing apparatus in a data analysis apparatus comprising a client processing device, one or more server processing devices that manage a plurality of records each including a plurality of items, and a network connecting these processing devices, the information processing apparatus comprising: first transmission means by which the client processing device transmits one or more records to be analyzed and an instruction specifying a data analysis method to any one or more of the server processing devices; means by which the server processing device, in accordance with the transmitted instruction, searches the data stored in each database for data similar to the transmitted data; means by which the server processing device evaluates, according to the analysis method, the degree of similarity of the retrieved similar data; means by which the server processing device accumulates the evaluation result; means for controlling the data search means; second transmission means for transmitting the search result to the client processing device; and means by which the client processing device outputs the transmitted analysis result.

6. The information processing apparatus according to claim 5, further comprising: means for displaying the stored similarity evaluation results together with the retrieved similar data.

7. A storage medium storing a program for retrieving similar data, based on the similarity of field values, from a set of one or more records each consisting of one or more fields, wherein, when one or more similarity determination fields, one or more similarity evaluation fields, and an inter-data distance definition for calculating the similarity are specified, the program searches the stored data for data similar to the input data according to the specified distance definition, outputs the retrieved data, and evaluates the similarity between the input data and the retrieved data with respect to the similarity evaluation field.

8. The storage medium according to claim 7, wherein a program having a procedure for displaying the similarity evaluation result together with the searched similar data is stored.