JPH11161670A

JPH11161670A - Method, device, and system for information filtering

Info

Publication number: JPH11161670A
Application number: JP9329933A
Authority: JP
Inventors: Tsutomu Matsunaga; 務松永; Hiromi Kida; 博巳木田
Original assignee: NTT Data Corp
Current assignee: NTT Data Group Corp
Priority date: 1997-12-01
Filing date: 1997-12-01
Publication date: 1999-06-18
Anticipated expiration: 2017-12-01
Also published as: JP3497712B2

Abstract

PROBLEM TO BE SOLVED: To provide a high-precision information filtering device that automatically reflects a user's interests. SOLUTION: A profile management unit 17 generates a user profile from a correlation matrix obtained from the dimension-reduced set of feature vectors that represent the features of input documents. A filtering processing unit 18 calculates the projection of each document's feature vector onto the corresponding user profile and filters the document on the basis of the calculated value. A correlation matrix is generated again from the set of feature vectors of the filtering results, and the profile management unit 17 updates the corresponding user profile with it. Further, mutually related user profiles are integrated to generate a shared profile, and when a user profile is updated, the corresponding shared profile is updated as well.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a technique for filtering, from a large volume of digitized information, the digitized information whose themes are of high interest to a user.

[0002]

[Prior Art] In recent years, the spread of large-scale, high-speed networks typified by the Internet has given end users an environment in which they can easily obtain many kinds of digitized information in various forms. However, while the digitization of information plays a part in the information society, the resulting volume of information far exceeds what humans can manage, and the development of methods and systems that solve this problem is desired. In addition, as digitized information circulates more widely, it has become necessary to select only the needed information from a large volume of digitized information. Because this selection is too burdensome to perform by hand, automation by computer has been studied, for example information filtering methods that automatically sort a large inflow of digitized information according to the themes a user is interested in.

[0003] In general, an information filtering method uses a reference vector (also called a user profile vector, a user profile, or simply a profile) that expresses what kind of information the user is interested in, so that the user's degree of interest can be quantified and processed by computer. A user profile is obtained, for example, by counting, for each word, the frequency of occurrence of the words contained in a set of text data taken from digitized information that the user is known in advance to be interested in, converting the counts into a vector whose number of dimensions equals the number of distinct words (a 10-dimensional vector if there are 10 distinct words), and normalizing that vector.
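The conventional profile described above can be sketched as follows. This is only an illustration: the function name `build_profile`, whitespace tokenization, and Euclidean (L2) normalization are assumptions, since the text does not say how words are split or which norm is used.

```python
import math
from collections import Counter

def build_profile(texts: list[str]) -> dict[str, float]:
    """Count word occurrences over all texts and scale the counts to unit length."""
    counts = Counter(word for text in texts for word in text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    # One dimension per distinct word; an empty input yields an empty profile.
    return {word: c / norm for word, c in counts.items()} if norm else {}

profile = build_profile([
    "subspace method for filtering",
    "filtering by subspace projection",
])
```

The resulting dictionary has one entry per distinct word, which corresponds to the "one dimension per word type" vector in the text.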

[0004] The subspace method (also called the subspace classification method) is known as one form of vector-based pattern recognition. It is a statistical method that decides the category of a pattern through its projection onto subspaces formed from the distribution of feature vector components. The eigenvectors needed for this transformation are computed, for example, by KL analysis using the Karhunen-Loeve transform, a quantization algorithm. Representative subspace methods include the CLAFIC (CLAss-Featuring Information Compression) method and the Averaged Learning Subspace Method (ALSM). CLAFIC and ALSM use the same kind of classification criterion, and ALSM is an adaptive learning method that also takes competing categories into account. The subspace method is described in detail in, for example, "Pattern Recognition and Subspace Method" (Erkki Oja, Sangyo Tosho).
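For orientation, the CLAFIC-style decision rule can be written as below. This is the standard textbook form and not a formula taken from this document; the symbols are assumed notation.

```latex
% Assumed notation: x is a feature vector, R^{(c)} the correlation matrix of
% category c built from its N_c training vectors, u^{(c)}_k the eigenvectors of
% R^{(c)} ordered by decreasing eigenvalue, and p_c the subspace dimension
% chosen for category c.
R^{(c)} = \frac{1}{N_c} \sum_{n=1}^{N_c} x^{(c)}_n \, x^{(c)\top}_n ,
\qquad
\bigl\| P^{(c)} x \bigr\|^2 = \sum_{k=1}^{p_c} \bigl( u^{(c)\top}_k x \bigr)^2 ,
\qquad
\hat{c}(x) = \arg\max_{c} \bigl\| P^{(c)} x \bigr\|^2 .
```

A vector is assigned to the category whose subspace captures the largest share of its squared length.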

[0005] In actual information filtering, a threshold above which the user is judged to be interested is set in advance, and the target digitized information is ranked in descending order of similarity by referring to the user profile on the basis of that threshold. For example, a predetermined number of items are then selected from the top of the ranking and presented to the user.
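A minimal sketch of this threshold-and-rank step, assuming the similarity scores have already been computed. The names `rank_documents`, `threshold`, and `top_n` are hypothetical.

```python
def rank_documents(scores: dict[str, float], threshold: float, top_n: int) -> list[str]:
    """Keep documents scoring at or above the threshold, best first, up to top_n."""
    kept = [(doc, score) for doc, score in scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept[:top_n]]

presented = rank_documents({"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.8}, threshold=0.5, top_n=2)
```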

[0006]

[Problems to Be Solved by the Invention] It is well known that filtering errors are of two kinds, "omission", in which needed information is dropped, and "noise", in which unneeded information is let through, and that there is a trade-off between them. Conventional filtering, however, is one-sided: it looks only at the user's degree of interest, that is, at the "needed information", and considers only the reduction of "omission". Because it does not directly consider the removal of "noise", there has been a limit to how far filtering accuracy can be improved.

[0007] Conventional filtering also cannot flexibly handle a user who has several interests or whose interests change over time. Specifically, work and hobbies are long-term interests of a user, whereas incidents and the like are temporary interests; because conventional filtering has to determine the degree of interest through uniform keyword input or the like, automatic filtering that follows changes in the user's interests has been impossible.

[0008] An object of the present invention is therefore to provide an information filtering method that creates the user profile, which is the user's interest information, automatically by means of a learning function that follows the user's changing interests, without requiring the user to set or evaluate the profile, and that can keep filtering accuracy at or above a certain level. Another object of the present invention is to provide an information filtering apparatus suitable for carrying out this information filtering method.

[0009]

[Means for Solving the Problems] To solve the above problems, the present invention provides an information filtering method comprising: a step of extracting, from digitized information to which identification information indicating whether or not the user is interested has been attached, learning vectors from which redundant dimensions have been removed, and applying a predetermined subspace classification criterion to the learning vectors to create a user profile corresponding to either the "interested" or the "not interested" category; a step of, when new digitized information to be sorted is input, extracting a target vector representing the features of the new digitized information, obtaining the feature difference between the target vector and the created user profile by projection onto the subspace corresponding to that user profile, and sorting the new digitized information into either the "interested" or the "not interested" category on the basis of the feature difference; and a step of creating, from the sorted digitized information, an update profile in the same format as the user profile and updating the user profile with the update profile.

[0010] More preferably, the information filtering method further includes a step of integrating a plurality of mutually related user profiles to create a shared profile that is shared among those user profiles. In this case, the sorting step sorts the new digitized information into either the "interested" or the "not interested" category on the basis of the feature difference from the shared profile or from the user profile.

[0011] An information filtering apparatus of the present invention that solves the other problem above comprises: vector processing means for extracting, from the features of digitized information, a vector from which redundant dimensions have been removed; profile creation means for applying a predetermined subspace classification criterion to learning vectors extracted by the vector processing means from digitized information to which identification information indicating whether or not the user is interested has been attached, thereby creating a user profile corresponding to either the "interested" or the "not interested" category; and sorting means for obtaining the feature difference between a target vector, extracted by the vector processing means from new digitized information to be sorted, and the user profile by projection onto the subspace corresponding to that user profile, and sorting the new digitized information into either the "interested" or the "not interested" category on the basis of the feature difference. The apparatus is configured so that new learning vectors are extracted from the sorting results of the sorting means and passed to the profile creation means.

[0012] More preferably, the apparatus further comprises user management means for managing the user profiles on a per-user basis, and is configured so that, when the user management means recognizes that a new user is sorting for the first time, it interactively acquires initial profile setting data for that new user and causes the profile creation means to create the user profile for that new user.

[0013] The profile creation means is configured, for example, to create a correlation matrix from the extracted learning vectors either on the basis of the subspace classification criterion or on the basis of the adaptive learning conditions of a predetermined averaged learning subspace method. The former is effective when a user profile is newly created, and the latter when a user profile is updated. The vector processing means is configured to remove the redundant dimensions by KL analysis based on an orthonormal transform.

[0014] Another information filtering apparatus of the present invention further comprises shared profile processing means that integrates a plurality of mutually related user profiles to create and store a shared profile that is shared among the user profiles before integration, and that, when at least one of the user profiles related to the shared profile is updated, reflects the update in the shared profile. The sorting means performs the sorting by selectively using the user profile or the shared profile. The shared profile processing means, for example, calculates for each of the user profiles that are candidates for integration a distance value that focuses on the difference between interest and lack of interest, and decides from the sum of these distance values whether to integrate them.

[0015] An information filtering system of the present invention that solves the other problem above connects the information filtering apparatus to a communication line so that digitized information circulating over the communication line is taken into the information filtering apparatus. In this case, the information filtering apparatus is desirably configured to filter the digitized information taken in through agent means.

[0016]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings.

(First Embodiment) FIG. 1 is a functional block diagram of an information filtering apparatus to which the present invention is applied. In the figure, solid lines represent the flow of processing and broken lines represent the flow of data. The information filtering apparatus 1 comprises a document database (not shown) and a profile database (hereinafter, profile DB) 20, built inside or outside a computer, for example a stand-alone computer, together with the following units, which are formed when the computer reads and executes a predetermined program: a data input unit 11, a user management unit 12, a feature vector extraction unit 13, a dimension processing unit 14, a process selection unit 15, a correlation matrix creation unit 16, a profile management unit 17, a filtering processing unit 18, and a result output unit 19. Although not shown, the apparatus also includes a display device with a setting interface for interactively acquiring the initial profile setting data and the grouping criterion setting data described later, an input device for taking in documents, and an output device for outputting filtering results.

[0017] The program is normally stored in an internal or external storage device of the computer and is read and executed as needed. It may instead be recorded on a recording medium separable from the computer, for example a portable recording medium such as a CD-ROM or FD, or on a program server connected to a local network, and be read at the time of use, installed in the internal or external storage device, and then executed as needed.

[0018] The profile DB 20 stores, for each user, the user's interest information for digitized information (hereinafter called a document, whether singular or plural, unless otherwise noted), that is, the user profiles representing the "interested" and "not interested" categories, and the shared profiles that are shared among a plurality of user profiles. The user profiles and shared profiles are created by the subspace method described above and are updated each time the user accesses the apparatus and performs sorting (filtering).

[0019] The data input unit 11 takes in a document when the user accesses the apparatus, that is, when the user has the apparatus read a document through an input device such as a scanner, and passes it to the user management unit 12. The user management unit 12 manages the user profiles and the like for each user; triggered by the user's access, it determines whether the user is already registered and selects the subsequent processing according to the result.

[0020] For access by a new user, the user management unit 12 assigns initial profile setting data to the document received from the data input unit 11 and passes the document to the feature vector extraction unit 13. It also notifies the process selection unit 15 that the access is from a new user. The initial profile setting data is the new user's "interested" or "not interested" identification information, acquired interactively through the setting interface. For access by a registered user, the user management unit 12 passes the input document to the feature vector extraction unit 13 and notifies the process selection unit 15 of the user's identification information. The user management unit 12 also sets the grouping criterion that the profile management unit 17 uses to create the shared profiles described later. The grouping criterion indicates whether profiles are related to each other, for example which users share common interests, and is set in advance through system parameters. It can also be set explicitly by individual users through the setting interface.

[0021] The feature vector extraction unit 13 extracts vectors representing the features of the input document (a set of feature vectors, or a single vector representing that set). Specifically, taking the number of distinct keywords (hereinafter, words) appearing in the document as the number of dimensions, it computes one or more vectors in which the frequency of occurrence of each word is weighted. The words can be weighted by the well-known TF-IDF method. The extracted vectors are passed to the dimension processing unit 14.
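The weighting step can be illustrated with a small TF-IDF sketch. The text only names the TF-IDF method; the variant used here (raw term count multiplied by log(N/df)), the pre-tokenized input, and the name `tfidf_vectors` are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs: list[list[str]]) -> tuple[list[str], list[list[float]]]:
    """Return the vocabulary and one TF-IDF weighted vector per tokenized document."""
    vocab = sorted({word for doc in docs for word in doc})
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    idf = {word: math.log(len(docs) / df[word]) for word in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[word] * idf[word] for word in vocab])
    return vocab, vectors

vocab, vectors = tfidf_vectors([
    ["profile", "filter", "filter"],
    ["profile", "matrix"],
])
```

With this variant a word that occurs in every document (here "profile") gets weight zero, which is one reason the later dimension reduction finds low-weight dimensions to discard.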

[0022] The dimension processing unit 14 applies the KL analysis described above, that is, principal component analysis by an orthonormal transform, to the vectors extracted by the feature vector extraction unit 13, and removes the redundant dimensions whose weights are relatively low (this is also called dimension compression). The dimension-reduced vectors are passed to the process selection unit 15.
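One way to picture this dimension reduction is the sketch below, which keeps the leading eigenvectors of the (uncentered) autocorrelation matrix and projects each vector onto them. Power iteration with deflation stands in for whatever eigen-solver an actual implementation would use, and all function names, the iteration count, and the choice not to subtract the mean are assumptions.

```python
import math

Vector = list[float]
Matrix = list[list[float]]

def autocorrelation(vectors: list[Vector]) -> Matrix:
    """Average of the outer products x x^T over all vectors."""
    dim, n = len(vectors[0]), len(vectors)
    return [[sum(v[i] * v[j] for v in vectors) / n for j in range(dim)]
            for i in range(dim)]

def top_eigenvectors(matrix: Matrix, k: int, iters: int = 500) -> list[Vector]:
    """Leading k eigenvectors of a symmetric matrix (power iteration + deflation)."""
    dim = len(matrix)
    work = [row[:] for row in matrix]
    basis: list[Vector] = []
    for n in range(k):
        v = [1.0 / (i + n + 1) for i in range(dim)]  # fixed, non-random start
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(dim)) for i in range(dim)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * work[i][j] * v[j]
                         for i in range(dim) for j in range(dim))
        basis.append(v)
        # Deflate: remove the component that was just found.
        for i in range(dim):
            for j in range(dim):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return basis

def reduce_dimensions(vector: Vector, basis: list[Vector]) -> Vector:
    """Coordinates of the vector in the retained eigenvector basis."""
    return [sum(b[i] * vector[i] for i in range(len(vector))) for b in basis]

training = [[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
basis = top_eigenvectors(autocorrelation(training), k=2)
reduced = reduce_dimensions([3.0, 0.0, 0.0], basis)  # 3 dimensions reduced to 2
```

In this toy data the third coordinate carries no weight, so dropping it loses nothing; real document vectors would lose a small amount of information in exchange for far fewer dimensions.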

[0023] The process selection unit 15 selects the subsequent processing for the vectors whose dimensions have been reduced by the dimension processing unit 14. Specifically, it selects profile creation when the access is from a new user, filtering (sorting) when input documents remain in the data input unit 11, and profile update after filtering has been completed.

[0024] When the process selected by the process selection unit 15 is profile creation or profile update, the correlation matrix creation unit 16 creates correlation matrices corresponding to the respective dimension-reduced vectors.

[0025] The profile management unit 17 mainly stores, reads, and updates the user profiles and shared profiles in the profile DB 20. That is, it treats the correlation matrix created by the correlation matrix creation unit 16 as the user profile representing the user's interests, stores it in the profile DB 20 for each user, and reads it from the profile DB 20 as needed during filtering. It also integrates the user profiles in the profile DB 20 that match the grouping criterion set by the user management unit 12 (by summing the corresponding correlation matrices) to create the shared profile for that group's identification information, and when a new or updated user profile is related to a shared profile, it updates that shared profile with the information of the user profile. Providing shared profiles lets users contribute the information they hold and share interest information with one another. In other words, even when a particular user profile lacks some interest information, that information can be referred to in a complementary way if it exists in the shared profile.
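Integration by summing correlation matrices can be sketched as follows. `merge_profiles` follows the text directly; `refresh_shared` is an assumed update rule (replace the member's old contribution with its new one), since the text does not spell out how an updated user profile is reflected in the shared profile.

```python
Matrix = list[list[float]]

def merge_profiles(profiles: list[Matrix]) -> Matrix:
    """Shared profile as the element-wise sum of the members' correlation matrices."""
    dim = len(profiles[0])
    return [[sum(p[i][j] for p in profiles) for j in range(dim)] for i in range(dim)]

def refresh_shared(shared: Matrix, old_member: Matrix, new_member: Matrix) -> Matrix:
    """Assumed update rule: swap a member's old contribution for its new one."""
    dim = len(shared)
    return [[shared[i][j] - old_member[i][j] + new_member[i][j] for j in range(dim)]
            for i in range(dim)]

user_a = [[1.0, 0.0], [0.0, 1.0]]
user_b = [[2.0, 1.0], [1.0, 0.0]]
shared = merge_profiles([user_a, user_b])

user_a_new = [[1.0, 0.5], [0.5, 1.0]]
shared_after_update = refresh_shared(shared, user_a, user_a_new)
```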

[0026] When the process selection unit 15 selects filtering, the filtering processing unit 18 filters the document (in practice, its dimension-reduced vector) using the user's user profile stored in the profile DB 20 or the shared profile to which the user belongs. The filtering result is output through the result output unit 19 to the output device or the document database. The result output unit 19 also has a function of feeding the filtering result back to the data input unit 11 so that it is reflected in the existing user profile.

[0027] Next, the concepts of profiles and filtering in the present invention are described. FIG. 2 shows the procedure for profile creation. The profile here is created from documents input by a new user, and "interested" or "not interested" identification information has been attached to these documents as described above. A "new user" here includes a user who is already managed by the user management unit 12 but has never accessed the apparatus and is filtering for the first time.

[0028] When a document from a new user is input (step S101), the information filtering apparatus 1 extracts vectors (a feature vector set) whose number of dimensions is the number of distinct words appearing in the document and in which the frequency of occurrence of each word is weighted (step S102). It applies the dimension reduction described above to the extracted vectors (step S103) and creates a correlation matrix from the resulting vectors (step S104). The correlation matrix is created according to a predetermined subspace classification criterion, that is, a criterion based on the subspace method used for pattern recognition in a vector space model. For this criterion, see, for example, "Pattern Recognition and Subspace Method" (Erkki Oja, Sangyo Tosho). The correlation matrix is stored as the user profile (step S105). Because the user profile is a correlation matrix, the interest information representing the user's interests is expressed in terms of the co-occurrence relationships between the words appearing in the documents.
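Steps S104 and S105 can be sketched as building one correlation matrix per label from the dimension-reduced vectors. Averaging the outer products is an assumed normalization, and the function names and dictionary keys are hypothetical.

```python
Vector = list[float]
Matrix = list[list[float]]

def correlation_matrix(vectors: list[Vector]) -> Matrix:
    """Average outer product; entry (i, j) measures how strongly dimensions i and j co-occur."""
    dim, n = len(vectors[0]), len(vectors)
    return [[sum(v[i] * v[j] for v in vectors) / n for j in range(dim)]
            for i in range(dim)]

def create_user_profile(labeled: list[tuple[Vector, bool]]) -> dict[str, Matrix]:
    """One matrix per category; assumes each category has at least one vector."""
    interested = [v for v, is_interested in labeled if is_interested]
    not_interested = [v for v, is_interested in labeled if not is_interested]
    return {
        "interested": correlation_matrix(interested),
        "not_interested": correlation_matrix(not_interested),
    }

profile = create_user_profile([
    ([1.0, 0.0], True),
    ([1.0, 1.0], True),
    ([0.0, 2.0], False),
])
```

The off-diagonal entry 0.5 in the "interested" matrix is non-zero because one of the two documents has weight in both dimensions at once, which is the co-occurrence information the paragraph refers to.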

[0029] FIG. 3 shows the procedure for filtering. When a document is input (step S201), the information filtering apparatus 1 applies the same processing as in steps S102 and S103 to the document and extracts a dimension-reduced vector (steps S202 and S203). It then applies the KL analysis described above to the extracted vector and to the user profile or shared profile read from the profile DB 20 with the user's identification information as the search key, computes the eigenvalues and eigenvectors, and sorts the document by calculating the projection of the vector onto the subspace (profile) (step S204).
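The sorting in step S204 can be sketched as comparing the squared projection of the document vector onto the "interested" subspace with that onto the "not interested" subspace. The subspace dimension `k`, the tie-breaking toward "not interested", the power-iteration eigen-solver, and all names are assumptions for illustration.

```python
import math

Vector = list[float]
Matrix = list[list[float]]

def top_eigenvectors(matrix: Matrix, k: int, iters: int = 500) -> list[Vector]:
    """Leading k eigenvectors of a symmetric matrix (power iteration + deflation)."""
    dim = len(matrix)
    work = [row[:] for row in matrix]
    basis: list[Vector] = []
    for n in range(k):
        v = [1.0 / (i + n + 1) for i in range(dim)]  # fixed, non-random start
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(dim)) for i in range(dim)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * work[i][j] * v[j]
                         for i in range(dim) for j in range(dim))
        basis.append(v)
        for i in range(dim):
            for j in range(dim):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return basis

def projection_value(vector: Vector, basis: list[Vector]) -> float:
    """Squared length of the projection of the vector onto span(basis)."""
    return sum(sum(b[i] * vector[i] for i in range(len(vector))) ** 2 for b in basis)

def classify(vector: Vector, interested: Matrix, not_interested: Matrix, k: int = 1) -> str:
    pos = projection_value(vector, top_eigenvectors(interested, k))
    neg = projection_value(vector, top_eigenvectors(not_interested, k))
    return "interested" if pos > neg else "not interested"

interested = [[4.0, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.0]]
not_interested = [[0.0, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 4.0]]
label = classify([0.9, 0.1, 0.2], interested, not_interested)
```

Comparing against both subspaces, rather than thresholding the "interested" projection alone, is what lets the method account for noise as well as omission.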

[0030] FIG. 4 shows the procedure for profile update. The documents input here are the filtering results obtained by the procedure of FIG. 3. "Interested" or "not interested" identification information is attached to these filtering results as well. When a document that is a filtering result is input (step S301), the information filtering apparatus 1 applies the same processing as in steps S102 and S103 to the document and extracts a dimension-reduced vector (steps S302 and S303). It then re-creates the correlation matrix from the extracted vectors according to the adaptive learning conditions of ALSM (step S304). Finally, it updates the stored correlation matrix of the corresponding user profile, and the corresponding subspace, with the re-created correlation matrix (step S305). In this way the user profile is updated automatically at each feedback.
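The document does not give the ALSM learning conditions themselves. The sketch below shows a simplified two-category reading of the update usually associated with ALSM: a category's correlation matrix is reinforced by its own vectors that were wrongly sorted into the other category, and weakened by the other category's vectors that were wrongly sorted into it. The function name and the learning rates `alpha` and `beta` are hypothetical.

```python
Vector = list[float]
Matrix = list[list[float]]

def alsm_update(profile: Matrix, missed: list[Vector], intruders: list[Vector],
                alpha: float = 0.1, beta: float = 0.1) -> Matrix:
    """Re-create one category's correlation matrix from misclassified vectors.

    missed:    vectors of this category that were sorted into the other one
    intruders: vectors of the other category that were sorted into this one
    """
    dim = len(profile)
    return [[profile[i][j]
             + alpha * sum(x[i] * x[j] for x in missed)
             - beta * sum(x[i] * x[j] for x in intruders)
             for j in range(dim)]
            for i in range(dim)]

updated = alsm_update(
    [[1.0, 0.0], [0.0, 1.0]],
    missed=[[1.0, 1.0]],
    intruders=[[0.0, 2.0]],
    alpha=0.5,
    beta=0.25,
)
```

After such an update the subspace would be recomputed from the eigenvectors of the new matrix, which is how the profile can track a change in the user's interests.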

[0031] Next, shared profiles are described in more detail. Creation, filtering, and updating of a shared profile can basically follow the procedures of FIGS. 2 to 4. The description here centers on creating a shared profile that takes into account both the "interested" and the "not interested" labels attached to documents. The category corresponding to the identification information "interested" is called "positive interest", and the category corresponding to "not interested" is called "negative interest". When related user profiles are integrated, taking into account the degree of similarity between the user profiles, that is, the size of the distance value, is important for creating an effective shared profile.

[0032] For example, when the distance value of user profile A with respect to user profile B over a document set D is computed, attending to both "positive interest" and "negative interest" allows the shared profile to be created with high accuracy. This is explained as follows. Suppose the document set D contains a document K. First, for document K under profile B, the difference between the projection value for "positive interest" and the projection value for "negative interest" is calculated (first projection difference value). Next, for document K under profile A, the difference between the projection value for "positive interest" and the projection value for "negative interest" is calculated (second projection difference value). Then the difference between the first projection difference value of profile B and the second projection difference value of profile A is calculated, and the square of this value is taken as the difference distance. Expressed as a Euclidean distance, the difference distance represents how much user profiles A and B differ in their interest or lack of interest in document K.

【００３３】次に、同様にして、差分距離を文書集合Ｄ
に含まれるすべての文書に対して算出し、これらの各差
分距離の総和を算出する。この総和による算出値が、文
書集合ＤにおけるユーザプロファイルＡのユーザプロフ
ァイルＢに対する距離値であり、関心の程度が似通った
文書の順に足しあげる際の、すなわち統合する場合のユ
ーザプロファイル間における差異の尺度となるものであ
る。この距離値が大きい場合には対象となるユーザプロ
ファイル間では類似度合いが小さく、一方、距離値が小
さい場合にはユーザプロファイル間における類似度合い
は大きくなる。共用ファイルを作成する際には、この類
似度合いが所定範囲内のもの（例えば予め定めた閾値以
下のもの）を統合するように構成する。Next, the difference distance is calculated in the same way for every document in the document set D, and the sum of these difference distances is calculated. This sum is the distance value of user profile A with respect to user profile B over the document set D. It serves as a measure of the difference between user profiles when documents are added up in order of similar degree of interest, that is, when the profiles are integrated. When the distance value is large, the degree of similarity between the user profiles is small; when the distance value is small, the degree of similarity is large. When a shared file is created, profiles whose degree of similarity is within a predetermined range (for example, at or below a predetermined threshold) are integrated.
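
The distance calculation described in the two paragraphs above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary layout (document ID mapped to a pair of positive and negative projection values), and the sample numbers are all assumptions made for the example, and the projection values are taken as given inputs.

```python
def difference_distance(pos_a, neg_a, pos_b, neg_b):
    """Difference distance for one document K (hypothetical helper)."""
    first = pos_b - neg_b    # first projection difference value (profile B)
    second = pos_a - neg_a   # second projection difference value (profile A)
    return (first - second) ** 2


def profile_distance(proj_a, proj_b):
    """Sum of the difference distances over every document in set D."""
    total = 0.0
    for doc_id, (pos_a, neg_a) in proj_a.items():
        pos_b, neg_b = proj_b[doc_id]
        total += difference_distance(pos_a, neg_a, pos_b, neg_b)
    return total


def should_integrate(proj_a, proj_b, threshold):
    """Integrate two profiles when the distance is at or below the threshold."""
    return profile_distance(proj_a, proj_b) <= threshold


# Illustrative projection values only: doc_id -> (positive, negative)
proj_a = {"K1": (0.9, 0.1), "K2": (0.2, 0.7)}
proj_b = {"K1": (0.8, 0.2), "K2": (0.3, 0.6)}
distance = profile_distance(proj_a, proj_b)
```

A small distance means the two users judge the same documents similarly, so their profiles are candidates for integration; the threshold value itself is not given in the text and would have to be chosen by the operator.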

【００３４】次に、情報フィルタリング装置１を用いた
情報フィルタリング方法を、図５及び図６を参照して説
明する。まず、初期プロファイルを作成する場合の例を
説明する。文書がデータ入力部１１を通じて入力され、
新規ユーザであることを確認すると（ステップＳ４０
１、Ｓ４０２）、ユーザ管理部１２は、その文書に初期
プロファイル設定用データを付与して特徴ベクトル抽出
部１３へ送る。特徴ベクトル抽出部１３は、この文書か
らベクトル抽出を行う（ステップＳ４０３）。さらに、
抽出したベクトルから次元処理部１４で冗長な次元を削
減する（ステップＳ４０４）。相関行列作成部１６は、
この次元削減されたベクトルから相関行列を作成し（ス
テップＳ４０６）、プロファイル管理部１７に送る。プ
ロファイル管理部１７は、この相関行列をユーザプロフ
ァイル（初期プロファイル）として、プロファイルＤＢ
２２に保存する（ステップＳ４０７）。Next, an information filtering method using the information filtering device 1 is described with reference to FIGS. 5 and 6. First, an example of creating an initial profile is described. When a document is input through the data input unit 11 and the user is confirmed to be a new user (steps S401, S402), the user management unit 12 attaches initial profile setting data to the document and sends it to the feature vector extraction unit 13. The feature vector extraction unit 13 extracts a vector from the document (step S403). The dimension processing unit 14 then removes redundant dimensions from the extracted vector (step S404). The correlation matrix creation unit 16 creates a correlation matrix from the dimension-reduced vector (step S406) and sends it to the profile management unit 17. The profile management unit 17 stores this correlation matrix in the profile DB 22 as the user profile (initial profile) (step S407).
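
The correlation matrix step (S406) can be illustrated with a short sketch. The text does not state the exact formula, so this example assumes the common definition used in subspace methods, the average of the outer products of the feature vectors; the function name and the toy vectors are hypothetical.

```python
def correlation_matrix(vectors):
    """Average of the outer products x x^T of the given feature vectors.

    Assumed definition; the source only says a correlation matrix is
    created from the dimension-reduced vectors.
    """
    if not vectors:
        raise ValueError("at least one feature vector is required")
    n = len(vectors)
    dim = len(vectors[0])
    matrix = [[0.0] * dim for _ in range(dim)]
    for x in vectors:
        for i in range(dim):
            for j in range(dim):
                matrix[i][j] += x[i] * x[j] / n
    return matrix


# Two toy dimension-reduced vectors standing in for labeled documents
training_vectors = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
initial_profile = correlation_matrix(training_vectors)
```

The off-diagonal entries record how often two dimensions are active together, which is how word co-occurrence ends up encoded in the profile.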

【００３５】図６に移り、初期プロファイルの作成が終
了した場合は、そのユーザが属すべき共用プロファイル
を設定されているかどうかを調べる。共用プロファイル
が設定されている場合は（ステップＳ４０８：Yes）、
当該共用プロファイルが既に存在するか否かを調べ、存
在しない場合には（ステップＳ４０９：No）、共用プロ
ファイル処理部２１で、対応する共用プロファイルを新
規作成する（ステップＳ４１０）。当該共用プロファイ
ルが既に存在する場合は（ステップＳ４０９：Yes）、
その既存の共用プロファイルを、初期プロファイルの情
報で更新する（ステップＳ４１１）。次の文書がある場
合はステップＳ４０１に戻り（ステップＳ４１２：Ye
s）、上記処理を繰り返す。初期プロファイルの作成だ
けを行う場合は処理を終了する（ステップＳ４１２：N
o）。Turning to FIG. 6, when creation of the initial profile is complete, it is checked whether a shared profile to which the user should belong has been set. If a shared profile has been set (step S408: Yes), it is checked whether that shared profile already exists; if it does not (step S409: No), the shared profile processing unit 21 newly creates the corresponding shared profile (step S410). If the shared profile already exists (step S409: Yes), the existing shared profile is updated with the information of the initial profile (step S411). If there is a next document (step S412: Yes), the process returns to step S401 and the above processing is repeated. If only the initial profile is to be created, the process ends (step S412: No).
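
The branch structure of steps S408 to S411 can be sketched as below. This is an illustration under stated assumptions: the shared profiles are held in a plain dictionary keyed by a hypothetical group ID, and the update in step S411 is assumed to be an element-wise addition of the user's correlation matrix, in line with the later statement that a shared profile is generated from the sum of the correlation matrices.

```python
def add_matrices(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def register_in_shared_profile(shared_profiles, group_id, user_matrix):
    """Create or update the shared profile a user belongs to.

    shared_profiles: dict mapping a hypothetical group ID to a matrix.
    group_id: None when no shared profile is set for the user.
    """
    if group_id is None:                      # S408: No
        return None
    if group_id not in shared_profiles:       # S409: No -> S410 (create)
        shared_profiles[group_id] = [row[:] for row in user_matrix]
    else:                                     # S409: Yes -> S411 (update)
        shared_profiles[group_id] = add_matrices(
            shared_profiles[group_id], user_matrix
        )
    return shared_profiles[group_id]


shared = {}
register_in_shared_profile(shared, "group-1", [[1.0, 0.0], [0.0, 1.0]])
register_in_shared_profile(shared, "group-1", [[1.0, 2.0], [2.0, 1.0]])
```

Because every profile here is a matrix of the same size, combining them needs no alignment step, which is the processing-load advantage the description attributes to using the correlation matrix as the profile.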

【００３６】図５に戻り、この初期プロファイル、更新
されて保存されているユーザプロファイル、あるいは共
用プロファイル（便宜上、単にプロファイルとする）を
用いてフィルタリングを行う場合は、選別対象となる文
書に対してステップＳ４０２〜Ｓ４０３の処理を施し、
これにより得られたベクトルをフィルタリング処理部１
８に送る。フィルタリング処理部１８は、このベクトル
と該当するプロファイルとの射影を算出し（ステップＳ
４１３）、算出結果に基づいて文書選別を行う（ステッ
プＳ４１４）。選別結果は結果出力部１９を通じてユー
ザに提示される。また、プロファイル更新のためにステ
ップＳ４０１に戻る（ステップＳ４１５）。Returning to FIG. 5, when filtering is performed using this initial profile, an updated and stored user profile, or a shared profile (for convenience, simply called a profile), the processing of steps S402 to S403 is applied to the document to be sorted, and the resulting vector is sent to the filtering processing unit 18. The filtering processing unit 18 calculates the projection of this vector onto the corresponding profile (step S413) and sorts the document based on the result (step S414). The sorting result is presented to the user through the result output unit 19. To update the profile, the process returns to step S401 (step S415).
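
Steps S413 and S414 can be illustrated roughly as follows. The text does not give the projection formula, so this sketch substitutes a simpler stand-in: the quadratic form x^T R x against a "positive interest" and a "negative interest" correlation matrix, with the document assigned to the larger score. A full subspace method would project onto the leading eigenvectors of each matrix instead; the function names and toy matrices are hypothetical.

```python
def quadratic_form(matrix, x):
    """Compute x^T R x, used here as a stand-in for the projection value."""
    size = len(x)
    return sum(x[i] * matrix[i][j] * x[j] for i in range(size) for j in range(size))


def select_document(x, positive_profile, negative_profile):
    """Sort a document vector into 'interested' or 'not interested'."""
    pos = quadratic_form(positive_profile, x)
    neg = quadratic_form(negative_profile, x)
    return "interested" if pos > neg else "not interested"


# Toy profiles: dimension 0 marks interest, dimension 1 marks lack of interest
positive_profile = [[1.0, 0.0], [0.0, 0.0]]
negative_profile = [[0.0, 0.0], [0.0, 1.0]]
label_a = select_document([0.9, 0.1], positive_profile, negative_profile)
label_b = select_document([0.1, 0.9], positive_profile, negative_profile)
```

Comparing the two scores, rather than thresholding a single one, reflects the two-sided ("interested" and "not interested") criterion that the description emphasizes.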

【００３７】プロファイル更新は、選別結果である文書
に対してステップＳ４０２〜Ｓ４０３の処理を施し、こ
れにより得られたベクトルを相関行列作成部１６に送
る。相関行列作成部１６は、このベクトルに基づいて相
関行列を再作成する（ステップＳ４１６）。プロファイ
ル管理部１７は、対応するプロファイルを再作成した相
関行列で更新する（ステップＳ４１７）。共用プロファ
イルに関する処理（図６参照）は、初期プロファイルの
作成の場合と同様となる。To update a profile, the processing of steps S402 to S403 is applied to the documents obtained as the sorting result, and the resulting vectors are sent to the correlation matrix creation unit 16. The correlation matrix creation unit 16 re-creates the correlation matrix based on these vectors (step S416). The profile management unit 17 updates the corresponding profile with the re-created correlation matrix (step S417). The processing related to the shared profile (see FIG. 6) is the same as when creating the initial profile.

【００３８】なお、本実施形態では、相関行列をユーザ
プロファイルとしているが、これは入力文書における単
語の共起関係に着目し、共用プロファイルとして各ユー
ザプロファイルの合成の際の処理負荷を軽減することを
目的とするものである。しかしながらこの手法以外に
も、例えば、相関行列に対してＫＬ解析を施し、その結
果得られるベクトル空間をユーザプロファイとして作成
することも可能である。この場合の情報量は、ユーザプ
ロファイルに相関行列を用いる場合とほぼ同じになる
が、共用プロファイルとして各ユーザプロファイルを合
成する際に、次元数を統一するための処理が必要となる
ものである。In this embodiment, the correlation matrix is used as the user profile. This focuses on the co-occurrence of words in the input documents and is intended to reduce the processing load when combining user profiles into a shared profile. Besides this approach, it is also possible, for example, to apply KL analysis to the correlation matrix and use the resulting vector space as the user profile. In that case the amount of information is almost the same as when the correlation matrix is used as the user profile, but processing to unify the number of dimensions is required when combining user profiles into a shared profile.

【００３９】（第２実施形態）本発明は、通信回線とし
てインタネット等の公衆網を介して流通する大量の電子
化情報に対して自動的なフィルタリングを行うシステ
ム、例えば、上記情報フィルタリング装置として機能す
る情報フィルタリングサーバ、公衆網から情報を取得す
る機能を有するクライアントを配備した情報フィルタリ
ングシステムの形態で実施することも可能である。(Second Embodiment) The present invention can also be implemented as a system that automatically filters a large amount of digitized information distributed over a communication line, such as a public network like the Internet; for example, as an information filtering system that deploys an information filtering server functioning as the information filtering device described above and a client having a function of acquiring information from the public network.

【００４０】この場合の情報フィルタリングサーバは、
例えば、インタネット環境上における複数の大規模なデ
ータベースを具備した各種情報提供サーバからの電子化
情報からクライアントに最適な情報を選択して提供する
情報提供支援サーバ、所謂、情報ナビゲーションサーバ
として位置付けることができる。The information filtering server in this case can be positioned as an information provision support server, a so-called information navigation server, that selects the information best suited to a client from the digitized information supplied by various information providing servers equipped with multiple large-scale databases in an Internet environment, and provides it to the client.

【００４１】この場合の構成例としては、コンピュータ
装置の内部あるいは外部記憶装置に、上記プロファイル
ＤＢ２０と同一のデータベースを構築し、公衆網を介し
てクライアント及び上記各種情報提供サーバとの通信を
行う通信制御部を具備する。さらに上記情報フィルタリ
ング装置１と同様の機能ブロック、すなわち、データ入
力部１１、ユーザ管理部１２、特徴ベクトル抽出部１
３、次元処理部１４、処理選択部１５、相関行列作成部
１６、プロファイル管理部１７、フィルタリング処理部
１８、結果出力部１９、を具備して構成する。As a configuration example in this case, a database identical to the profile DB 20 described above is built in an internal or external storage device of a computer, and a communication control unit is provided that communicates with the client and the various information providing servers via the public network. In addition, the server includes the same functional blocks as the information filtering device 1, namely the data input unit 11, user management unit 12, feature vector extraction unit 13, dimension processing unit 14, processing selection unit 15, correlation matrix creation unit 16, profile management unit 17, filtering processing unit 18, and result output unit 19.

【００４２】この情報フィルタリングサーバが上記情報
フィルタリング装置１と相違する点は、通信制御を行う
公知の通信制御部を具備する点である。この通信制御部
を介して流通する電子化情報群をデータ入力部１１に入
力し、クライアントからの情報取得要求を受け付けるよ
うに構成することで、ネットワークを用いた情報フィル
タリングが可能になる。この場合の情報取得要求の入力
は、例えばＷＷＷ環境のブラウザ等をインタフェースと
して使用することができる。また、上記各種情報提供サ
ーバからの電子化情報群は、必ず情報フィルタリングサ
ーバを経由するようにし、この電子化情報群に対する選
別結果を通信制御部を介してクライアントに提供するよ
うに構成する。また、情報フィルタリングサーバは、例
えば、インタネット環境におけるサーバのエージェント
技術と融合することにより、流通する大量の電子化情報
群に対して自動的なフィルタリングを行うシステムの構
築が可能になる。This information filtering server differs from the information filtering device 1 in that it includes a known communication control unit that performs communication control. By inputting the digitized information distributed through this communication control unit to the data input unit 11 and accepting information acquisition requests from clients, information filtering over a network becomes possible. An information acquisition request can be entered using, for example, a browser in a WWW environment as the interface. The digitized information from the various information providing servers is always routed through the information filtering server, and the sorting result for that information is provided to the client via the communication control unit. Furthermore, by combining the information filtering server with, for example, server agent technology in an Internet environment, a system that automatically filters the large amount of digitized information in circulation can be built.

【００４３】次に、実際に、本実施形態の情報フィルタ
リング装置１における評価実験を行った結果について説
明する。この実験では、評価用文書として１９９５年１
１月から１９９６年１０月までの１年間分の日本語によ
る新聞記事を使用し、前半年分を初期プロファイル用設
定データの文書集合（以下、訓練集合）、また後半年分
を評価実験用の文書集合（以下、評価集合）とした。Next, the results of an evaluation experiment actually performed with the information filtering device 1 of this embodiment are described. In this experiment, one year of Japanese-language newspaper articles, from November 1995 to October 1996, was used as the evaluation documents. The first half-year was used as the document set for initial profile setting data (hereinafter, the training set) and the second half-year as the document set for the evaluation experiment (hereinafter, the evaluation set).

【００４４】これらの記事は、図７に示す５つのジャン
ルのいずれかに属するとともに、図中の記事数が各ジャ
ンル毎のデータ数を表している。また各記事は、タイト
ルと本文から構成され、当該記事が属するジャンル名が
付与されているものとする。本実験では、単語を特定
し、訓練集合においてその単語を含むものを「関心
有」、その他を「関心無」の記事と便宜的に分けてユー
ザプロファイルを生成した。ここでは、これらの単語を
関心事項として「トピック」とし、対応するトピックに
よるプロファイルで評価集合をフィルタリングした実験
結果を示すものである。Each of these articles belongs to one of the five genres shown in FIG. 7, and the article counts in the figure are the numbers of data items for each genre. Each article consists of a title and a body and is labeled with the name of the genre to which it belongs. In this experiment, a word was specified, and for convenience the articles in the training set containing that word were treated as “interested” and all others as “not interested” to generate a user profile. These words are regarded as items of interest and called “topics” here, and the results shown are from experiments in which the evaluation set was filtered with the profile for the corresponding topic.

【００４５】図８は、単語「輸入」をトピックにフィル
タリングして得られた１２２５記事のうち、２５記事の
タイトルを示している。この結果によれば、トピック
「輸入」と意味の類似する単語「貿易」に関する記事の
ほか、必ずしも語彙上関連深くない単語「病気」に関す
る記事が比較的目立つことがわかる。これは、「輸入」
の語を含む記事中に同時に「感染」の語を含む記事が多
いために「感染」も併せて関心事項とみなされたことに
よるものと解釈される。このことから、本実施形態の情
報フィルタリング装置１によれば、ユーザが関心事項の
内容や数を明確に意識しなくても、学習した記事集合に
おける語のつながりを通して自動的に目的となる記事が
抽出されることがわかる。FIG. 8 shows the titles of 25 of the 1225 articles obtained by filtering with the word “import” as the topic. The results show that, in addition to articles about “trade”, a word similar in meaning to the topic “import”, articles about “illness”, a word not necessarily closely related lexically, are relatively prominent. This is interpreted as follows: many of the articles containing the word “import” also contain the word “infection”, so “infection” was also treated as an item of interest. This shows that, with the information filtering device 1 of this embodiment, the target articles are extracted automatically through the connections between words in the learned article set, even if the user is not clearly aware of the content or number of the items of interest.

【００４６】図９及び図１０は、単語「阪神」をトピッ
クにフィルタリングされた記事中において、「プロ野
球」及び「地震」の単語を含む記事の分布を、「正の関
心」の部分空間における第３成分までの値で表したもの
である。この場合、「プロ野球」及び「地震」の単語を
含む記事数は、「阪神」をトピックにフィルタリングさ
れた４６記事中で、各々、“１０６４”及び“１２３”
である。また、図中における各点は、１記事毎に対応し
ており、各成分へのマッピング結果を表している。FIGS. 9 and 10 show the distribution of articles containing the words “professional baseball” and “earthquake”, respectively, among the articles filtered with the word “Hanshin” as the topic, plotted by their values up to the third component in the “positive interest” subspace. The numbers of articles containing “professional baseball” and “earthquake” among the 46 articles filtered with “Hanshin” as the topic are given as “1064” and “123”, respectively. Each point in the figures corresponds to one article and represents the result of mapping it onto each component.

【００４７】図９の分布結果では、第１成分と第２成分
に広がりをもち、特に第３成分に偏った散らばりがみら
れる。一方、図１０の分布結果では、第３成分はすべて
ゼロに近い値をとり、第１及び第２成分の平面で大きく
広がる対照的な分布となっている。従って、記事中にお
ける単語のつながりと部分空間の軸に対応がとられ、こ
のことが、上述のような複数の関心事項のフィルタリン
グを可能にしていることがわかる。In the distribution of FIG. 9, the points spread over the first and second components and show a scatter biased particularly toward the third component. In the distribution of FIG. 10, by contrast, the third component is close to zero for all points, and the points spread widely in the plane of the first and second components. Thus the connections between words in the articles correspond to the axes of the subspace, and this is what enables filtering on multiple items of interest as described above.

【００４８】このように、本実施形態の情報フィルタリ
ング装置１では、文書中における単語の共起関係が導入
され、関心有り及び無しの両面を考慮した類別基準によ
るフィルタリングを行うことにより、従来型のような一
面的な情報フィルタリングと比較して、より精度の高い
選別結果を得ることが可能となる。As described above, the information filtering device 1 of this embodiment introduces the co-occurrence of words in documents and filters according to a classification criterion that considers both “interested” and “not interested”, and can therefore obtain more accurate sorting results than conventional one-sided information filtering.

【００４９】また、作成される相関行列をユーザプロフ
ァイルとして用いることから、ユーザの関心事項の数や
内容に限定されることなく、それらの関連に応じたフィ
ルタリングが単一のユーザプロファイルにより得られる
ようになり、また、ユーザの関心事項の広がりを部分空
間における次元数から知ることができるようになる。In addition, because the created correlation matrix is used as the user profile, filtering that reflects the relationships among a user's items of interest can be obtained with a single user profile, regardless of the number or content of those items, and the breadth of the user's interests can be read from the number of dimensions of the subspace.

【００５０】また、相関行列に対する適応的な学習によ
り、ユーザプロファイルの表現を変えることなくユーザ
の関心事項の変化に対して柔軟に追随できるので、関心
の時間的な変化に対応した自動的なフィルタリングが可
能となる。In addition, adaptive learning of the correlation matrix makes it possible to follow changes in the user's items of interest flexibly without changing the representation of the user profile, so automatic filtering that tracks changes in interest over time becomes possible.

【００５１】また、共用プロファイルを各ユーザプロフ
ァイルにおける相関行列の和から生成し、ユーザ間にお
ける関心事項の共有化を図ることが可能なので、ユーザ
個人に特定して絞り込んだ関心情報に基づいたために取
りこぼした情報や、ユーザに近い関心情報でありなが
ら、当該ユーザの関心情報からは抽出不可能であった情
報に対する補完的なフィルタリングが可能となる。In addition, because a shared profile is generated from the sum of the correlation matrices of the user profiles, items of interest can be shared among users. This enables complementary filtering for information that was missed because filtering relied on interest information narrowed down to an individual user, and for information that is close to a user's interests but could not be extracted from that user's own interest information.

【００５２】また、個々の各ユーザプロファイルの情報
から、サービス提供者等のシステム運用管理者は、ユー
ザ全体における関心の動向が把握可能となり、対象とな
るユーザに応じた、例えば、公告やダイレクトメール等
のダイレクトマーケッティングや、新商品開発のマーケ
ッティングリサーチ用の調査材料となり得る効果があ
る。In addition, from the information in the individual user profiles, a system operation manager such as a service provider can grasp trends in interest across all users. This can serve as material for direct marketing tailored to target users, such as public notices and direct mail, and for marketing research for new product development.

【００５３】さらに、既存の複数の情報提供サービスシ
ステム等と独立して動作するシステムの構築や、既存シ
ステムへの組み込みも容易になる。Further, it is easy to construct a system that operates independently of a plurality of existing information providing service systems and the like and to incorporate the system into an existing system.

【００５４】[0054]

【発明の効果】以上の説明から明らかなように、本発明
によれば、ユーザの関心情報であるユーザプロファイル
が自動的に作成される効果がある。また、情報フィルタ
リングに際して、ユーザの関心を「関心有」と「関心
無」の両面が考慮されるので、フィルタリングの精度を
一定値以上に維持することが可能となる。As is clear from the above description, the present invention has the effect that a user profile, which is the user's interest information, is created automatically. In addition, because both the “interested” and “not interested” sides of the user's interest are considered during information filtering, the filtering accuracy can be kept at or above a certain level.

【００５５】本発明をネットワーク環境下で適用させた
場合には、この情報フィルタリングにより、継続的に流
入する大量の電子化情報群からユーザの関心に基づいて
確実且つタイムリーに必要となる情報の取得が出来るこ
とから、情報の有効活用が促進される。このことから、
アクセス効率及び実用性が格段に向上するシステムの提
供が可能となる。When the present invention is applied in a network environment, this information filtering allows necessary information to be obtained reliably and in a timely manner, based on the user's interests, from the large amount of digitized information that flows in continuously, which promotes effective use of information. As a result, a system with markedly improved access efficiency and practicality can be provided.

[Brief description of the drawings]

【図１】本発明の一実施形態に係る情報フィルタリング
装置の機能ブロック図。FIG. 1 is a functional block diagram of an information filtering device according to an embodiment of the present invention.

【図２】プロファイル作成の処理手順図。FIG. 2 is a processing procedure diagram of profile creation.

【図３】フィルタリングの処理手順図。FIG. 3 is a processing procedure diagram of filtering.

【図４】プロファイル更新時の処理手順図。FIG. 4 is a processing procedure diagram at the time of updating a profile.

【図５】本発明の情報フィルタリング装置によるフィル
タリング方法の手順説明図。FIG. 5 is an explanatory diagram of a procedure of a filtering method by the information filtering device of the present invention.

【図６】本発明の情報フィルタリング装置によるフィル
タリング方法の手順説明図。FIG. 6 is an explanatory diagram of a procedure of a filtering method by the information filtering device of the present invention.

【図７】本発明の情報フィルタリング装置の評価実験デ
ータを示した図表。FIG. 7 is a table showing evaluation experiment data of the information filtering device of the present invention.

【図８】「輸入」のトピックでフィルタリングされた評
価集合のタイトル群。FIG. 8 is a list of titles of an evaluation set filtered by the topic “import”.

【図９】「阪神」のトピックでフィルタリングされた
「プロ野球」の語を含む記事の分布結果。FIG. 9 is a distribution result of articles including the word “professional baseball” filtered on the topic “Hanshin”.

【図１０】「阪神」のトピックでフィルタリングされた
「地震」の語を含む記事の分布結果。FIG. 10 is a distribution result of articles including the word “earthquake” filtered on the topic “Hanshin”.

[Explanation of symbols]

１情報フィルタリング装置１１データ入力部１２ユーザ管理部１３特徴ベクトル抽出部１４次元処理部１５処理選択部１６相関行列作成部１７プロファイル管理部１８フィルタリング処理部１９結果出力部２０プロファイルＤＢ（データベース） Reference Signs List: 1 information filtering device; 11 data input unit; 12 user management unit; 13 feature vector extraction unit; 14 dimension processing unit; 15 processing selection unit; 16 correlation matrix creation unit; 17 profile management unit; 18 filtering processing unit; 19 result output unit; 20 profile DB (database)

Claims

[Claims]

1. An information filtering method comprising the steps of: extracting, from digitized information to which identification information indicating whether or not a user is interested has been attached, a learning vector from which redundant dimensions have been removed, and applying a predetermined subspace classification criterion to the learning vector to create a user profile corresponding to either the “interested” or the “not interested” category; when new digitized information to be sorted is input, extracting a target vector representing the features of the new digitized information, obtaining the feature difference between the target vector and the created user profile by projecting the feature vector onto the subspace corresponding to the user profile, and sorting the new digitized information into either the “interested” or the “not interested” category based on the feature difference; creating an update profile in the same format as the user profile from the sorted digitized information; and updating the user profile using the update profile.

2. The information filtering method according to claim 1, further comprising a step of integrating a plurality of mutually related user profiles to create a shared profile having a sharing relationship with each of the user profiles, wherein the sorting step sorts the new digitized information into either the “interested” or the “not interested” category based on the feature difference from the shared profile or the user profile.

3. An information filtering device comprising: vector processing means for extracting, from the features of digitized information, a vector from which redundant dimensions have been removed; profile creating means for creating a user profile corresponding to either the “interested” or the “not interested” category by applying a predetermined subspace classification criterion to a learning vector extracted by the vector processing means from digitized information to which identification information indicating whether or not the user is interested has been attached; and sorting means for obtaining the feature difference between a target vector extracted by the vector processing means from new digitized information and the user profile by projecting the feature vector onto the subspace corresponding to the user profile, and sorting the new digitized information into either the “interested” or the “not interested” category based on the feature difference, wherein the device is configured to extract a new learning vector from the sorted digitized information and supply it to the profile creating means.

4. The information filtering device according to claim 3, further comprising user management means for managing the user profile for each user, wherein, when the user management means recognizes an initial sorting by a new user, it interactively takes in initial profile setting data for the new user and causes the profile creating means to create the user profile for that new user.

5. The information filtering apparatus according to claim 3, wherein the profile creation means is configured to create a correlation matrix from the extracted learning vectors based on a subspace classification criterion.

6. The information filtering device according to claim 3, wherein the profile creating means is configured to create a correlation matrix from the extracted learning vectors based on adaptive learning conditions of a predetermined averaged learning subspace method.

7. The information filtering device according to claim 3, wherein the vector processing means is configured to remove the redundant dimensions by performing KL analysis based on an orthonormal transform.

8. The information filtering device according to any one of claims 3 to 7, further comprising shared profile processing means for integrating a plurality of mutually related user profiles to create and store a shared profile having a sharing relationship with the user profiles before integration, and for reflecting, when at least one of the user profiles related to the shared profile is updated, that update in the shared profile, wherein the sorting means performs the filtering by selectively using the user profile or the shared profile.

9. The shared profile processing means calculates a distance value focusing on a difference in presence or absence of interest for each of a plurality of user profiles as integration candidates, and determines whether or not to integrate from the sum of the distance values. 9. The information filtering apparatus according to claim 8, wherein the information filtering apparatus is configured as follows.

10. An information filtering system in which the information filtering device according to claim 3 is connected to a communication line, and digitized information distributed through the communication line is taken into the information filtering device.

11. The information filtering system according to claim 9, wherein the information filtering device is configured to filter the digitized information taken in through agent means.