JP2005250693A - Character information classification program

Info

Publication number: JP2005250693A
Application number: JP2004057995A
Authority: JP
Inventors: Yuichiro Taniguchi; 雄一郎谷口
Original assignee: Tsubasa System Co Ltd
Current assignee: Tsubasa System Co Ltd
Priority date: 2004-03-02
Filing date: 2004-03-02
Publication date: 2005-09-15

Abstract

PROBLEM TO BE SOLVED: To provide a classification technique that can classify character information desired by a user without registering the words to be classified.

SOLUTION: Target character information predetermined as a processing target is extracted from character information published on a network. Within the target character information, a character portion that satisfies predetermined conditions is regarded as a keyword. The keyword is extracted and subjected to keyword matching. Depending on whether or not the keyword is present, a keyword vector 104a is generated from the character information. Based on the generated keyword vectors 104a, the similarity between keyword vectors within the character information is calculated. Based on the similarity, the keyword vectors are associated with one another. The result 107a of the association is output.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a character information classification technique.

The volume of electronic information containing character information, typified by web pages and text information, has been increasing. Among such electronic information, web pages are stored on servers installed on networks, including the Internet, and are made available to users.

Web pages carry a wide variety of content, and information that users seek, such as the latest news, is updated continually. Accurately acquiring information from the continually updated content of the many web pages on the Internet has been important to users.

Techniques have been considered for notifying users early, in accordance with their preferences, of the information shown in such continually updated web page content. Prior art relating to the classification of character information includes, for example, the following.
JP 2001-265808 A
JP 2003-248688 A
JP 2002-41544 A

Classifying web page content has required analysis of text structure, such as titles and body text, based on natural language processing. However, the HTML (HyperText Markup Language) data structure (design) of web page content is not uniform across web pages. For this reason, conventional character information classification techniques have had difficulty acquiring information effectively from the content of all web pages.

Furthermore, conventional web page content classification techniques required the classification criteria to be given in advance as keywords. The keywords therefore had to be created beforehand.

In addition, conventional web page content classification techniques generated the keywords used for classifying content with a keyword extraction method called morphological analysis. Morphological analysis uses a dictionary to decompose natural-language sentences in text data into the smallest meaningful units (words), putting them in a form suited to computer processing. Keyword extraction by morphological analysis therefore requires words to be registered in the dictionary, and it has been difficult to analyze abbreviations, new words, proper nouns, loanwords, and other terms that are not registered in the dictionary.

The present invention has been made to solve these problems. One object is to provide a technique for classifying content containing character information that is effective across various forms of information. Another object is to provide a classification technique that can classify content containing character information desired by a user without registering the words to be classified.

To achieve the above objects, the present invention adopts the following means.
That is, the present invention is characterized by: extracting, from character information published on a network, target character information predetermined as a processing target; regarding a character portion of the target character information that satisfies a predetermined condition as a keyword and extracting that keyword; performing keyword matching on the keyword; generating a keyword vector from the character information according to the presence or absence of the keyword; calculating, based on the generated keyword vectors, the mutual similarity of the keyword vectors within the character information; associating the keyword vectors with one another based on the similarity; and outputting the result of the association.

In the present invention, target character information designated as a processing target is extracted from the character information published on a network, and classification processing is performed on it.
Therefore, the present invention can provide a technique for classifying content containing character information that is effective across various forms of information.
In the present invention, the association (clustering) of the character information provided to the user is performed based on keywords extracted from the character information itself.
Therefore, the present invention can provide a classification technique that can classify content containing character information desired by a user without registering the words to be classified.
The present invention is further characterized in that, when the keyword is extracted, the character portion regarded as a keyword is extracted based on the type of character.
The present invention is further characterized in that the keyword matching is performed using a regular expression.
The present invention may be a program that realizes any of the above functions. Such a program may be recorded on a computer-readable storage medium. The present invention may also be a device that realizes any of the above functions.

According to the present invention, a character information classification technique effective across various forms of information can be provided. Also according to the present invention, a classification technique can be provided that classifies character information desired by a user without registering the words to be classified.

Next, the best mode for carrying out the present invention is described with reference to the drawings.
The character information classification device is built using a personal computer (PC), a workstation (WS), a dedicated computer, or the like in order to perform the processing according to the present invention. As hardware (not shown), the device has a processing unit (comprising a CPU, main memory (RAM, etc.), an input/output unit, device drivers, etc.), input devices (keyboard, mouse, etc.), display devices (display, printer, etc.), and storage devices (memory, hard disk, etc.). The storage device of this computer stores RSS (RDF Site Summary) bookmarks, an RSS database, a keyword map, a keyword table, a vector table, a similarity table, a cluster table, and the like.

The RSS bookmark is a list of the URLs of the RSS documents that contain the character information to be acquired. The keyword map stores the keyword sequence obtained by the keyword extraction process described later, in association with all or part of the target character information. The keyword map is structured so that the corresponding keyword sequence can be retrieved using all or part of the target character information as the key.

The keyword table stores all the keywords obtained by the keyword extraction process. Duplicate keywords are not stored in the keyword table. The vector table is a storage area that stores all or part of the target character information in association with the vector generated from it. The similarity table stores pairs of vectors in association with the similarity between them. The cluster table stores information on clusters, each consisting of one or more vectors.

FIG. 1 shows an example of a character information classification device 100 in which the character information classification program according to the present invention has been installed in secondary storage, such as a hard disk device, of a computer. By causing the computer to execute the character information classification program, the character information classification device 100 realizes a target character extraction unit 101, a keyword extraction unit 102, a keyword storage unit 103, a keyword vector generation unit 104, a similarity calculation unit 105, a score recording unit 106, a clustering unit 107, and an output unit 108.

Next, the operation of the character information classification device 100 configured in this way is described.
The target character extraction unit 101 extracts target character information, predetermined as a processing target, from the character information contained in RSS published on the network. The keyword extraction unit 102 regards a character portion of the target character information that satisfies a predetermined condition as a keyword and extracts that keyword. The keyword storage unit 103 stores the keywords extracted by the keyword extraction unit 102 in a database. The keyword vector generation unit 104 performs keyword matching on the keywords and, according to the presence or absence of each keyword, generates a keyword vector 104a from the character information extracted by the keyword extraction unit 102.

Based on the keyword vectors 104a generated by the keyword vector generation unit 104, the similarity calculation unit 105 records the keywords in the RSS in the similarity table 105a, and calculates and scores the similarity based on whether or not the keywords are similar to one another. The similarity table 106 is a detailed view of the similarity table 105a. The clustering unit 107 clusters (associates) the keyword vectors according to how high the similarity calculated by the similarity calculation unit 105 is, and stores the clustering result in the database 107a. The output unit 108 refers to the database 107a and outputs the result via an output device such as a display (not shown).

Next, the character information classification processing performed by the character information classification device 100 is described with reference to a flowchart.
FIG. 2 is a flowchart showing the processing of the character information classification device 100, realized by causing a computer to execute the character information classification program of the present invention.
When the character information classification device 100 starts this processing, the target character extraction unit 101 acquires RSS (RDF Site Summary) from the network and stores it in a storage device 109 such as a hard disk device (S101). The target character extraction unit 101 acquires the RSS published on a WWW (World Wide Web) server based on either the RSS URL (Uniform Resource Locator) that the user of the device 100 specified from the input device 110 or an RSS URL stored in the RSS bookmark. It sends an HTTP (HyperText Transfer Protocol) request to the specified URL and obtains the RSS from the WWW server.

The target character extraction unit 101 then extracts the target character strings contained in the acquired RSS.
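As a rough illustration of this extraction step, the sketch below pulls item titles out of an RSS 1.0 (RDF Site Summary) document that has already been downloaded. Treating the item title as the target character string is an assumption made here (the description later mentions displaying RSS subjects); the function name `extract_titles` and the sample feed are hypothetical and do not come from the patent.

```python
import xml.etree.ElementTree as ET

# Namespace used by RSS 1.0 elements such as <item> and <title>.
RSS10 = "{http://purl.org/rss/1.0/}"

# Hypothetical sample feed, used only to exercise the function.
SAMPLE_RSS = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.com/">
    <title>Example feed</title>
  </channel>
  <item rdf:about="http://example.com/1">
    <title>First article</title>
    <link>http://example.com/1</link>
  </item>
  <item rdf:about="http://example.com/2">
    <title>Second article</title>
  </item>
</rdf:RDF>
"""


def extract_titles(rss_xml: str) -> list[str]:
    """Return the title text of every <item> in an RSS 1.0 document."""
    root = ET.fromstring(rss_xml)
    titles = []
    for item in root.iter(RSS10 + "item"):
        title = item.find(RSS10 + "title")
        if title is not None and title.text:
            titles.append(title.text.strip())
    return titles


if __name__ == "__main__":
    print(extract_titles(SAMPLE_RSS))
```

The channel title is deliberately skipped, since only the per-article strings are classified in this sketch.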

The keyword extraction unit 102 extracts keywords from the target character string according to predetermined conditions (S102). In this embodiment, the keyword extraction conditions are as follows. A portion in which three or more characters of either full-width or half-width alphabetic letters, or of full-width katakana, occur consecutively is extracted as a keyword. In addition, a portion in which two or more kanji occur consecutively is extracted as a keyword.
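The extraction conditions can be sketched with a single regular expression. This is a minimal sketch under stated assumptions, not the patent's implementation: the Unicode ranges are chosen here (the text does not give code points), and letter runs and katakana runs are treated as separate classes, an interpretation chosen because it leaves a two-letter, two-katakana sequence such as 「ＢＲタグ」 unextracted. The names `KEYWORD_RE` and `extract_keywords` are hypothetical.

```python
import re

# Assumed character classes:
#   letters : half-width A-Z a-z, full-width U+FF21-U+FF3A and U+FF41-U+FF5A
#   katakana: U+30A1-U+30FA plus the long-vowel mark U+30FC
#             (the middle dot U+30FB is deliberately excluded)
#   kanji   : CJK Unified Ideographs U+4E00-U+9FFF
KEYWORD_RE = re.compile(
    r"[A-Za-z\uFF21-\uFF3A\uFF41-\uFF5A]{3,}"  # 3+ consecutive letters
    r"|[\u30A1-\u30FA\u30FC]{3,}"              # 3+ consecutive katakana
    r"|[\u4E00-\u9FFF]{2,}"                    # 2+ consecutive kanji
)


def extract_keywords(text: str) -> list[str]:
    """Return every run in `text` that satisfies one of the three conditions."""
    return KEYWORD_RE.findall(text)


if __name__ == "__main__":
    sample = "形態素解析を使う場合、ＨＴＭＬのＢＲタグは・・・"
    print(extract_keywords(sample))  # ['形態素解析', '場合', 'ＨＴＭＬ']
```

No dictionary is consulted, which is the point of the approach: unknown abbreviations and new words are picked up as long as they form a long enough run of one character type.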

FIG. 3 shows an example of keyword extraction by the keyword extraction unit 102 under the above conditions. Suppose that the character string 「形態素解析を使う場合、ＨＴＭＬのＢＲタグは・・・」 ("When using morphological analysis, the HTML BR tag ...") is input to the keyword extraction unit 102. Under the above conditions, the keyword extraction unit 102 extracts 「形態素解析」 ("morphological analysis"), 「場合」 ("case"), and 「ＨＴＭＬ」 as keywords. The keyword extraction unit 102 performs this keyword extraction for each item (in whole or in part) of target character information contained in the RSS.

The keyword storage unit 103 stores all the keywords extracted by the keyword extraction unit 102 in the keyword table created in the storage device (S102).
The keyword vector generation unit 104 generates a keyword vector for all or part of every item of RSS target character information stored in the keyword table (S103). In this embodiment, a keyword vector expresses, in vector form, whether or not all or part of the target character information contains each keyword in the keyword table.

For example, suppose that the keywords extracted from all or part of all the target character information are K1, K2, K3, and K4, and that the character strings to be classified are S1, S2, and S3. Whether each character string contains each of these keywords is then expressed as a value of 0 or 1. Whether a keyword is contained is determined by the keyword matching method described below.

FIG. 4 shows an example of the keyword vector generation method according to this embodiment. In FIG. 4, keyword K1 is 「形態素解析」 ("morphological analysis"), keyword K2 is 「場合」 ("case"), keyword K3 is 「ＨＴＭＬ」, and keyword K4 is 「解析」 ("analysis"). The character strings are the following three: S1 「形態素解析を使う場合、ＨＴＭＬのタグが未知語になる」 ("When morphological analysis is used, HTML tags become unknown words"), S2 「形態素解析とキーワード解析を併用する場合は」 ("When morphological analysis and keyword analysis are used together"), and S3 「ＨＴＭＬタグで改行を表すのはＢＲである」 ("The HTML tag that represents a line break is BR").

With the keyword matching method of this embodiment, the character strings S1, S2, and S3 given above are expressed as the following keyword vectors V1, V2, and V3. Keyword vector V1 is (1, 1, 1, 0) because character string S1 contains keywords K1, K2, and K3. Keyword vector V2 is (1, 1, 0, 0) because character string S2 contains keywords K1 and K2. Keyword vector V3 is (0, 0, 1, 0) because character string S3 contains keyword K3.
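The 0/1 encoding can be illustrated with a short sketch. It assumes plain substring containment as the test, which is a simplification, and it uses only K1 to K3: under plain substring matching K4 「解析」 also occurs inside 「形態素解析」, so its component depends on matching details that the text does not spell out. The function name `keyword_vector` is hypothetical.

```python
def keyword_vector(text: str, keywords: list[str]) -> tuple[int, ...]:
    """Return one 0/1 component per keyword: 1 if the keyword occurs in text."""
    return tuple(1 if keyword in text else 0 for keyword in keywords)


# K1 to K3 and S1 to S3 from the FIG. 4 example (K4 omitted, see above).
KEYWORDS = ["形態素解析", "場合", "ＨＴＭＬ"]
S1 = "形態素解析を使う場合、ＨＴＭＬのタグが未知語になる"
S2 = "形態素解析とキーワード解析を併用する場合は"
S3 = "ＨＴＭＬタグで改行を表すのはＢＲである"

if __name__ == "__main__":
    for name, text in (("V1", S1), ("V2", S2), ("V3", S3)):
        print(name, keyword_vector(text, KEYWORDS))
    # V1 (1, 1, 1)
    # V2 (1, 1, 0)
    # V3 (0, 0, 1)
```

The three components agree with the first three components of V1, V2, and V3 in the example.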

Keyword matching is performed with regular expressions as follows. For example, from a three-character keyword "XYZ" among the keywords extracted from the target character information, the regular expression "X[^Y]*Y[^Z]*Z" is generated. Here the symbol "*" means that the preceding element is repeated zero or more times, and "^Y" means "not Y". Thus the regular expression for the keyword 「公取委」 (an abbreviation of 公正取引委員会, the Fair Trade Commission) is 「公[^取]*取[^委]*委」. Examples of character strings that match this regular expression are 「公取委」, 「公正取引委」, and 「公正取引委員会」. After the regular expression has been generated, keyword matching is performed between the character strings of all or part of the keywords extracted from the target character information and the generated regular expression "X[^Y]*Y[^Z]*Z".
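The pattern construction generalizes from three characters to any keyword length: between each pair of adjacent keyword characters, allow any run of characters other than the next keyword character. The sketch below is one way to write this in Python; the function name `keyword_to_pattern` is hypothetical, and using `search` (a match anywhere in the string) rather than a whole-string match is an assumption made so that 「公正取引委員会」 matches as the text says.

```python
import re


def keyword_to_pattern(keyword: str) -> re.Pattern:
    """Build X[^Y]*Y[^Z]*Z style pattern from the characters of `keyword`."""
    parts = []
    for index, char in enumerate(keyword):
        escaped = re.escape(char)
        if index > 0:
            # Allow any characters other than the upcoming keyword character.
            parts.append(f"[^{escaped}]*")
        parts.append(escaped)
    return re.compile("".join(parts))


if __name__ == "__main__":
    pattern = keyword_to_pattern("公取委")
    print(pattern.pattern)  # 公[^取]*取[^委]*委
    for candidate in ("公取委", "公正取引委", "公正取引委員会", "公園"):
        print(candidate, bool(pattern.search(candidate)))
```

This lets an abbreviation match its expanded form without a dictionary, at the cost of occasional false positives when the keyword characters happen to appear in order in an unrelated string.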

After keyword vectors have been generated for all or part of all the target character information, the similarity calculation unit 105 uses them to obtain the similarity between two keyword vectors and records the result in the similarity table 105a (S104). The similarity is calculated for every pair of generated keyword vectors.

The similarity is obtained with equations (1) and (2) below, from the Hamming distance between the vectors and the number of keywords in each of the two vectors being compared (vector 1 and vector 2).

(1): Hamming distance = (number of keywords in vector 1 but not in vector 2) + (number of keywords in vector 2 but not in vector 1)
(2): Similarity = 1 - Hamming distance / (number of keywords in vector 1 + number of keywords in vector 2)
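Equations (1) and (2) can be checked numerically on the example vectors V1 = (1, 1, 1, 0), V2 = (1, 1, 0, 0), and V3 = (0, 0, 1, 0). The sketch below is illustrative only; the function names are hypothetical, and returning 0.0 when neither vector contains a keyword is an assumption, since the equations leave that case undefined.

```python
def hamming_distance(v1: tuple[int, ...], v2: tuple[int, ...]) -> int:
    """Equation (1): count keywords present in exactly one of the two vectors."""
    return sum(1 for a, b in zip(v1, v2) if a != b)


def similarity(v1: tuple[int, ...], v2: tuple[int, ...]) -> float:
    """Equation (2): 1 - Hamming distance / (keywords in v1 + keywords in v2)."""
    total_keywords = sum(v1) + sum(v2)
    if total_keywords == 0:
        return 0.0  # assumption: two empty vectors are treated as dissimilar
    return 1 - hamming_distance(v1, v2) / total_keywords


V1 = (1, 1, 1, 0)
V2 = (1, 1, 0, 0)
V3 = (0, 0, 1, 0)

if __name__ == "__main__":
    print(similarity(V1, V2))  # 1 - 1/5
    print(similarity(V1, V3))  # 1 - 2/4
    print(similarity(V2, V3))  # 1 - 3/3
```

Identical non-empty vectors give a similarity of 1, and vectors with no keyword in common give 0, so the value always lies between 0 and 1.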

Once the Hamming distance and the similarity have been calculated from equations (1) and (2), the clustering unit 107 uses the similarity to create clusters that gather similar items of target character information (in whole or in part), and records the clusters in the cluster table (S105). In doing so, the clustering unit 107 treats vectors Vi and Vj that satisfy expression (3) below as clustering targets, and forms a primary cluster consisting of the two vectors Vi and Vj that satisfy expression (3).

(3): S(Vi, Vj) > T (i ≠ j; S is the similarity, T is a threshold)

The clustering unit 107 compares the primary clusters with one another and merges clusters that share one or more vectors into a single cluster, forming a secondary cluster (S106). The clustering unit 107 then deletes the primary clusters that were merged.

The clustering unit 107 performs the above clustering process on all vectors. Primary clusters that remain unmerged after this process are treated as secondary clusters as they are. The clustering unit 107 takes the resulting secondary clusters as the final result of the clustering process.
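The two-stage clustering can be sketched as follows: every pair whose similarity exceeds the threshold T becomes a two-element primary cluster, and primary clusters that share a vector are merged into secondary clusters. This is a minimal sketch under assumptions: vectors are identified by integer indices, similarities are supplied as a dictionary keyed by index pair, and merging is repeated until no two clusters share a member (the text does not say whether merging is iterated). The name `cluster_pairs` is hypothetical.

```python
def cluster_pairs(
    similarities: dict[tuple[int, int], float], threshold: float
) -> list[set[int]]:
    """Form primary clusters by expression (3), then merge those sharing a vector."""
    # Primary clusters: pairs (i, j), i != j, with S(Vi, Vj) > T.
    primary = [
        {i, j} for (i, j), score in similarities.items()
        if i != j and score > threshold
    ]

    # Secondary clusters: fold each primary cluster into every existing
    # cluster it overlaps, so the stored clusters stay mutually disjoint.
    secondary: list[set[int]] = []
    for cluster in primary:
        merged = set(cluster)
        remaining = []
        for existing in secondary:
            if existing & merged:
                merged |= existing
            else:
                remaining.append(existing)
        remaining.append(merged)
        secondary = remaining
    return secondary


if __name__ == "__main__":
    # Hypothetical similarity scores for five vectors, indices 0 to 4.
    scores = {(0, 1): 0.8, (1, 2): 0.6, (2, 3): 0.1, (3, 4): 0.9, (0, 4): 0.0}
    print(cluster_pairs(scores, threshold=0.5))
```

With the hypothetical scores shown, vectors 0, 1, and 2 end up in one cluster and vectors 3 and 4 in another. A vector that has no partner above the threshold belongs to no cluster, which follows from primary clusters being pairs.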

The output unit 108 outputs the final result of the clustering process via a display or the like (S107). In this embodiment, the final result is notified to the user by, for example, the following screen output methods provided in so-called agent application software (hereinafter, agent software) that runs resident on the user's computer.

In this embodiment, three timings are conceivable for the clustering process: (1) execution at predetermined intervals, (2) execution at a specified date and time or on a specified day of the week, and (3) execution in response to a user operation. The clustering process may be performed at any of these three timings.

The first screen output method of the agent software according to this embodiment outputs the clustering result, updated at fixed intervals, to a small pop-up window displayed on the computer screen. When the user clicks this pop-up window with a pointing device such as a mouse, detailed information on the clustering result is output on the screen.

When the first output method is used, the output unit 108 of the character information classification device 100 sorts the clusters in descending order of the number of articles each cluster contains. The output unit 108 then outputs them to the pop-up window, starting with the cluster that contains the most articles.
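The ordering step amounts to a descending sort on cluster size. The sketch below assumes each cluster is held as a list of article titles; the function name `rank_clusters` and the sample data are hypothetical.

```python
def rank_clusters(clusters: list[list[str]]) -> list[list[str]]:
    """Return clusters ordered from most to fewest articles (ties keep input order)."""
    return sorted(clusters, key=len, reverse=True)


if __name__ == "__main__":
    clusters = [["a1"], ["b1", "b2", "b3"], ["c1", "c2"]]
    for cluster in rank_clusters(clusters):
        print(len(cluster), cluster)
```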

FIG. 5 shows an example of how the clustering result generated by the character information classification device 100 according to this embodiment is output on the screen by the agent software. On the output screen 10 of the computer, the clustering result is displayed in a pop-up window 11 (1).

When the user clicks, with a pointing device, the article information displayed as text in the pop-up window 11 on the output screen 10 (shown in detail as pop-up window 12 (2)), an article list display 14 (shown in detail as article list display 15) appears together with an agent character 13. Based on the clustering result, the article list display 14 (shown in detail as article list display 15) shows the subjects of the clustered RSS items (all or part of the target character information) (3). When the user then clicks the text of a desired article in the article list display 14, the character information classification device 100 displays the clicked article on the screen.

The second screen output method according to this embodiment does not use an output method such as a pop-up window. Instead, it outputs the clustering result after the clustering process has completed (1) or in response to a user operation.

FIG. 6 is an example of output screens showing the result of RSS clustering by the second screen output method. FIG. 6 shows a clustering result screen 21, in which the clustering results are sorted by the number of items of target character information (in whole or in part) belonging to each cluster, and a clustering result screen 22, which searches the clustering results for target character information containing a keyword specified by the user and displays the cluster to which that information belongs. In this embodiment, when the clustering result is displayed by the first screen output method, it is displayed after a fixed time has elapsed. When it is output by the second screen output method, it is displayed upon completion of the clustering process or in response to a user operation, regardless of the time elapsed since processing.

FIG. 8 is an example of a setting screen for the RSS to be clustered. The setting screen 1 has a text box 2 for entering the URL of an RSS to be added, a button 3 that adds the URL entered in the text box 2 to the URL list, a button 4 that deletes a URL from the URL list, a list box 5 that displays the currently registered URLs, a cancel button 6 that closes the setting screen 1 without applying the changes made in the list box 5, and a button 7 that applies the changes and closes the setting screen 1. In other words, the setting screen 1 is a screen for editing (adding to and deleting from) the URLs recorded in the RSS bookmark. The character information classification device 100 performs the clustering process on the RSS at the URLs that the user set on the setting screen 1.

FIG. 9 is an example of a screen on which the user instructs execution of the clustering process in this embodiment. In case (1), right-clicking the agent character 13 displayed on the screen with a pointing device (not shown) brings up a pop-up menu. The user enters instructions, such as an instruction to run the clustering process, from this pop-up menu.

FIG. 9 also shows another example of a screen through which the user instructs execution of the clustering process in this embodiment. In case (2), the user gives instructions to the apparatus from a menu window that is displayed either when the user clicks the agent character 13 with the pointing device or automatically at a predetermined timing. This menu presents items for the various instructions needed for the clustering process.

FIG. 7 is an example of an output screen 31 showing the result of RSS clustering processing by the second screen output method. When a cluster displayed on the ranking setting screen 21 or the keyword setting screen 22 is clicked with the pointing device, the character information classification device 100 displays the output screen 31. That is, the character information classification device 100 lists the articles belonging to a single cluster on the output screen 31. When the user finds a desired article on the output screen 31 and clicks it, the web browser application already installed on the computer starts, and the browser screen 32 showing the web page of the clicked article appears on the computer's display.

FIG. 1 shows an example of a character information classification device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the processing of the character information classification device, which is realized by causing a computer to execute the character information classification program of the present invention.
FIG. 3 shows an example of keyword extraction by the keyword extraction unit based on the conditions described above.
FIG. 4 shows an example of a keyword vector generation method according to this embodiment.
FIG. 5 shows an example of a method by which agent software outputs to the screen the result of the clustering process generated by the character information classification device.
FIG. 6 is an example of an output screen showing the result of RSS clustering processing by the second screen output method.
FIG. 7 is an example of an output screen showing the result of RSS clustering processing by the second screen output method.
FIG. 8 is an example of the RSS setting screen for clustering.
FIG. 9 is an example of a screen through which the user instructs execution of the clustering process in this embodiment.

Explanation of symbols

1 Setting screen
2 Text box
3 Button
4 Delete button
5 List box
6 Cancel button
7 Button
10 Output screen
11 Pop-up window
13 Agent character
14 Article list display
21 Ranking setting screen
22 Keyword setting screen
31 Clustering output screen
32 Web page browser screen
100 Character information classification device
101 Target character extraction unit
102 Keyword extraction unit
103 Keyword storage unit
104 Keyword vector generation unit
105 Similarity calculation unit
106 Similarity table
107 Clustering unit
108 Output unit
109 Storage device
110 Input device

Claims

1. A character information classification program executable by a computer, the program causing the computer to execute the steps of:
extracting target character information, predetermined as a processing target, from character information published on a network;
regarding a character portion of the target character information that satisfies a predetermined condition as a keyword, and extracting the keyword;
performing keyword matching on the keyword;
generating a keyword vector from the character information according to the presence or absence of the keyword;
calculating, based on the generated keyword vectors, a similarity between the keyword vectors in the character information;
associating the keyword vectors with one another based on the similarity; and
outputting a result of associating the keyword vectors with one another.

2. The character information classification program according to claim 1, wherein,
in the step of extracting the keyword, the character portion regarded as a keyword is extracted based on the type of the character information.

3. The character information classification program according to claim 1, wherein,
in the step of performing keyword matching, the keyword matching is performed using a regular expression.

4. A character information classification device comprising:
a target character extraction unit that extracts target character information, predetermined as a processing target, from character information published on a network;
a keyword extraction unit that regards a character portion of the target character information that satisfies a predetermined condition as a keyword and extracts the keyword;
a keyword vector generation unit that performs keyword matching on the keyword and generates a keyword vector from the character information according to the presence or absence of the keyword;
a similarity calculation unit that calculates, based on the generated keyword vectors, a similarity between the keyword vectors in the character information;
a clustering unit that associates the keyword vectors with one another based on the similarity; and
an output unit that outputs a result of associating the keyword vectors with one another.

5. The character information classification device according to claim 4, wherein
the keyword extraction unit extracts the character portion regarded as a keyword based on the type of the character information.

6. The character information classification device according to claim 4, wherein
the keyword vector generation unit performs the keyword matching using a regular expression.

7. A character information classification method in which a computer executes the steps of:
extracting target character information, predetermined as a processing target, from character information published on a network;
regarding a character portion of the target character information that satisfies a predetermined condition as a keyword, and extracting the keyword;
performing keyword matching on the keyword;
generating a keyword vector from the character information according to the presence or absence of the keyword;
calculating, based on the generated keyword vectors, a similarity between the keyword vectors in the character information;
associating the keyword vectors with one another based on the similarity; and
outputting a result of associating the keyword vectors with one another.

8. The character information classification method according to claim 7, wherein,
in the step of extracting the keyword, the character portion regarded as a keyword is extracted based on the type of the character information.

9. The character information classification method according to claim 7, wherein,
in the step of performing keyword matching, the keyword matching is performed using a regular expression.
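
For illustration only, the claimed sequence of keyword matching, keyword vector generation, similarity calculation and association can be sketched in Python as follows. The claims do not fix a similarity measure or an association rule, so cosine similarity and threshold-based grouping are assumptions here, as are all function names, the threshold value and the sample titles. The keyword list is hard-coded, so the keyword extraction step is not shown, and each keyword is matched as a literal pattern through the regular expression engine.

```python
import math
import re


def keyword_vector(text, keywords):
    """Binary vector: 1 if the keyword is found in the text by regex search, else 0."""
    return [1 if re.search(re.escape(k), text, re.IGNORECASE) else 0 for k in keywords]


def cosine(u, v):
    """Cosine similarity (an assumed measure); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def associate(vectors, threshold):
    """Group vector indices; any pair with similarity >= threshold lands in the same group."""
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the group representative, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up article titles and keywords for illustration only.
titles = [
    "Typhoon hits Kyushu",
    "Kyushu braces for typhoon damage",
    "New phone goes on sale",
]
keywords = ["typhoon", "Kyushu", "phone"]
vectors = [keyword_vector(t, keywords) for t in titles]
groups = associate(vectors, threshold=0.5)
```

In this example the first two titles share the keywords "typhoon" and "Kyushu" and are grouped together, while the third title forms its own group.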