JP6136794B2

JP6136794B2 - Information processing method, program, and information processing apparatus

Info

Publication number: JP6136794B2
Application number: JP2013189779A
Authority: JP
Inventors: 忠延古川; 太田　唯子; 井形　伸之
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2013-09-12
Filing date: 2013-09-12
Publication date: 2017-05-31
Anticipated expiration: 2033-09-12
Also published as: JP2015056066A

Description

The present invention relates to an information processing method, a program, and an information processing apparatus that use a computer.

Conventionally, relationship extraction devices and the like have been proposed that calculate a plurality of types of similarity for an arbitrary word pair in a text and generate a feature vector having each similarity as an element (see, for example, Patent Document 1).

Patent Document 1: JP 2011-118526 A

However, the conventional technique has the problem that the amount of calculation required for information analysis is enormous.

In one aspect, an object of the present invention is to provide an information processing method and the like that can reduce the amount of computation for the similarity based on terms and the number of terms, which is computationally expensive.

The information processing method disclosed in the present application is an information processing method using a computer that refers to categories associated with users, calculates the number of combinations between categories, calculates a first similarity between categories based on the calculated number of combinations, extracts combinations of categories whose first similarity exceeds a threshold, and calculates a second similarity between the extracted categories based on the terms and the number of terms corresponding to each extracted category.

In one aspect, the amount of calculation can be reduced.

It is an explanatory diagram showing an outline of the information processing system.
It is a block diagram showing the hardware group of the computer.
It is an explanatory diagram showing an example of blog articles.
It is an explanatory diagram showing the record layout of the category list file.
It is an explanatory diagram showing the record layout of the user blog article list file.
It is an explanatory diagram showing the record layout of the user category list file.
It is an explanatory diagram showing the record layout of the user article list file.
It is an explanatory diagram showing the record layout of the category phrase list file.
It is an explanatory diagram showing the record layout of the category article list file.
It is an explanatory diagram showing the category co-occurrence file.
It is an explanatory diagram showing the record layout of the category co-occurrence similarity file.
It is an explanatory diagram showing the content similarity file.
It is an explanatory diagram showing the record layout of the category article list file after deletion.
It is a flowchart showing the procedure for generating the category article list file.
It is a flowchart showing the procedure for generating the category co-occurrence file.
It is a flowchart showing the procedure of the first similarity calculation process.
It is a flowchart showing the procedure of the second similarity calculation process.
It is a flowchart showing the procedure of the second similarity calculation process.
It is a flowchart showing the procedure of the deletion process.
It is a functional block diagram showing the operation of the computer of the embodiment described above.
It is a block diagram showing the hardware group of a computer according to Embodiment 2.

Embodiment 1
Hereinafter, embodiments will be described with reference to the drawings. FIG. 1 is an explanatory diagram showing an outline of an information processing system. The information processing system includes an information processing apparatus 1, a server computer 2, and the like. The information processing apparatus 1 and the server computer 2 are connected via a communication network N such as the Internet. The information processing apparatus 1 is, for example, a personal computer, a server computer, a mobile phone, or a PDA (Personal Digital Assistant). In the following description, the information processing apparatus 1 is referred to as the computer 1. The server computer 2 stores texts containing a plurality of words, such as articles, song lyrics, tweets, product descriptions, store descriptions, or academic papers, together with the category assigned to each text. The server computer 2 receives texts and categories from other computers (not shown) and stores the received texts and categories.

The following description uses an example in which a plurality of users assign categories to blog articles and upload them to the server computer 2. The computer 1 periodically accesses the server computer 2 and downloads each user's blog articles and categories. The computer 1 performs information processing on the downloaded articles and categories. Details are described below.

FIG. 2 is a block diagram showing the hardware group of the computer 1. The computer 1 includes a CPU (Central Processing Unit) 11 as a control unit, a RAM (Random Access Memory) 12, an input unit 13, a display unit 14, a storage unit 15, a communication unit 16, and the like. The CPU 11 is connected to each hardware unit via a bus 17. The CPU 11 controls each hardware unit according to a control program 15P stored in the storage unit 15. The RAM 12 is, for example, an SRAM (Static RAM), a DRAM (Dynamic RAM), or a flash memory. The RAM 12 also functions as a storage unit and temporarily stores various data generated when the CPU 11 executes various programs.

The input unit 13 is an input device such as a mouse, a keyboard, or a touch panel, and outputs received operation information to the CPU 11. The display unit 14 is a liquid crystal display, an organic EL (electroluminescence) display, or the like, and displays various information according to instructions from the CPU 11. The communication unit 16 is a communication module, and transmits and receives information to and from the server computer 2 via the communication network N.

The storage unit 15 is a hard disk or a large-capacity memory. In addition to the control program 15P described above, it stores a category list file 151, a user blog article list file 152, a user category list file 153, a user article list file 154, and the like. The storage unit 15 also stores a category phrase list file 155, a category article list file 156, a category co-occurrence file 157, a category co-occurrence similarity file 158, a content similarity file 159, and the like. Although this embodiment describes an example in which the category list file 151 and the other files are stored in the storage unit 15 of the computer 1, the present invention is not limited to this. For example, the various files described above may be stored as appropriate in a database server (not shown) connected via the communication network N. In this case, the CPU 11 accesses the database server as necessary to write and read data.

FIG. 3 is an explanatory diagram showing an example of blog articles. The example of FIG. 3 shows the blog articles of user A, and four blog articles are listed. A001 is identification information for specifying a user's blog article (hereinafter referred to as an article ID). Each blog article is labeled with a category. For example, the article "I ate ramen" with article ID A001 is assigned the category "gourmet", and the article "I bought clothes in Shibuya" with article ID A002 is assigned the category "fashion". In this embodiment, the article ID also carries information for identifying the user. For example, the article ID "A001" denotes an article of user A, and the article ID "B001" denotes an article of user B.

FIG. 4 is an explanatory diagram showing the record layout of the category list file 151. As shown in FIG. 4, the categories to be analyzed are stored in the category list file 151 in advance. In the example of FIG. 4, a total of 10 categories such as gourmet, alcohol, and fashion are stored. Although this embodiment describes an example in which 10 categories are analyzed, this is merely an example and does not limit the number of categories or their contents.

FIG. 5 is an explanatory diagram showing the record layout of the user blog article list file 152. The user blog article list file 152 includes a user field, an article ID field, a category field, a body field, and the like. The user field stores user information. The article ID field stores the article ID for specifying the user's blog article. When the CPU 11 downloads a blog article from the server computer 2, it generates a new article ID and stores the user, the article ID, the category, and the body in the user blog article list file 152.

The category field stores the assigned category in association with the article ID. The body field stores the body of the blog article in text format in association with the article ID. In the example of FIG. 5, user A's article "I ate ramen" with article ID A001 is stored under the category "gourmet", and user B's article "I drank beer" with article ID B001 is stored under the category "alcohol". Although this embodiment stores the body of each blog article, the title of the blog article may be stored and analyzed instead.

FIG. 6 is an explanatory diagram showing the record layout of the user category list file 153. The user category list file 153 includes a user field, a category field, and the like. The category field stores categories in association with the user. The CPU 11 refers to the user blog article list file 152 and extracts the categories of each user. The CPU 11 stores the extracted categories of each user in the user category list file 153. In the example of FIG. 6, the categories of user A are gourmet, fashion, and alcohol, while the only category of user B is alcohol.
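The derivation of the user category list from the article records can be sketched as follows. This is a minimal illustration rather than the implementation described in the patent: the tuple layout of `articles` and the function name `build_user_categories` are assumptions, and only the three records quoted from FIG. 3 and FIG. 5 are used as data.

```python
from collections import defaultdict

# Records shaped like the user blog article list file 152:
# (user, article ID, category, body). This tuple layout is assumed.
articles = [
    ("A", "A001", "gourmet", "I ate ramen"),
    ("A", "A002", "fashion", "I bought clothes in Shibuya"),
    ("B", "B001", "alcohol", "I drank beer"),
]

def build_user_categories(records):
    """Collect the distinct categories each user has posted under."""
    user_categories = defaultdict(set)
    for user, _article_id, category, _body in records:
        user_categories[user].add(category)
    return dict(user_categories)

user_categories = build_user_categories(articles)
```

Using a set per user means a category is recorded once however many articles the user wrote in it, which is what the later co-occurrence count relies on.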

FIG. 7 is an explanatory diagram showing the record layout of the user article list file 154. The user article list file 154 includes a user field, an article ID field, a category field, a body field, a phrase field, and the like. The phrase field stores the phrases (terms) extracted from the body of the blog article. The CPU 11 performs known morphological analysis on the body and extracts phrases. The CPU 11 stores the extracted phrases in association with the body. In the example of FIG. 7, the phrases "ramen" and "eat" are extracted from "I ate ramen". Although this embodiment uses morphological analysis, phrases may be extracted by other methods. For example, a dictionary may be prepared in the storage unit in advance, and phrases in the body that match phrases in the dictionary may be extracted.
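The dictionary-based alternative mentioned above can be sketched with plain substring matching. The dictionary contents and the name `extract_phrases` are hypothetical. Mapping an inflected surface form to its base form is also an assumption, made so that the result matches the FIG. 7 example; a real morphological analyzer would handle inflection itself.

```python
# Hypothetical dictionary: surface form found in the text -> phrase to record.
PHRASE_DICTIONARY = {
    "ラーメン": "ラーメン",  # "ramen"
    "食べた": "食べる",      # "ate" -> base form "eat"
    "ビール": "ビール",      # "beer"
}

def extract_phrases(body, dictionary=PHRASE_DICTIONARY):
    """Return the dictionary phrases whose surface form occurs in the body."""
    return [phrase for surface, phrase in dictionary.items() if surface in body]

phrases = extract_phrases("ラーメンを食べた")  # "I ate ramen"
```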

FIG. 8 is an explanatory diagram showing the record layout of the category phrase list file 155. The category phrase list file 155 includes a category field, a phrase x appearance frequency (number of terms) field, and the like. The phrase x appearance frequency field stores, in association with each category, the phrases belonging to the category and the appearance frequency of each phrase in the blog article bodies. The CPU 11 refers to the category field and the phrase field of the user article list file 154, extracts the phrases used within each category, and counts the appearance frequency of each phrase within the category. In the example of FIG. 8, for the category "gourmet", "ramen" has an appearance frequency of 300 and "eat" has an appearance frequency of 900. For the category "alcohol", "eat" has an appearance frequency of 300. Thus, some phrases, such as "drink" and "eat", appear in more than one category.
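The per-category counting described above can be sketched with `collections.Counter`. The input rows below are made-up illustration data, not values from FIG. 8, and the function name is an assumption.

```python
from collections import Counter, defaultdict

# Rows reduced from the user article list file 154 to (category, phrases).
# These rows are invented for illustration only.
rows = [
    ("gourmet", ["ramen", "eat"]),
    ("gourmet", ["eat", "delicious"]),
    ("alcohol", ["beer", "drink"]),
    ("alcohol", ["eat", "drink"]),
]

def count_phrases_by_category(rows):
    """Count how often each phrase appears within each category."""
    counts = defaultdict(Counter)
    for category, phrases in rows:
        counts[category].update(phrases)
    return counts

category_phrases = count_phrases_by_category(rows)
```

A `Counter` returns 0 for a phrase it has never seen, which is convenient later when two categories are compared over a shared vocabulary.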

FIG. 9 is an explanatory diagram showing the record layout of the category article list file 156. The category article list file 156 includes a category field, an article list field, and the like. The article list field stores the article IDs of each user in association with the category. The CPU 11 refers to the user article list file 154 and extracts the article IDs of each user belonging to each category. The CPU 11 stores the user name and the extracted article IDs in the category article list file 156 in association with the category. In the example of FIG. 9, two articles of user A, specified by the article IDs "A001" and "A004", are stored for the category "gourmet". Likewise, two articles of user A, specified by the article IDs "A002" and "A003", are stored for the category "fashion". Although this embodiment stores only the article ID, when the article ID does not carry information for identifying the user, the article ID may be stored together with the user information.

FIG. 10 is an explanatory diagram showing the category co-occurrence file 157. The category co-occurrence file 157 stores the number of co-occurrences between two categories. Two categories are judged to co-occur when a single user has written articles in both of them. The CPU 11 refers to the user category list file 153 and reads the categories of a specific user. When the CPU 11 detects a plurality of categories, it increments the co-occurrence count for every combination of those categories. The CPU 11 counts the category combinations for all other users as well and stores them in the category co-occurrence file 157. In the example of FIGS. 6 and 10, user A has three categories: "gourmet", "fashion", and "alcohol". The CPU 11 detects the combination of "gourmet" and "fashion" and increments the count for the combination of "gourmet" and "fashion" in the category co-occurrence file 157.

Similarly, the CPU 11 detects the combination of "gourmet" and "alcohol" and increments the count for that combination in the category co-occurrence file 157. The CPU 11 also detects the combination of "fashion" and "alcohol" and increments the count for that combination in the category co-occurrence file 157. Subsequently, the CPU 11 reads the category list of user B from the user category list file 153. Since user B has only the category "alcohol", the number of co-occurrences is zero. The above processing is performed for all users to be analyzed.

In the example of FIG. 10, 30 users have written in both the "gourmet" and "alcohol" categories, and 100 users have written in both the "gourmet" and "fashion" categories. Combinations that differ only in the order of the categories, such as "gourmet" and "alcohol" versus "alcohol" and "gourmet", are treated as the same combination. Although this embodiment refers to the user category list file 153, the category co-occurrence file 157 may instead be generated by referring to the user article list file 154.
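The co-occurrence counting can be sketched with `itertools.combinations`. The input mirrors the FIG. 6 example for users A and B; the dictionary layout and the function name are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# User -> categories, as in the user category list file 153 (FIG. 6 example).
user_categories = {
    "A": {"gourmet", "fashion", "alcohol"},
    "B": {"alcohol"},
}

def count_cooccurrences(user_categories):
    """Count, per unordered category pair, the users who posted in both."""
    counts = Counter()
    for categories in user_categories.values():
        # Sorting makes (x, y) and (y, x) map to the same key.
        # A user with a single category yields no pairs at all.
        for pair in combinations(sorted(categories), 2):
            counts[pair] += 1
    return counts

cooccurrence = count_cooccurrences(user_categories)
```

User A contributes one count to each of the three pairs, and user B contributes nothing, matching the description above.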

FIG. 11 is an explanatory diagram showing the record layout of the category co-occurrence similarity file 158. The category co-occurrence similarity file 158 stores the first similarity between two categories. For example, the first similarity between the categories "gourmet" and "alcohol" is 0.99. The procedure for calculating the first similarity is described here with "gourmet" as the first category and "alcohol" as the second category. The CPU 11 refers to the category co-occurrence file 157 and reads the number of combinations between the first category "gourmet" and each category other than the second category "alcohol". In the example of FIG. 10, the values read are 100 for fashion, 12 for car, 70 for music, 45 for movie, 40 for anime, 25 for game, 15 for baseball, and 15 for soccer.

Similarly, the CPU 11 refers to the category co-occurrence file 157 and reads the number of combinations between the second category "alcohol" and each category other than the first category "gourmet". In the example of FIG. 10, the values read are 45 for fashion, 8 for car, 50 for music, 20 for movie, 24 for anime, 20 for game, 10 for baseball, and 5 for soccer. The CPU 11 sets the vector of the first category "gourmet" to {100, 12, 70, 45, 40, 25, 15, 15} and the vector of the second category "alcohol" to {45, 8, 50, 20, 24, 20, 10, 5}. The columns of both the first vector and the second vector correspond to the other categories {fashion, car, music, movie, anime, game, baseball, soccer}. Although this embodiment obtains each vector from the number of combinations between the first category and the categories other than the second category, the present invention is not limited to this. The vector may be obtained from the number of combinations between the first category and other categories including the second category. It is also not necessary to use the combination counts of all the other categories; the counts of only some of the other categories may be used.
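Building the two vectors from pairwise counts can be sketched as below. The counts are the ones quoted for FIG. 10; storing them under `frozenset` keys and the helper name `cooccurrence_vector` are assumptions made for this illustration.

```python
OTHERS = ["fashion", "car", "music", "movie", "anime", "game", "baseball", "soccer"]
GOURMET_COUNTS = [100, 12, 70, 45, 40, 25, 15, 15]
ALCOHOL_COUNTS = [45, 8, 50, 20, 24, 20, 10, 5]

# Unordered pair -> number of users who wrote in both categories.
cooccurrence = {frozenset(("gourmet", "alcohol")): 30}
for other, g, a in zip(OTHERS, GOURMET_COUNTS, ALCOHOL_COUNTS):
    cooccurrence[frozenset(("gourmet", other))] = g
    cooccurrence[frozenset(("alcohol", other))] = a

def cooccurrence_vector(category, partner, all_categories, counts):
    """Counts between `category` and every category except itself and `partner`."""
    return [
        counts.get(frozenset((category, other)), 0)
        for other in all_categories
        if other not in (category, partner)
    ]

ALL_CATEGORIES = ["gourmet", "alcohol"] + OTHERS
gourmet_vec = cooccurrence_vector("gourmet", "alcohol", ALL_CATEGORIES, cooccurrence)
alcohol_vec = cooccurrence_vector("alcohol", "gourmet", ALL_CATEGORIES, cooccurrence)
```

Iterating over the same `ALL_CATEGORIES` list for both calls keeps the columns of the two vectors aligned, which the similarity calculation requires.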

The CPU 11 calculates the first similarity from the vector of the first category and the vector of the second category. The first similarity may be calculated using cosine similarity, an inner product, a correlation function, or the like. For example, the first similarity may be calculated based on the sum of the squared differences between the vectors. This embodiment uses cosine similarity as an example. The cosine similarity is calculated by the following Formula 1, which is stored in the storage unit 15.

(Formula 1)
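The formula itself is not reproduced in this text. Assuming Formula 1 is the standard cosine similarity between the two category vectors, with V the number of columns, it can be written as follows. The symbols x_i and y_i for the i-th elements of the first and second category vectors are placeholders chosen here, not notation taken from the source.

```latex
\cos(\mathbf{x}, \mathbf{y})
  = \frac{\sum_{i=1}^{V} x_i \, y_i}
         {\sqrt{\sum_{i=1}^{V} x_i^{2}} \; \sqrt{\sum_{i=1}^{V} y_i^{2}}}
```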

Here, V is the number of columns, which is 8 in the example described above. The CPU 11 substitutes the vector of the first category and the vector of the second category into Formula 1 and calculates a first similarity of 0.99. The CPU 11 performs the same processing for the other combinations, calculates the first similarity for each combination of categories, and stores the calculated first similarities in the category co-occurrence similarity file 158. For example, the first similarity between "gourmet" and "fashion" is low, at 0.57. Note that the numerical values shown in FIG. 11 are arbitrary values chosen for ease of explanation.
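A minimal sketch of the cosine similarity calculation, applied to the two vectors quoted above, might look like this. The function name and the handling of all-zero vectors are choices made for this illustration.

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity of two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of columns")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_x * norm_y)

gourmet = [100, 12, 70, 45, 40, 25, 15, 15]
alcohol = [45, 8, 50, 20, 24, 20, 10, 5]
first_similarity = cosine_similarity(gourmet, alcohol)
```

For these two vectors the function evaluates to roughly 0.97. Since the values in FIG. 11 are stated to be for ease of explanation, an exact match with the quoted 0.99 should not be expected.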

Next, the CPU 11 reads the first threshold stored in the storage unit 15. The CPU 11 refers to the category co-occurrence similarity file 158 and extracts the combinations of categories whose first similarity exceeds the first threshold. The user can set the first threshold to an appropriate value via the input unit 13. In this embodiment, the first threshold is 0.74 as an example. The CPU 11 extracts four combinations as exceeding the first threshold: "gourmet" and "alcohol", "music" and "movie", "anime" and "game", and "baseball" and "soccer".
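The threshold step can be sketched as a simple filter. Only the two first-similarity values quoted in the text (0.99 and 0.57) are used as data; the dictionary layout and the name `pairs_above` are assumptions.

```python
FIRST_THRESHOLD = 0.74  # example value given in the text

# Category pair -> first similarity (values quoted in the text for FIG. 11).
first_similarities = {
    ("gourmet", "alcohol"): 0.99,
    ("gourmet", "fashion"): 0.57,
}

def pairs_above(similarities, threshold):
    """Keep the category pairs whose similarity strictly exceeds the threshold."""
    return [pair for pair, score in similarities.items() if score > threshold]

candidates = pairs_above(first_similarities, FIRST_THRESHOLD)
```

Only the pairs that survive this inexpensive filter go on to the phrase-based second similarity, which is where the reduction in computation comes from.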

FIG. 12 is an explanatory diagram showing the content similarity file 159. The CPU 11 calculates a content similarity (hereinafter referred to as the second similarity) for each extracted combination of categories. The CPU 11 stores the calculated second similarity in the content similarity file 159 in association with the combination of categories. The procedure for calculating the second similarity is as follows. The CPU 11 reads the phrases and appearance frequencies of one extracted category from the category phrase list file 155. For example, for the category "gourmet", the values read are 300 for "ramen", 900 for "eat", 600 for "delicious", 200 for "wine", 60 for "drink", 45 for "beer", and so on. The CPU 11 generates the vector of the one category. In the above example, the vector of the one category is {300, 900, 600, 200, 60, 45, ...}. The columns of the vector are determined in advance and are {ramen, eat, delicious, wine, drink, beer, ...}.

The CPU 11 similarly reads the phrases and appearance frequencies of the other extracted category from the category phrase list file 155. For example, for the category "alcohol", the values read are 320 for "beer", 600 for "drink", 300 for "eat", 400 for "delicious", 280 for "wine", 80 for "ramen", and so on. The CPU 11 generates the vector of the other category. In the above example, the vector of the other category is {320, 600, 300, 400, 280, 80, ...}. The CPU 11 substitutes the vector of the one category "gourmet" and the vector of the other category "alcohol" into Formula 1 and calculates the second similarity based on the phrases and appearance frequencies of the first category and the second category. Although the second similarity is calculated here using Formula 1, it may be calculated by another method.
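A sketch of the second similarity follows, assuming cosine similarity over phrase counts. Because the two categories list their phrases in different orders, this sketch first places both on a shared vocabulary so that each column refers to the same phrase. The phrase counts are the six values quoted for each category (the lists in the text are truncated), and the function name is an assumption.

```python
import math

# Phrase counts quoted in the text for the two categories (truncated lists).
gourmet = {"ramen": 300, "eat": 900, "delicious": 600,
           "wine": 200, "drink": 60, "beer": 45}
alcohol = {"beer": 320, "drink": 600, "eat": 300,
           "delicious": 400, "wine": 280, "ramen": 80}

def content_similarity(counts_a, counts_b):
    """Cosine similarity of two phrase-count maps over a shared column order."""
    vocabulary = sorted(set(counts_a) | set(counts_b))
    x = [counts_a.get(phrase, 0) for phrase in vocabulary]
    y = [counts_b.get(phrase, 0) for phrase in vocabulary]
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

second_similarity = content_similarity(gourmet, alcohol)
```

With only the six quoted phrases, the result is approximately 0.63. The value in FIG. 12 covers the full phrase lists, which are elided in the text, so the two numbers are not expected to agree.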

In the example of FIG. 12, the second similarity between "gourmet" and "alcohol" is 0.8. The second similarity between "music" and "movie" is 0.3, the second similarity between "anime" and "game" is 0.7, and the second similarity between "baseball" and "soccer" is 0.2. Subsequently, the CPU 11 reads the second threshold from the storage unit 15. In this embodiment, the second threshold is set to the same value as the first threshold, but it may be a different value. The CPU 11 extracts the combinations of categories whose second similarity exceeds the second threshold. In the example of FIG. 12, two combinations are extracted: "gourmet" and "alcohol", and "anime" and "game".

The CPU 11 refers to the category article list file 156 and deletes the information of users who do not appear in both categories of an extracted combination. FIG. 13 is an explanatory diagram showing the record layout of the category article list file 156 after deletion. The CPU 11 reads the article IDs of the users in one extracted category. The CPU 11 also reads the article IDs of the users in the other extracted category. The CPU 11 refers to the user information attached to the article IDs and deletes the article IDs of users who do not appear in both categories. In the example of FIG. 9, user H, who has written "gourmet" articles, has not written any "alcohol" articles, so the article IDs of user H's articles (H001, H005) are deleted. Similarly, the article of user B is deleted from the category "alcohol". On the other hand, the articles of user A and user C are stored in both the "gourmet" and "alcohol" categories, so they are not deleted. The CPU 11 performs the same processing for the other combination, "anime" and "game". The articles of users E and H are not deleted, and the articles of users F and H in the category "game" are deleted.
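The deletion step can be sketched as a set intersection over users. The user is read from the leading letter of the article ID, as in this embodiment. The IDs A001, A004, H001, H005, and B001 come from the text; the remaining IDs and both function names are made up for illustration.

```python
# Category -> article IDs, shaped like the category article list file 156.
category_articles = {
    "gourmet": ["A001", "A004", "C002", "H001", "H005"],
    "alcohol": ["A005", "B001", "C001"],
}

def user_of(article_id):
    """The user is encoded as the first character of the article ID."""
    return article_id[0]

def keep_shared_users(articles, cat_a, cat_b):
    """Drop articles written by users who do not appear in both categories."""
    users_a = {user_of(i) for i in articles[cat_a]}
    users_b = {user_of(i) for i in articles[cat_b]}
    shared = users_a & users_b
    result = dict(articles)
    for category in (cat_a, cat_b):
        result[category] = [i for i in articles[category] if user_of(i) in shared]
    return result

filtered = keep_shared_users(category_articles, "gourmet", "alcohol")
```

The function returns a new mapping and leaves the input untouched, so the same source list can be filtered again for a different category pair.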

The various processes performed by the above hardware group are described below with reference to flowcharts. FIG. 14 is a flowchart showing the procedure for generating the category article list file 156. The CPU 11 downloads blog articles from the server computer 2 (step S141). When a new blog article exists, the CPU 11 generates an article ID (step S142). The CPU 11 extracts the user name, the category, and the body from the blog article. The CPU 11 stores the user name, the article ID, the category, and the body in the user blog article list file 152 (step S143). The CPU 11 extracts the categories of each user and stores the extracted categories for each user, thereby generating the user category list file 153 (step S144).

The CPU 11 refers to the user blog article list file 152 and extracts the phrases in the body text by morphological analysis (step S145). The CPU 11 stores the extracted phrases in the user article list file 154 in association with the article ID (step S146). The CPU 11 refers to the user article list file 154, extracts the phrases for each category, counts the appearance frequency of each phrase, and stores the phrases and their appearance frequencies in the category phrase list file 155 (step S147). The CPU 11 refers to the user blog article list file 152 and stores the article IDs of each user, category by category, in the category article list file 156 (step S148).
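Steps S145 to S148 can be sketched roughly as follows. This is a minimal illustration only, not the implementation described in the specification: the in-memory dictionaries stand in for files 155 and 156, the function names and the sample articles are hypothetical, and whitespace splitting is assumed in place of real morphological analysis.

```python
from collections import Counter, defaultdict


def build_category_term_counts(articles, tokenize=str.split):
    """Count phrase frequencies per category (stand-in for file 155).

    `articles` is an iterable of (user, article_id, category, text).
    `tokenize` is an assumption: a real system would use a
    morphological analyzer instead of whitespace splitting.
    """
    counts = defaultdict(Counter)
    for _user, _article_id, category, text in articles:
        counts[category].update(tokenize(text))
    return counts


def build_category_article_list(articles):
    """Group article IDs by category and user (stand-in for file 156)."""
    listing = defaultdict(lambda: defaultdict(list))
    for user, article_id, category, _text in articles:
        listing[category][user].append(article_id)
    return listing


# Hypothetical sample data, invented for illustration.
articles = [
    ("A", "A001", "gourmet", "ramen beer ramen"),
    ("A", "A002", "alcohol", "beer wine"),
    ("B", "B001", "alcohol", "wine"),
]
term_counts = build_category_term_counts(articles)
article_list = build_category_article_list(articles)
```

Keeping the tokenizer as a parameter makes it straightforward to swap in a proper morphological analyzer without touching the counting logic.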

FIG. 15 is a flowchart showing the procedure for generating the category co-occurrence file 157. The CPU 11 reads the categories corresponding to a user from the user category list file 153 (step S151). The CPU 11 determines whether a plurality of categories has been read (step S152). When the CPU 11 determines that a plurality exists (YES in step S152), the process proceeds to step S153. The CPU 11 extracts the combinations of the plurality of categories (step S153). In the example of user A in FIG. 6, three combinations are extracted: “gourmet” and “fashion”, “gourmet” and “alcohol”, and “fashion” and “alcohol”.

For each extracted combination of categories, the CPU 11 increments the number of combinations for the corresponding category pair in the category co-occurrence file 157 (step S154). In the example of user A in FIG. 6, the counts in the category co-occurrence file 157 corresponding to “gourmet” and “fashion”, “gourmet” and “alcohol”, and “fashion” and “alcohol” are each incremented. When the CPU 11 determines that a plurality of categories does not exist (NO in step S152), the CPU 11 skips steps S153 and S154 and proceeds to step S155.

The CPU 11 determines whether the processing described above has been completed for all users (step S155). When the CPU 11 determines that the processing has not been completed (NO in step S155), the process proceeds to step S156. The CPU 11 reads the categories of another user from the user category list file 153 (step S156). The process then returns to step S152. By repeating the above processing, the category co-occurrence file 157, based on the combinations of each user's categories, is completed. When the CPU 11 determines that the processing has been completed for all users (YES in step S155), the series of processes ends.
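The loop of steps S151 to S156 amounts to counting, over all users, how often each pair of categories appears together. The sketch below is one possible reading under stated assumptions: a `Counter` keyed by alphabetically sorted category pairs stands in for the category co-occurrence file 157, and the function name and the sample data for users B and C are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_category_cooccurrence(user_categories):
    """Count, per category pair, the number of users who have both.

    `user_categories` maps a user to that user's list of categories
    (stand-in for file 153). Pairs are stored as alphabetically
    sorted tuples so that each unordered pair has exactly one key.
    """
    cooccurrence = Counter()
    for categories in user_categories.values():
        unique = sorted(set(categories))
        if len(unique) < 2:
            # Corresponds to NO in step S152: nothing to combine.
            continue
        for pair in combinations(unique, 2):
            cooccurrence[pair] += 1  # step S154
    return cooccurrence


# User A follows the FIG. 6 example; users B and C are invented.
user_categories = {
    "A": ["gourmet", "fashion", "alcohol"],
    "B": ["alcohol"],
    "C": ["gourmet", "alcohol"],
}
cooccurrence = count_category_cooccurrence(user_categories)
```

With this data, user A contributes three pairs, user B contributes none, and user C contributes one more count to the alcohol and gourmet pair.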

FIG. 16 is a flowchart showing the procedure of the first similarity calculation process. The CPU 11 extracts a first category and a second category for which the first similarity is to be calculated (step S161). The CPU 11 refers to the category co-occurrence file 157 and reads the numbers of combinations between the first category and the categories other than the second category (step S162). The CPU 11 generates a first vector based on the numbers of combinations read (step S163). The CPU 11 refers to the category co-occurrence file 157 and reads the numbers of combinations between the second category and the categories other than the first category (step S164). The CPU 11 generates a second vector based on the numbers of combinations read (step S165).

The CPU 11 reads Equation 1 stored in the storage unit 15 (step S166). The CPU 11 substitutes the first vector and the second vector into Equation 1 to calculate the first similarity (step S167). The CPU 11 stores the calculated first similarity in the category co-occurrence similarity file 158 in association with the first category and the second category (step S168). The CPU 11 determines whether the processing has been completed for all combinations of categories stored in the category co-occurrence file 157 (step S169). When the CPU 11 determines that the processing has not been completed for all combinations of categories (NO in step S169), the process proceeds to step S1610.

The CPU 11 extracts another combination of a first category and a second category, different from the one extracted in step S161 (step S1610). The process then returns to step S162. By repeating the above processing, the first similarity can be calculated for all combinations of categories. When the CPU 11 determines that the processing has been completed for all combinations of categories (YES in step S169), the series of processes ends.
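The vector construction of steps S162 to S165 and the substitution in step S167 might look like the following sketch. Equation 1 is not reproduced in this part of the description, so cosine similarity is assumed here purely for illustration; the helper names, the sorted-tuple key convention, and the sample counts are likewise hypothetical.

```python
import math


def pair_count(cooccurrence, a, b):
    """Look up the count for an unordered pair (keys are sorted tuples)."""
    return cooccurrence.get(tuple(sorted((a, b))), 0)


def cosine(u, v):
    """Cosine similarity; an ASSUMED stand-in for Equation 1."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def first_similarity(cooccurrence, categories, first, second):
    """Compare how two categories co-occur with all remaining categories."""
    others = [c for c in categories if c not in (first, second)]
    first_vector = [pair_count(cooccurrence, first, c) for c in others]    # S162-S163
    second_vector = [pair_count(cooccurrence, second, c) for c in others]  # S164-S165
    return cosine(first_vector, second_vector)                             # S167


# Hypothetical co-occurrence counts, invented for illustration.
categories = ["alcohol", "anime", "fashion", "game", "gourmet"]
cooccurrence = {
    ("alcohol", "fashion"): 2,
    ("fashion", "gourmet"): 2,
    ("anime", "game"): 3,
    ("alcohol", "gourmet"): 1,
}
sim_food = first_similarity(cooccurrence, categories, "gourmet", "alcohol")
sim_mixed = first_similarity(cooccurrence, categories, "gourmet", "anime")
```

Because both vectors have only one element per remaining category, this comparison is cheap relative to one over the full phrase vocabulary, which is the point of using it as a pre-filter.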

FIGS. 17 and 18 are flowcharts showing the procedure of the second similarity calculation process. The CPU 11 reads the first threshold stored in the storage unit 15 (step S171). The CPU 11 refers to the category co-occurrence similarity file 158 and extracts the combinations of categories whose first similarity exceeds the first threshold (step S172). The CPU 11 performs the following processing for each extracted combination of categories. The CPU 11 refers to the category phrase list file 155 and reads the phrases and appearance frequencies of one category (step S173). The CPU 11 sorts the appearance frequencies by phrase order (step S174). The phrase order may be determined in advance, for example the order of the Japanese syllabary. The CPU 11 generates a vector for the one category whose elements are the appearance frequencies of the phrases (step S175).

The CPU 11 refers to the category phrase list file 155 and reads the phrases and appearance frequencies of the other category paired with the one category (step S176). The CPU 11 sorts the appearance frequencies by phrase order (step S177). The CPU 11 generates a vector for the other category whose elements are the appearance frequencies of the phrases (step S178). The CPU 11 reads Equation 1 from the storage unit 15 (step S179). The CPU 11 substitutes the vector of the one category and the vector of the other category into Equation 1 to calculate the second similarity (step S181).

The CPU 11 stores the calculated second similarity in the content similarity file 159 in association with the one category and the other category (step S182). The CPU 11 determines whether the processing has been completed for all combinations of categories extracted in step S172 (step S183). When the CPU 11 determines that the processing has not been completed (NO in step S183), the process proceeds to step S184. The CPU 11 extracts the one category and the other category of another combination (step S184). The process then returns to step S173. By repeating the above processing, the second similarity can be calculated for all combinations of categories whose first similarity exceeds the first threshold. When the CPU 11 determines that the processing has been completed for all combinations of categories (YES in step S183), the series of processes ends.

In this example, the processing described above uses the first similarity, which requires less computation than the second similarity, to narrow down the combinations for which the second similarity is calculated, thereby reducing the computation required for the second similarity. As a result, in this example, categories for which the model is prone to fluctuation with respect to other categories can be found within a practical amount of time.
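The two-stage filtering of steps S171 to S182 can be sketched as follows. As before, cosine similarity is only an assumed stand-in for Equation 1, which is not reproduced in this part of the description. Aligning the two vectors on the alphabetically sorted union of both vocabularies is also an assumption; the text only requires a predetermined phrase order. All names, thresholds, and sample values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; an ASSUMED stand-in for Equation 1."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def second_similarity(terms_a, terms_b):
    """Compare two categories by their phrase frequency vectors."""
    # A fixed phrase order (here: sorted union) keeps elements aligned.
    vocabulary = sorted(set(terms_a) | set(terms_b))
    vector_a = [terms_a.get(term, 0) for term in vocabulary]  # S173-S175
    vector_b = [terms_b.get(term, 0) for term in vocabulary]  # S176-S178
    return cosine(vector_a, vector_b)                          # S181


def second_similarities(first_sims, category_terms, first_threshold):
    """Compute the second similarity only for pairs passing the filter."""
    result = {}
    for (a, b), sim in first_sims.items():
        if sim > first_threshold:  # S172: keep pairs above the threshold
            result[(a, b)] = second_similarity(category_terms[a], category_terms[b])
    return result


# Hypothetical inputs, invented for illustration.
first_sims = {("alcohol", "gourmet"): 0.9, ("anime", "gourmet"): 0.1}
category_terms = {
    "gourmet": {"ramen": 2, "beer": 1},
    "alcohol": {"beer": 2, "wine": 1},
    "anime": {"robot": 4},
}
content_sims = second_similarities(first_sims, category_terms, 0.5)
```

Only the pair that passes the first threshold incurs the cost of building vocabulary-sized vectors, which is where the claimed reduction in computation comes from.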

FIG. 19 is a flowchart showing the procedure of the deletion process. The CPU 11 reads the second threshold stored in the storage unit 15 (step S191). The CPU 11 refers to the content similarity file 159 and extracts the combinations of categories whose second similarity exceeds the second threshold (step S192). The CPU 11 refers to the category article list file 156 and reads the article IDs of one category (step S193). The CPU 11 identifies the user corresponding to each article ID (step S194). Specifically, the CPU 11 either refers to the user identification information attached to the article ID or refers to the user article list file 154 to identify the user corresponding to the article ID.

The CPU 11 refers to the category article list file 156 and reads the article IDs of the other category paired with the one category extracted in step S192 (step S195). The CPU 11 identifies the user corresponding to each article ID (step S196). The CPU 11 deletes from the category article list file 156 the article IDs whose identified users do not match (step S197). In other words, the CPU 11 keeps only the article IDs of users who are common to both categories. The CPU 11 determines whether all combinations of categories extracted in step S192 have been processed (step S198).

When the CPU 11 determines that not all combinations of categories have been processed (NO in step S198), the process proceeds to step S199. The CPU 11 selects another combination of categories (step S199). The process then proceeds to step S193. By repeating the above processing, the deletion process is completed for all combinations of categories whose second similarity exceeds the second threshold. When the CPU 11 determines that all combinations of categories have been processed (YES in step S198), the series of processes ends. In this way, narrowing down in advance to the combinations of categories with a high first similarity makes it possible to greatly reduce the amount of calculation.
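The deletion in steps S193 to S197 reduces to keeping, for a pair of categories, only the users present in both. The sketch below assumes the category article list file 156 is held in memory as a mapping from category to user to article IDs; the function name is hypothetical, and apart from H001 and H005 the article IDs are invented for illustration.

```python
def prune_unshared_users(category_articles, pair):
    """Keep only article IDs of users who appear in both categories.

    `category_articles` maps category -> {user: [article IDs]} and is
    modified in place for the two categories in `pair`.
    """
    one, other = pair
    # Users identified in both categories (steps S194 and S196).
    shared = set(category_articles[one]) & set(category_articles[other])
    for category in pair:
        # Step S197: drop article IDs whose user is not shared.
        category_articles[category] = {
            user: ids
            for user, ids in category_articles[category].items()
            if user in shared
        }
    return category_articles


# Mirrors the gourmet / alcohol example; IDs other than H001 and
# H005 are hypothetical.
category_articles = {
    "gourmet": {"A": ["A001"], "C": ["C002"], "H": ["H001", "H005"]},
    "alcohol": {"A": ["A003"], "B": ["B001"], "C": ["C004"]},
}
prune_unshared_users(category_articles, ("gourmet", "alcohol"))
```

After the call, user H is gone from “gourmet” and user B is gone from “alcohol”, while users A and C remain in both.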

Embodiment 2
FIG. 20 is a functional block diagram showing the operation of the computer 1 of the embodiment described above. When the CPU 11 executes the control program 15P, the computer 1 operates as follows. The calculation unit 201 refers to the categories associated with users and calculates the number of combinations between categories. The first similarity calculation unit 202 calculates the first similarity between categories based on the calculated number of combinations between categories. The extraction unit 203 extracts the combinations of categories whose first similarity exceeds a threshold. The second similarity calculation unit 204 calculates the second similarity between the extracted categories based on the terms and the number of terms corresponding to each extracted category.

FIG. 21 is a block diagram showing the hardware group of the computer 1 according to Embodiment 2. The program for operating the computer 1 may be stored in the storage unit 15 by having a reading unit 10A, such as a disk drive, read a portable recording medium 1A such as a CD-ROM, a DVD (Digital Versatile Disc), a memory card, or a USB (Universal Serial Bus) memory. Alternatively, a semiconductor memory 1B, such as a flash memory storing the program, may be mounted in the computer 1. The program can also be downloaded from another server computer (not shown) connected via a communication network N such as the Internet. The details are described below.

The computer 1 shown in FIG. 21 reads the program for executing the various software processes described above from the portable recording medium 1A or the semiconductor memory 1B, or downloads it from another server computer (not shown) via the communication network N. The program is installed as the control program 15P, loaded into the RAM 12, and executed. The computer thereby functions as the computer 1 described above.

Embodiment 2 is configured as described above and is otherwise the same as Embodiment 1. Corresponding parts are therefore given the same reference numerals, and their detailed description is omitted.

The following appendices are further disclosed with respect to the embodiments, including Embodiments 1 and 2 above.

(Appendix 1)
An information processing method using a computer, the method comprising:
referring to categories associated with users and calculating the number of combinations between categories;
calculating a first similarity between categories based on the calculated number of combinations between categories;
extracting combinations of categories having a first similarity exceeding a threshold; and
calculating a second similarity between the extracted categories based on the terms and the number of terms corresponding to each extracted category.

(Appendix 2)
The information processing method according to Appendix 1, wherein the number of combinations between categories across a plurality of users is calculated by referring to a storage unit that stores categories in association with each user and counting the combinations between categories for each user.

(Appendix 3)
The information processing method according to Appendix 1 or 2, wherein the first similarity between a first category and a second category is calculated based on the calculated numbers of combinations between the first category and categories other than the second category, and the calculated numbers of combinations between the second category and categories other than the first category.

(Appendix 4)
The information processing method according to any one of Appendices 1 to 3, wherein
the terms and the number of terms of one extracted category are read by referring to a storage unit that stores terms and the number of terms in association with categories,
the terms and the number of terms of the other category are read by referring to the storage unit, and
the second similarity between the one category and the other category is calculated based on the terms and the number of terms read for the one category and the terms and the number of terms read for the other category.

(Appendix 5)
The information processing method according to any one of Appendices 1 to 4, wherein
combinations of categories having a second similarity exceeding a threshold are extracted, and
information on users that does not match between the extracted categories is deleted by referring to a storage unit that stores information on users in association with categories.

(Appendix 6)
A program causing a computer to execute a process comprising:
referring to categories associated with users and calculating the number of combinations between categories;
calculating a first similarity between categories based on the calculated number of combinations between categories;
extracting combinations of categories having a first similarity exceeding a threshold; and
calculating a second similarity between the extracted categories based on the terms and the number of terms corresponding to each extracted category.

(Appendix 7)
An information processing apparatus comprising:
a calculation unit that refers to categories associated with users and calculates the number of combinations between categories;
a first similarity calculation unit that calculates a first similarity between categories based on the calculated number of combinations between categories;
an extraction unit that extracts combinations of categories having a first similarity exceeding a threshold; and
a second similarity calculation unit that calculates a second similarity between the extracted categories based on the terms and the number of terms corresponding to each extracted category.

DESCRIPTION OF SYMBOLS
1 Computer
1A Portable recording medium
1B Semiconductor memory
2 Server computer
10A Reading unit
11 CPU
12 RAM
13 Input unit
14 Display unit
15 Storage unit
15P Control program
16 Communication unit
151 Category list file
152 User blog article list file
153 User category list file
154 User article list file
155 Category phrase list file
156 Category article list file
157 Category co-occurrence file
158 Category co-occurrence similarity file
159 Content similarity file
201 Calculation unit
202 First similarity calculation unit
203 Extraction unit
204 Second similarity calculation unit
N Communication network

Claims

An information processing method using a computer, the method comprising:
referring to categories associated with users and calculating the number of combinations between categories;
calculating a first similarity between categories based on the calculated number of combinations between categories;
extracting combinations of categories having a first similarity exceeding a threshold; and
calculating a second similarity between the extracted categories based on the terms and the number of terms corresponding to each extracted category.

The information processing method according to claim 1, wherein the number of combinations between categories across a plurality of users is calculated by referring to a storage unit that stores categories in association with each user and counting the combinations between categories for each user.

The information processing method according to claim 1 or 2, wherein the first similarity between a first category and a second category is calculated based on the calculated numbers of combinations between the first category and categories other than the second category, and the calculated numbers of combinations between the second category and categories other than the first category.

A program causing a computer to execute a process comprising:
referring to categories associated with users and calculating the number of combinations between categories;
calculating a first similarity between categories based on the calculated number of combinations between categories;
extracting combinations of categories having a first similarity exceeding a threshold; and
calculating a second similarity between the extracted categories based on the terms and the number of terms corresponding to each extracted category.

An information processing apparatus comprising:
a calculation unit that refers to categories associated with users and calculates the number of combinations between categories;
a first similarity calculation unit that calculates a first similarity between categories based on the calculated number of combinations between categories;
an extraction unit that extracts combinations of categories having a first similarity exceeding a threshold; and
a second similarity calculation unit that calculates a second similarity between the extracted categories based on the terms and the number of terms corresponding to each extracted category.