JP6390139B2

JP6390139B2 - Document search device, document search method, program, and document search system

Info

Publication number: JP6390139B2
Application number: JP2014074159A
Authority: JP
Inventors: 侑吾西川
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 2014-03-31
Filing date: 2014-03-31
Publication date: 2018-09-19
Anticipated expiration: 2034-03-31
Also published as: JP2015197722A

Description

The present invention relates to a document search technique for retrieving content, such as documents and books, that is suited to a user.

In recent years, matching methods that recommend documents matching a user's preferences to a user who has selected a document, book, or the like have been researched and developed. The main existing matching methods include recommending documents with similar content (document-based), recommending documents desired by other users who selected the same document (collaborative), and recommending documents according to the user's attributes (rule-based).

Meanwhile, Patent Document 1 describes a technique for extracting unexpected sentences from a document.

Patent Document 1: JP 2011-95905 A

However, when existing matching methods are used, already-known documents, such as those in the same series, are often presented, or multiple documents that are too similar to one another are presented, so the recommendation results lack unexpectedness and diversity.

Patent Document 1 describes assigning an unexpectedness score to sentences in a document selected by the user (hereinafter referred to as the search target document) based on the number of times each word appearing in the search target document appears within a category, but it does not describe searching for documents based on the unexpectedness score.

The present invention has been made in view of the above problems, and its object is to provide a document search device and the like that searches for documents having unexpectedness and diversity, in addition to similarity, with respect to a document specified by the user.

To solve the above problem, a first invention is a document search device that searches for recommended documents related to a search target document, the device comprising: first storage means for managing and storing feature words for each recommended document; second storage means for managing and storing feature words for each classification of recommended documents; document similarity calculation means for calculating, with reference to the first storage means, a document similarity, which is the similarity between the target document and a recommended document on a per-document basis; classification similarity calculation means for calculating, with reference to the second storage means, a classification similarity, which is the similarity between the target document and a recommended document on a per-classification basis; and search means for taking the difference between the document similarity and the classification similarity and extracting the recommended documents with reference to the calculated difference value.
According to the first invention, recommended documents having unexpectedness and diversity can be retrieved for the search target document serving as the starting point. This is possible because recommended documents are extracted based on the relationship between document similarity and classification similarity.
Note that a "classification" is a collection of recommended documents of the same type and corresponds to a category in the embodiment.

It is also desirable to extract recommended documents whose difference value exceeds a preset threshold.
This makes it possible to extract recommended documents that have a high document similarity but a low classification similarity, so recommended documents having unexpectedness and diversity can be retrieved.

It is also desirable that the search means take the difference after multiplying the classification similarity by a predetermined coefficient.
This allows the influence of classification similarity to be adjusted with the predetermined coefficient, making it possible to search either for recommended documents whose classifications are similar to some degree or for recommended documents whose classifications are dissimilar.
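The difference-based extraction described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function names, the coefficient name `alpha`, the direction of the subtraction (document similarity minus weighted classification similarity, inferred from the stated effect), and the sample similarity values are all assumptions.

```python
def unexpectedness_score(doc_sim: float, cat_sim: float, alpha: float = 1.0) -> float:
    """Document similarity minus the coefficient-weighted classification similarity."""
    return doc_sim - alpha * cat_sim


def extract_by_threshold(candidates, alpha=1.0, threshold=0.0):
    """Return the IDs of candidates whose difference value exceeds the threshold.

    `candidates` maps a document ID to a (doc_sim, cat_sim) pair.
    """
    return [
        doc_id
        for doc_id, (doc_sim, cat_sim) in candidates.items()
        if unexpectedness_score(doc_sim, cat_sim, alpha) > threshold
    ]


# Hypothetical similarity values in [0, 1] for three candidate documents.
sample = {"B": (0.9, 0.8), "E": (0.1, 0.1), "F": (0.8, 0.2)}
print(extract_by_threshold(sample, alpha=1.0, threshold=0.3))
```

Lowering `alpha` weakens the penalty for a similar classification, so documents from somewhat similar classifications can also pass the threshold.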

It is desirable that the search means search for the recommended documents using document similarities and/or classification similarities that satisfy preset thresholds.
This makes it possible to exclude dissimilar documents or to exclude highly similar classifications, so recommended documents having unexpectedness and diversity can be retrieved.
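As an illustration of this threshold-based exclusion, the sketch below drops candidates before any scoring takes place. The function name, the parameter names `min_doc_sim` and `max_cat_sim`, and the choice to make each threshold optional (to reflect the "and/or" wording) are assumptions made for the example.

```python
from typing import Dict, Optional, Tuple


def prefilter(
    candidates: Dict[str, Tuple[float, float]],
    min_doc_sim: Optional[float] = None,
    max_cat_sim: Optional[float] = None,
) -> Dict[str, Tuple[float, float]]:
    """Keep candidates that satisfy the enabled thresholds.

    `candidates` maps a document ID to a (doc_sim, cat_sim) pair.
    Passing None for a threshold disables that check.
    """
    kept = {}
    for doc_id, (doc_sim, cat_sim) in candidates.items():
        if min_doc_sim is not None and doc_sim < min_doc_sim:
            continue  # exclude documents that are not similar enough
        if max_cat_sim is not None and cat_sim > max_cat_sim:
            continue  # exclude documents whose classification is too similar
        kept[doc_id] = (doc_sim, cat_sim)
    return kept


# Hypothetical similarity values for three candidate documents.
pool = {"B": (0.9, 0.8), "E": (0.1, 0.1), "F": (0.8, 0.2)}
print(prefilter(pool, min_doc_sim=0.5, max_cat_sim=0.5))
```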

A second invention is a computer-implemented document search method for searching for recommended documents related to a search target document, in which a control unit of the computer performs: a document similarity calculation step of calculating, with reference to first storage means that manages and stores feature words for each recommended document, a document similarity, which is the similarity between the target document and a recommended document on a per-document basis; a classification similarity calculation step of calculating, with reference to second storage means that manages and stores feature words for each classification of recommended documents, a classification similarity, which is the similarity between the target document and a recommended document on a per-classification basis; and a search step of taking the difference between the document similarity and the classification similarity and extracting the recommended documents with reference to the calculated difference value.
According to the second invention, recommended documents having unexpectedness and diversity can be retrieved for the search target document serving as the starting point. This is possible because recommended documents are extracted based on the relationship between document similarity and classification similarity.

A third invention is a program for causing a computer to function as a document search device that searches for recommended documents related to a search target document, the program causing the computer to function as: first storage means for managing and storing feature words for each recommended document; second storage means for managing and storing feature words for each classification of recommended documents; document similarity calculation means for calculating, with reference to the first storage means, a document similarity, which is the similarity between the target document and a recommended document on a per-document basis; classification similarity calculation means for calculating, with reference to the second storage means, a classification similarity, which is the similarity between the target document and a recommended document on a per-classification basis; and search means for taking the difference between the document similarity and the classification similarity and extracting the recommended documents with reference to the calculated difference value.
According to the third invention, recommended documents having unexpectedness and diversity can be retrieved for the search target document serving as the starting point. This is possible because recommended documents are extracted based on the relationship between document similarity and classification similarity.

A fourth invention is a document search system in which a user terminal and a document search device that searches for recommended documents related to a search target document are connected via a network. The user terminal comprises: input receiving means for receiving input of the search target document; transmission means for transmitting the search target document to the document search device; and display means for receiving and displaying search results from the document search device. The document search device comprises: first storage means for managing and storing feature words for each recommended document; second storage means for managing and storing feature words for each classification of recommended documents; receiving means for receiving the search target document from the user terminal; document similarity calculation means for calculating, with reference to the first storage means, a document similarity, which is the similarity between the target document and a recommended document on a per-document basis; classification similarity calculation means for calculating, with reference to the second storage means, a classification similarity, which is the similarity between the target document and a recommended document on a per-classification basis; search means for taking the difference between the document similarity and the classification similarity and extracting the recommended documents with reference to the calculated difference value; and transmission means for transmitting the extracted recommended documents to the user terminal.

With the document search device and the like of the present invention, documents having unexpectedness and diversity, in addition to similarity, can be retrieved for a document specified by the user.

FIG. 1 is a diagram showing an example of the system configuration of the document search system according to the present embodiment.
FIG. 2 is a diagram outlining the document search service according to the present embodiment.
FIG. 3 is a block diagram showing an example hardware configuration of the document search device (user terminal) according to the present embodiment.
FIG. 4 is a diagram showing an example of the category information stored in the category information database.
FIG. 5 is a diagram showing an example of the document information stored in the document information database.
FIG. 6 is a diagram showing an example of the information temporarily held in the document search device.
FIG. 7 is a flowchart showing the flow of the document recommendation process.
FIG. 8 is a diagram explaining the document vector.
FIG. 9 is a diagram showing an example of the document recommendation screen.
FIG. 10 is a flowchart showing the flow of the category similarity calculation process.
FIG. 11 is a flowchart showing the flow of the document similarity calculation process.
FIG. 12 is a flowchart showing the flow of the unexpectedness score calculation process.
FIG. 13 is a flowchart showing the flow of the document recommendation process in a document search device that is not connected to a network.

Preferred embodiments of the present invention are described in detail below with reference to the drawings.
First, the configuration of the present embodiment is described with reference to FIGS. 1 to 6.

FIG. 1 is a diagram showing an example of the system configuration of a document search system 100 according to the present embodiment. As shown in FIG. 1, the document search system 100 comprises a document search device 1 and one or more user terminals 2 (2a, 2b) communicatively connected to one another via a network 3.

The present embodiment describes an example in which the document search system 100 according to the present invention is used in a document recommendation service. For a document selected by a user of a user terminal 2 (the search target document), the service searches the documents registered in a document information database 6 based on, for example, similarity to the search target document, and presents the results on the user terminal 2. FIG. 2 is a diagram outlining the document recommendation service.

In the present invention, a document is a digitized book, magazine, article, paper, or other written material, article content published on the Internet, or the like.

As shown in FIG. 2, document recommendation method a (a conventional document recommendation service) searches the database for documents with a high similarity to the search target document 10 (document A) and recommends them. In this case, the similarity between the search target document 10 and the recommended documents 11 (documents B, C, and D) is high, and the similarity between the search target document 10 and the category 12 (category X) to which the recommended documents belong is also high. Consequently, in document recommendation method a, the recommended documents 11 share the classification of the search target document 10 and have similar content, so documents with little unexpectedness are presented to the user.

In contrast, document recommendation method b searches the database for documents with a low similarity to the search target document 10 (document A) and recommends them. In this case, the similarity between the search target document 10 and the recommended document 11 (document E) is low, and the similarity between the search target document 10 and the category 12 (category Y) to which the recommended document belongs is also low. Consequently, in document recommendation method b, the recommended document 11 is far from the search target document 10 in classification and dissimilar in content, so an unrelated document is presented to the user.

The document recommendation service according to the present invention corresponds to document recommendation method c. It searches the database for a document (document F) that has a high similarity to the search target document 10 (document A) while the category 12 (category Z) to which it belongs has a low similarity to the search target document, and recommends that document. This makes it possible to present to the user a document that differs in classification but is similar in content, that is, an unexpected document.
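The three recommendation methods of FIG. 2 can be read as regions of a two-axis space (document similarity and category similarity). The sketch below labels a candidate accordingly. The function name, the single cut-off value `high`, and the label "other" for the combination the text does not discuss are assumptions for illustration only.

```python
def recommendation_pattern(doc_sim: float, cat_sim: float, high: float = 0.5) -> str:
    """Label a candidate with the FIG. 2 method whose region it falls into."""
    doc_high = doc_sim >= high
    cat_high = cat_sim >= high
    if doc_high and cat_high:
        return "a"  # similar content, similar category: little unexpectedness
    if not doc_high and not cat_high:
        return "b"  # dissimilar content, distant category: unrelated
    if doc_high and not cat_high:
        return "c"  # similar content, different category: unexpected
    return "other"  # low document similarity, high category similarity (not covered in the text)


# Hypothetical values for documents B, E, and F.
for name, sims in {"B": (0.9, 0.8), "E": (0.1, 0.1), "F": (0.8, 0.2)}.items():
    print(name, recommendation_pattern(*sims))
```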

Returning to the description of FIG. 1.
The document search device 1 is a server device of a site that provides the document recommendation service. The storage unit 22 of the document search device 1 holds a category information database (DB) 5 that manages information on the categories to which documents belong, a document information database (DB) 6 that manages document information, and the like. The document search device 1 may also hold user history information that associates the identification information of each user terminal 2 using the document recommendation service with its history of document (for example, book) purchases (or views, etc.). Details are described later.

The user terminal 2 is a computer used by a user of the document recommendation service, and displays the document recommendation screen 80 (see FIG. 9) and the like transmitted from the document search device 1. Instead of a general-purpose computer, the user terminal 2 may be a portable terminal, a mobile terminal, or the like.

FIG. 3 is a hardware configuration diagram of a computer that implements the document search device 1 (or user terminal 2) according to the embodiment of the present invention. As shown in FIG. 3, the computer comprises, for example, a control unit 21, a storage unit 22, a media input/output unit 23, a communication control unit 24, an input unit 25, a display unit 26, and a peripheral device I/F unit 27, connected via a bus 28.

The control unit 21 comprises a CPU (Central Processing Unit), ROM (Read Only Memory), RAM (Random Access Memory), and the like.
The CPU loads programs stored in the storage unit 22, the ROM, a storage medium, or the like into a work memory area in the RAM and executes them, drives and controls each device connected via the bus 28, and implements the processing, described later, performed by the document search device 1 (or user terminal 2). The ROM is non-volatile memory that permanently holds the computer's boot program, programs such as the BIOS, data, and the like. The RAM is volatile memory that temporarily holds loaded programs, data, and the like, and provides a work area used by the control unit 21 to perform each process.

The storage unit 22 is an HDD (Hard Disk Drive) or the like, and stores the programs executed by the control unit 21, the data needed to execute them, the OS (Operating System), and the like. The control unit 21 reads this program code as needed and transfers it to the RAM, where the CPU reads and executes it.

The media input/output unit 23 is a media input/output device such as a CD drive, DVD drive, MO drive, or floppy (registered trademark) disk drive, and inputs and outputs data such as images.
The communication control unit 24 has a communication control device, a communication port, and the like. It is a communication interface that mediates communication between the computer and the network 3, and controls communication with other devices via the network 3. The network 3 may be wired or wireless.

The input unit 25 is used for data input and has input devices such as a keyboard, a pointing device such as a mouse, and a numeric keypad. It outputs the entered data to the control unit 21.
The display unit 26 comprises a display device such as a CRT monitor or liquid crystal panel and a logic circuit (such as a video adapter) that performs display processing in cooperation with the display device. It displays on the display device the display information supplied under the control of the control unit 21.
The input unit 25 and the display unit 26 may be integrated into a single unit, for example a display with a touch panel.

The peripheral device I/F (interface) unit 27 is a port for connecting peripheral devices to the computer, and the computer exchanges data with peripheral devices through it. The peripheral device I/F unit 27 uses USB, IEEE 1394, RS-232C, or the like, and usually has multiple peripheral device interfaces. The connection to a peripheral device may be wired or wireless.
The bus 28 is a path that mediates the exchange of control signals, data signals, and the like between the devices.

FIG. 4 is a diagram showing an example of the category information 50 handled by the category information database 5. As shown in FIG. 4, the category information 50 includes, for each category, information such as a "category code 51", "category name 52", "category description 53", and "feature words 54". The category code 51 is an identifier that uniquely identifies a category, for example a Nippon Decimal Classification code number used to classify books or a genre registered to classify article documents. The category name 52 is a heading that represents the content of the category. The category description 53 is text describing the category, and the feature words 54 are words (keywords) representing the content of the category together with their importance (described later).

Alternatively, the category information 50 need not hold the feature words 54. Instead, the feature words 54 may be extracted from the category description 53 or from the feature words 65 of each document belonging to the category every time the document recommendation process described later is executed.

FIG. 5 is a diagram showing an example of the document information 60 handled by the document information database 6. As shown in FIG. 5, the document information 60 includes, for each document, information such as a "document code 61", "document name 62", "category code 63", "document description 64", and "feature words 65". The document code 61 is an identifier that uniquely identifies a document. The document name 62 is the title of the document. The category code 63 indicates the category to which the document belongs and is linked to the category code 51 in the category information database 5. Documents registered in the document information database 6 are classified into one or more categories in advance.

The document description 64 is introductory text describing the content of the document, and the feature words 65 are words (keywords) representing the content of the document together with their importance (described later). Alternatively, the document information 60 need not hold the feature words 65. Instead, the feature words 65 may be extracted from the document description 64 or the body of the document every time the document recommendation process described later is executed.
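As a rough picture of the two record types in FIGS. 4 and 5, the sketch below models them as Python dataclasses. The field names, the use of a float for keyword importance, the helper function, and all sample values are assumptions; the source only lists which items each record contains.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CategoryInfo:
    """One row of the category information (FIG. 4)."""
    category_code: str                 # unique identifier of the category
    category_name: str                 # heading for the category
    description: str                   # text describing the category
    feature_words: Dict[str, float] = field(default_factory=dict)  # keyword -> importance


@dataclass
class DocumentInfo:
    """One row of the document information (FIG. 5)."""
    document_code: str                 # unique identifier of the document
    document_name: str                 # title
    category_codes: List[str]          # one or more categories the document belongs to
    description: str                   # introductory text
    feature_words: Dict[str, float] = field(default_factory=dict)  # keyword -> importance


def documents_in_category(documents: List[DocumentInfo], category_code: str) -> List[DocumentInfo]:
    """Return the documents linked to the given category code."""
    return [d for d in documents if category_code in d.category_codes]


# Hypothetical sample records.
cat = CategoryInfo("C01", "Sample category", "Sample category text", {"travel": 0.5})
doc = DocumentInfo("D001", "Sample title", ["C01", "C02"], "Sample text", {"travel": 0.8})
print(documents_in_category([doc], cat.category_code))
```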

FIG. 6 is a diagram showing an example of the information temporarily held in the document search device 1 during the document recommendation process. As shown in FIG. 6(a), the RAM or the like of the control unit 21 of the document search device 1 stores, as category similarity information 56, a "category code 71" associated with the "category similarity 72 to the search target document" calculated in the category similarity calculation process (S106 in FIG. 7). The category code 71 is linked to the category code 51 in the category information database 5.

As shown in FIG. 6(b), the RAM or the like of the control unit 21 of the document search device 1 stores, as document similarity information 57, a "document code 73" associated with the "document similarity 74 to the search target document" calculated in the document similarity calculation process (S107 in FIG. 7) and the "unexpectedness score 75 with respect to the search target document" calculated in the unexpectedness score calculation process (S108 in FIG. 7). The document code 73 is linked to the document code 61 in the document information database 6.

[Document recommendation process]
Next, the document recommendation process executed by the document search device 1 and the user terminal 2 is described with reference to FIGS. 7 to 13.

FIG. 7 is a flowchart showing an example of the document recommendation process.
When the user terminal 2 accesses the document search device 1, the control unit 21 of the document search device 1 transmits a document selection screen (not shown) to the user terminal 2. The document selection screen shows the documents registered in advance in the document information DB 6 as a list or as thumbnails, and the control unit 21 of the user terminal 2 accepts the selection of a document through a user operation (step S101).

The control unit 21 of the user terminal 2 transmits the document code 61 of the selected document (the search target document) to the document search device 1 (step S102). The control unit 21 of the document search device 1 receives the document code 61 of the search target document (step S103).

The control unit 21 of the document search device 1 extracts, from the document information DB 6, the keywords of the feature words 65 corresponding to the received document code 61 and the number of occurrences of each keyword (step S104), and generates a document vector for the search target document (step S105).

Specifically, the control unit 21 of the document search device 1 performs morphological analysis on the text of each document (document name 62, document description 64, and document content) using general-purpose morphological analysis software and extracts keywords. It then calculates the TF of each extracted keyword to generate the document vector.
Note that TF (term frequency) is the frequency (number of occurrences) with which a keyword appears in a document.
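The TF-based document vector described above can be sketched as a keyword-to-count mapping. In this minimal example a simple regular-expression tokenizer stands in for the morphological analysis software, which is a deliberate simplification; the function names and the sample text are hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Stand-in for morphological analysis: split into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_vector(*fields: str) -> Counter:
    """Count keyword occurrences across the given text fields (title, description, body)."""
    counts = Counter()
    for text in fields:
        counts.update(tokenize(text))
    return counts


# Hypothetical document name and description.
vec = tf_vector("Travel cooking", "Cooking recipes from travel in Asia")
print(vec.most_common(2))
```

A `Counter` returns 0 for keywords that never occur, which is convenient when comparing vectors built from different documents.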

The control unit 21 of the document search device 1 calculates, for each category registered in the category information DB 5, a category similarity representing the similarity between the search target document and the group of documents belonging to that category (step S106). It then calculates, for each document registered in the document information DB 6, a document similarity representing the similarity between the search target document and that document (step S107), and calculates, for each document, an unexpectedness score with respect to the search target document (step S108). The processes executed in steps S106 to S108 are described in detail later.
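Steps S106 to S108 can be sketched as follows. The similarity measure is not specified in this passage, so cosine similarity over keyword-count vectors is used here purely as an assumption, as is the plain difference used for the score. The function names and the one-category-per-document mapping are also hypothetical simplifications.

```python
import math
from typing import Dict, Tuple

Vector = Dict[str, float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two sparse keyword vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_documents(
    target: Vector,
    doc_vecs: Dict[str, Vector],
    doc_category: Dict[str, str],
    category_vecs: Dict[str, Vector],
) -> Dict[str, Tuple[float, float, float]]:
    """Return (document similarity, category similarity, score) per document code."""
    # S106: similarity between the target document and each category
    category_sim = {c: cosine_similarity(target, v) for c, v in category_vecs.items()}
    results = {}
    for code, vec in doc_vecs.items():
        doc_sim = cosine_similarity(target, vec)   # S107: per-document similarity
        cat_sim = category_sim[doc_category[code]]
        results[code] = (doc_sim, cat_sim, doc_sim - cat_sim)  # S108: assumed plain difference
    return results


# Hypothetical vectors: document F matches the target, but its category Z does not.
scores = score_documents(
    {"travel": 1, "cooking": 1},
    {"F": {"travel": 1, "cooking": 1}},
    {"F": "Z"},
    {"Z": {"finance": 1}},
)
print(scores)
```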

Based on the unexpectedness scores 75 with respect to the search target document calculated in step S108 (stored in the document similarity information 57), the control unit 21 of the document search device 1 retrieves documents with high unexpectedness scores from the document information DB 6 (step S109) and transmits them to the corresponding user terminal 2 as recommended documents 11 (step S110).

The control unit 21 of the user terminal 2 receives the data transmitted from the document search device 1 (step S111), displays the document recommendation screen 80 on the display unit 26 (step S112), and ends the process.

FIG. 9 shows an example of the document recommendation screen 80. As shown in FIG. 9, the document recommendation screen 80 displays, as a list or as thumbnails, the search target document 10 selected by the user (document A) and the recommended documents 11 retrieved by the document recommendation process (document B, document G, and document K).

As described above, in the document recommendation process, when the document search device 1 accepts the selection of a search target document from the user terminal 2, it generates a document vector representing the features of the search target document. It then calculates the similarity between the search target document and the group of documents belonging to each category registered in the category information DB 5, and the similarity between the search target document and each document registered in the document information DB 6. Based on these similarities, the document search device 1 retrieves documents that belong to a category with low similarity to the document selected by the user but whose content is similar at the document level, and presents the retrieved documents to the user terminal 2 as recommended documents.
In this way, the document search device 1 can recommend documents that are unexpected and diverse relative to the document selected by the user.

The document recommendation process is not limited to the one described above and can be modified without departing from its spirit.
For example, in steps S101 to S103 the user terminal 2 lets the user select the search target document from documents registered in advance in the document information DB 6 of the document search device 1. Alternatively, the user terminal 2 may accept an arbitrary character string entered in a general search box as the search target document and send it to the document search device 1, or may upload (transmit) a previously created document to the document search device 1. In either case, the document search device 1 extracts keywords from the received search target document, calculates their importance, and generates a document vector for the search target document.

[Category similarity calculation process]
Next, an example of the category similarity calculation process executed in step S106 of the document recommendation process is described with reference to FIG. 10 and, where appropriate, FIG. 8. In this process, the control unit 21 of the document search device 1 calculates the similarity between the search target document and each comparison target category.

First, in step S201, the control unit 21 of the document search device 1 generates a document vector from the search target document.

The method of generating the document vector of the search target document in step S201 is described concretely with reference to FIG. 8(a). Document A in FIG. 8(a) corresponds to the search target document in this embodiment.

The control unit 21 of the document search device 1 generates the document vector of document A from the "keywords" of document A and the "number of occurrences" of each keyword.
For example, in FIG. 8(a) the keywords are "Moon, Earth, satellite, Osaka, Nagoya, baseball" and the numbers of occurrences are the figures shown in the table. In this case, the document vector Va of document A is Va = (2, 1, 1, 0, 0, 1).
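The mapping from keyword counts to the vector Va can be sketched as follows. This is an illustrative sketch only: the English keyword strings, the `KEYWORDS` list, and the `document_vector` helper are assumptions introduced here, and the counts are those implied by the vector Va given in the text.

```python
# Shared keyword order, following the order given in the text.
KEYWORDS = ["moon", "earth", "satellite", "osaka", "nagoya", "baseball"]


def document_vector(counts, keywords=KEYWORDS):
    """Lay out per-keyword occurrence counts as a vector in a fixed keyword order."""
    return [counts.get(keyword, 0) for keyword in keywords]


# Occurrence counts of document A, as implied by Va in the text.
counts_a = {"moon": 2, "earth": 1, "satellite": 1, "baseball": 1}
va = document_vector(counts_a)
```

Using one fixed keyword order for every document keeps the vectors comparable element by element, which the later similarity calculations rely on.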

Next, in step S202, the control unit 21 of the document search device 1 generates a document vector for each comparison target category.

The method of generating the document vector of a comparison target category is described concretely with reference to FIG. 8(b). As shown in FIG. 8(b), category X is assumed to contain document B and document C.

In the same way as for the document vector Va of document A, the control unit 21 of the document search device 1 generates the document vector Vb of document B and the document vector Vc of document C from the "keywords" of documents B and C and the "number of occurrences" of each keyword.

For example, in FIG. 8 the keywords are "Moon, Earth, satellite, Osaka, Nagoya, baseball" and the numbers of occurrences are the figures shown in the table. In this case, the document vector Vb of document B is Vb = (1, 1, 2, 0, 0, 2), and the document vector Vc of document C is Vc = (1, 0, 1, 0, 0, 0).

Finally, the document vector of category X is the sum of the document vector Vb of document B and the document vector Vc of document C; that is, the document vector Vx of category X is Vx = (2, 1, 3, 0, 0, 2).
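The category vector is an element-wise sum of the vectors of the documents in the category. A minimal sketch, with the helper name `category_vector` introduced here purely for illustration:

```python
def category_vector(doc_vectors):
    """Element-wise sum of the document vectors belonging to one category."""
    return [sum(column) for column in zip(*doc_vectors)]


vb = [1, 1, 2, 0, 0, 2]  # document B
vc = [1, 0, 1, 0, 0, 0]  # document C
vx = category_vector([vb, vc])  # category X
```

With the values from the text this reproduces Vx = (2, 1, 3, 0, 0, 2).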

The control unit 21 generates such a category document vector for every comparison target category.

Then, in step S203, the control unit 21 of the document search device 1 calculates the similarity (category similarity) between the document vector Va generated in step S201 and the document vector of each comparison target category generated in step S202.

For example, to calculate the similarity with category X, the control unit 21 of the document search device 1 calculates the similarity between the document vector Va of document A and the document vector Vx of category X (the category similarity). The cosine similarity (Va · Vx) / (|Va| |Vx|), for example, can be used as the category similarity. In the case of FIG. 8, the category similarity is approximately 0.89.
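The figure of approximately 0.89 can be checked by hand: Va · Vx = 4 + 1 + 3 + 2 = 10, |Va| = √7, and |Vx| = √18, so the cosine similarity is 10 / √126 ≈ 0.89. The same calculation as a short Python sketch follows; the function name and the handling of all-zero vectors are assumptions added here, not part of the embodiment.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity (u . v) / (|u| |v|) of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: an all-zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_u * norm_v)


va = [2, 1, 1, 0, 0, 1]  # search target document A
vx = [2, 1, 3, 0, 0, 2]  # category X
category_similarity = cosine_similarity(va, vx)
```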

Also in step S203, the control unit 21 of the document search device 1 associates the calculated category similarity 72 with the category code 71 of the comparison target category and stores them as category similarity information 56 in the storage unit 22 of the document search device 1, in the RAM of the control unit 21, or the like.

Once the similarities between the search target document and all comparison target categories have been calculated in step S203, the control unit 21 of the document search device 1 ends the process.

As described above, in the category similarity calculation process, the document search device 1 obtains the features of each category from a document vector representing the features of the group of documents belonging to that category (the category document vector) and quantifies its similarity to the document selected by the user. As a result, when searching for recommended content, content belonging to categories with low similarity to the document selected by the user can be extracted.

In this embodiment TF is used to generate the document vectors, but the method is not limited to this; the document vectors may instead be generated using TF-IDF (term frequency-inverse document frequency).
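The text does not specify which TF-IDF formulation would be used. The sketch below therefore assumes one common smoothed variant, idf = ln((1 + N) / (1 + df)) + 1, where N is the number of documents and df is the number of documents containing the keyword. The function name `tf_idf_vectors` is hypothetical, and documents A, B, and C from the example serve as a toy corpus.

```python
import math


def tf_idf_vectors(tf_vectors):
    """Re-weight TF vectors by a smoothed inverse document frequency."""
    n_docs = len(tf_vectors)
    n_terms = len(tf_vectors[0])
    # Document frequency: number of documents in which each keyword occurs.
    df = [sum(1 for vec in tf_vectors if vec[i] > 0) for i in range(n_terms)]
    # Assumed smoothed IDF; the +1 terms avoid division by zero when df is 0.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    return [[tf * w for tf, w in zip(vec, idf)] for vec in tf_vectors]


corpus = [
    [2, 1, 1, 0, 0, 1],  # document A
    [1, 1, 2, 0, 0, 2],  # document B
    [1, 0, 1, 0, 0, 0],  # document C
]
weighted = tf_idf_vectors(corpus)
```

Keywords that occur in every document keep their raw count, while rarer keywords are weighted up, so widely shared terms contribute relatively less to the similarity.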

[Document similarity calculation process]
Next, an example of the document similarity calculation process executed in step S107 of the document recommendation process is described with reference to FIG. 11.

FIG. 11 is a flowchart showing an example of the document similarity calculation process.
The control unit 21 of the document search device 1 inputs a comparison target document, which is one of the documents registered in the document information DB 6 (step S401).

The control unit 21 of the document search device 1 extracts the feature words 65 of the comparison target document from the document information DB 6 (step S402) and generates a document vector for the comparison target document (step S403). The control unit 21 then calculates the document similarity from the document vector of the search target document generated in step S105 and the document vector of the comparison target document generated in step S403 (step S404). The cosine similarity described above may be used as the document similarity.

The control unit 21 of the document search device 1 associates the calculated document similarity 74 with the document code 73 of the comparison target document and stores them as document similarity information 57 in the storage unit 22 of the document search device 1, in the RAM of the control unit 21, or the like (step S405), and then determines whether all documents registered in the document information DB 6 have been input (step S406).

If not all documents have been input (NO in step S406), the control unit 21 of the document search device 1 returns to step S401. If all documents have been input (YES in step S406), the control unit 21 of the document search device 1 ends the process.
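The loop over steps S401 to S406 can be sketched as a single pass over the registered documents. This is a sketch under stated assumptions: the in-memory dictionary `document_db` stands in for the document information DB 6, the returned dictionary stands in for the document similarity information 57, and all names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def document_similarities(query_vector, document_db):
    """Return {document code: similarity to the search target document}."""
    similarities = {}
    for code, vector in document_db.items():  # one comparison document per pass
        similarities[code] = cosine_similarity(query_vector, vector)
    return similarities


# Toy stand-in for the document DB, using documents B and C from the example.
document_db = {"B": [1, 1, 2, 0, 0, 2], "C": [1, 0, 1, 0, 0, 0]}
similarities = document_similarities([2, 1, 1, 0, 0, 1], document_db)
```

For document A as the query, this gives roughly 0.84 for document B (7 / √70) and 0.80 for document C (3 / √14).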

As described above, in the document similarity calculation process, the document search device 1 obtains the features of each document from a document vector representing those features and quantifies its similarity to the document selected by the user. As a result, when searching for recommended documents, documents similar to the document selected by the user can be extracted.

[Unexpectedness score calculation process]
Next, an example of the unexpectedness score calculation process executed in step S108 of the document recommendation process is described with reference to FIG. 12.

FIG. 12 is a flowchart showing an example of the unexpectedness score calculation process.
The control unit 21 of the document search device 1 inputs a comparison target document, which is one of the documents registered in the document information DB 6 (step S501).

The control unit 21 of the document search device 1 looks up the category code to which the comparison target document belongs and, based on that category code, extracts the category similarity 72 with the search target document from the category similarity information 56 (step S502).

Based on the document code of the comparison target document, the control unit 21 of the document search device 1 extracts the document similarity 74 with the search target document from the document similarity information 57 (step S503).

The control unit 21 of the document search device 1 calculates the unexpectedness score 75 of the search target document with respect to the comparison target document using formula (1) (step S504).

The weighting coefficient α in formula (1) is set so that, among documents with high similarity to the search target document, those belonging to categories with low similarity to the search target document receive high unexpectedness scores. The weighting coefficient α is set as appropriate based on the number of categories and the number of documents belonging to each category.
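Formula (1) itself is not reproduced in this text. The claims describe taking the difference between the document similarity and the classification similarity, with the classification similarity multiplied by a predetermined coefficient. On that basis, one plausible form, offered here only as an assumption, is:

```latex
\[
  S(q, d) \;=\; \mathrm{sim}_{\mathrm{doc}}(q, d) \;-\; \alpha \,\mathrm{sim}_{\mathrm{cat}}(q, c_d)
\]
```

Here q is the search target document, d is the comparison target document, c_d is the category to which d belongs, sim_doc is the document similarity 74, sim_cat is the category similarity 72, and α is the weighting coefficient. These symbols are introduced for illustration only. Under this form the score is high when the document similarity is high and the category similarity is low, which matches the stated purpose of α.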

The control unit 21 of the document search device 1 stores the calculated unexpectedness score in the corresponding document similarity information 57 (step S505) and determines whether all documents registered in the document information DB 6 have been input (step S506).

If not all documents have been input (NO in step S506), the control unit 21 of the document search device 1 returns to step S501. If all documents have been input (YES in step S506), the control unit 21 of the document search device 1 ends the process.

As described above, in the unexpectedness score calculation process, the document search device 1 can calculate an unexpectedness score for finding, among documents with high similarity to the search target document, those that belong to categories with low similarity to the search target document. As a result, when searching for recommended documents, documents with high unexpectedness scores relative to the document selected by the user can be extracted from the document information DB 6, and unexpected and diverse documents can be presented to the user.
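The selection of documents with high unexpectedness scores might look like the sketch below. It assumes the score is the document similarity minus α times the category similarity, which is how the claims describe the difference; the exact formula (1) is not reproduced in this text. All similarity values, category assignments, and names are hypothetical.

```python
def rank_by_unexpectedness(doc_sims, cat_sims, doc_category, alpha=1.0, top_k=3):
    """Rank documents by (document similarity - alpha * category similarity)."""
    scores = {
        code: doc_sims[code] - alpha * cat_sims[doc_category[code]]
        for code in doc_sims
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical similarities to the search target document (illustrative only).
doc_sims = {"B": 0.84, "G": 0.70, "K": 0.65, "M": 0.30}
cat_sims = {"X": 0.89, "Y": 0.20, "Z": 0.10}
doc_category = {"B": "X", "G": "Y", "K": "Z", "M": "Y"}

recommended = rank_by_unexpectedness(doc_sims, cat_sims, doc_category, top_k=2)
```

In this toy data, document B is the most similar document but sits in a highly similar category, so it ranks below documents G and K, which are fairly similar yet belong to dissimilar categories.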

The document whose selection is accepted from the user in step S101 of the document recommendation process may be a document that is not registered in the document information DB 6. For example, it may be an article obtained by the user or a document created by the user. In that case, the control unit 21 of the document search device 1 extracts feature words from the content of the accepted document itself (step S104) and generates a document vector (step S105).

The search target document in step S101 of the document recommendation process may also be determined based on user history information (for example, a document purchase history) stored in advance in the storage unit 22 of the document search device 1 or the like. In that case, the control unit 21 of the document search device 1 extracts feature words from the one or more search target documents determined from the history information (step S104) and generates a document vector (step S105).

The document search device 1 can also be used stand-alone, without being connected to a network. FIG. 13 is a flowchart showing an example of the document recommendation process of a document search device 1a that is not connected to a network.
The control unit 21 of the document search device 1a accepts input of a search target document from the user (step S601).
Steps S602 to S607 are the same as steps S104 to S109 of the document recommendation process of the document search device 1 shown in FIG. 7.
The control unit 21 of the document search device 1a then displays the extracted documents with high unexpectedness scores on the document recommendation screen (step S608).

Preferred embodiments of the document search system 100 and related aspects of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to these examples. It is clear that a person skilled in the art could conceive of various changes or modifications within the scope of the technical idea disclosed in this application, and it is understood that these naturally fall within the technical scope of the present invention.

DESCRIPTION OF SYMBOLS
1 ......... Document search device
2 (2a, 2b) ......... User terminal
3 ......... Network
5 ......... Category information DB
6 ......... Document information DB
21 ......... Control unit
22 ......... Storage unit
23 ......... Media input/output unit
24 ......... Communication control unit
25 ......... Input unit
26 ......... Display unit
27 ......... Peripheral device I/F unit
50 ......... Category information
56 ......... Category similarity information
57 ......... Document similarity information
60 ......... Document information
100 ......... Document search system

Claims

A document search device that searches for a recommended document related to a search target document, the document search device comprising:
first storage means for managing and storing feature words for each document of the recommended document;
second storage means for managing and storing feature words for each classification of the recommended document;
document similarity calculating means for calculating, with reference to the first storage means, a document similarity, which is a similarity in document units between the target document and the recommended document;
classification similarity calculating means for calculating, with reference to the second storage means, a classification similarity, which is a similarity in classification units between the target document and the recommended document; and
search means for taking the difference between the document similarity and the classification similarity and extracting the recommended document with reference to the calculated difference value.

The document search device according to claim 1, wherein the search means extracts recommended documents for which the difference value exceeds a preset threshold.

The document search device according to claim 1, wherein the search means obtains the difference after multiplying the classification similarity by a predetermined coefficient.

The document search device according to claim 1, wherein the search means searches for the recommended document using the document similarity and/or the classification similarity that satisfy a preset threshold.

A computer-based document search method for searching for a recommended document related to a search target document, wherein a control unit of the computer executes:
a document similarity calculation step of calculating, with reference to first storage means that manages and stores feature words for each document of the recommended document, a document similarity, which is a similarity in document units between the target document and the recommended document;
a classification similarity calculation step of calculating, with reference to second storage means that manages and stores feature words for each classification of the recommended document, a classification similarity, which is a similarity in classification units between the target document and the recommended document; and
a step of taking the difference between the document similarity and the classification similarity and extracting the recommended document with reference to the calculated difference value.

A program for causing a computer to function as a document search device that searches for a recommended document related to a search target document, the program causing the computer to function as:
first storage means for managing and storing feature words for each document of the recommended document;
second storage means for managing and storing feature words for each classification of the recommended document;
document similarity calculating means for calculating, with reference to the first storage means, a document similarity, which is a similarity in document units between the target document and the recommended document;
classification similarity calculating means for calculating, with reference to the second storage means, a classification similarity, which is a similarity in classification units between the target document and the recommended document; and
search means for taking the difference between the document similarity and the classification similarity and extracting the recommended document with reference to the calculated difference value.

A document search system in which a user terminal and a document search device that searches for a recommended document related to a search target document are connected via a network, wherein
the user terminal comprises:
input receiving means for receiving input of the search target document;
transmitting means for transmitting the search target document to the document search device; and
display means for receiving and displaying search results from the document search device,
and the document search device comprises:
first storage means for managing and storing feature words for each document of the recommended document;
second storage means for managing and storing feature words for each classification of the recommended document;
receiving means for receiving the search target document from the user terminal;
document similarity calculating means for calculating, with reference to the first storage means, a document similarity, which is a similarity in document units between the target document and the recommended document;
classification similarity calculating means for calculating, with reference to the second storage means, a classification similarity, which is a similarity in classification units between the target document and the recommended document;
search means for taking the difference between the document similarity and the classification similarity and extracting the recommended document with reference to the calculated difference value; and
transmitting means for transmitting the extracted recommended document to the user terminal.