JP2010224623A

JP2010224623A - Method and program for recommending related article

Info

Publication number: JP2010224623A
Application number: JP2009068146A
Authority: JP
Inventors: Tomoyasu Okada; 智靖岡田
Original assignee: Nomura Research Institute Ltd
Current assignee: Nomura Research Institute Ltd
Priority date: 2009-03-19
Filing date: 2009-03-19
Publication date: 2010-10-07

Abstract

PROBLEM TO BE SOLVED: To provide a related article recommendation method that prevents substantially overlapping content and accurately recommends, as articles related to an article in which a user is interested, related articles matched to the tastes and interests of each user.

SOLUTION: The related article recommendation method includes: a first step of calculating the weighting value of each feature word in an original article 401 to obtain feature word data; a second step of calculating the weighting value of each feature word in each subscribed article 204 to obtain feature word data; a third step of calculating the similarity between the feature word data of the original article 401 and the feature word data of each subscribed article 204; a fourth step of classifying related articles 402 from the subscribed articles 204 based on the similarity; a fifth step of calculating the mean of the weighting values of the feature words over the read articles 207 to obtain feature word data indicating the user's tastes; a sixth step of calculating the similarity between the feature word data indicating the user's tastes and the feature word data of each related article 402; and a seventh step of preferentially presenting to the user the related articles 402 with the highest similarity.

COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to a text information filtering technique, and more particularly to a technique that is effective when applied to a related article recommendation method and a related article recommendation program that automatically recommend other articles whose content is related to that of a specific article.

In recent years, with the spread of the Internet and the like, a great variety of information has become obtainable using a computer. For example, a large amount of information can be collected using various news sites and search engines, and much can also be obtained by e-mail and the like. Nor is this limited to the Internet: a large amount of information can be obtained from, for example, various in-house materials stored electronically on an in-house server. To make effective use of this large amount of electronic information, an information filtering mechanism is needed that not only lets users accurately find information matching their interests, but also supports the broadening of those interests by letting users discover other information that is related to what they are viewing and that they would not have noticed on their own.

In response to such demands, several techniques have been proposed that, for information such as news articles of interest to the user, present similar articles as related articles by determining the similarity of article content using natural language processing or the like. In such techniques, to keep the user's reading of related articles efficient, articles whose content is substantially identical to that of the original article being viewed must be excluded from the related articles and not presented. For this reason, articles with substantially overlapping content are identified and made recognizable to the user, for example by excluding them or by grouping them and listing only their titles.

As a technique for presenting such related articles, for example, Japanese Patent Laid-Open No. 9-101990 (Patent Document 1) discloses a technique that calculates the similarity between articles by comparing article representations using natural language processing, determines the articles to be presented to the user and their related articles according to that similarity, and, in doing so, classifies a set of articles that are highly similar to one another but come from different information sources as a set of duplicate articles.

Further, for example, Japanese Patent Laid-Open No. 2005-352857 (Patent Document 2) takes into account that classification methods such as that of Patent Document 1 may fail to identify a set of articles whose substantive content is the same when the distributions of the words appearing in them are dissimilar. It discloses a technique that, among a plurality of articles containing a keyword representing a topic or the like whose trends the user wants to follow, identifies as articles with substantially the same content those whose transmission dates and times differ by less than a pre-registered threshold and whose senders differ from one another.

Patent Document 1: Japanese Patent Laid-Open No. 9-101990
Patent Document 2: Japanese Patent Laid-Open No. 2005-352857

When a user views and uses information such as news articles of interest, the user wants, as described above, to discover unexpected related articles that match his or her own interests and preferences. With the techniques described in Patent Documents 1 and 2, however, although duplicate articles can be identified when related articles are presented, the related articles presented for any given article are the same for every user, and related articles matched to each user's preferences and interests cannot be recommended.

Accordingly, an object of the present invention is to provide a related article recommendation method and a related article recommendation program that accurately recommend, as articles related to an article in which the user is interested, related articles whose content does not substantially overlap and that match each user's preferences and interests. The above and other objects and novel features of the present invention will be apparent from the description of this specification and the accompanying drawings.

Of the inventions disclosed in this application, an outline of a representative one is briefly described as follows.

A related article recommendation method according to a representative embodiment of the present invention is a method in which a computer system presents and recommends to a user related articles whose content is related to an original article, consisting of text data, that the user is viewing. The computer system holds, for each user, a plurality of subscribed articles that the user can view and a reference history for each subscribed article, and executes: a first step of extracting one or more words from the original article as feature words based on a predetermined extraction condition and, for each extracted feature word, calculating based on a predetermined calculation condition a weighting value indicating the importance of the feature word in the original article, to obtain feature word data of the original article; a second step of extracting one or more words from each subscribed article as feature words based on the predetermined extraction condition and, for each extracted feature word, calculating based on the predetermined calculation condition the weighting value of the feature word in that subscribed article, to obtain feature word data of each subscribed article; a third step of calculating, based on a predetermined comparison condition, the similarity between the feature word data of the original article calculated in the first step and the feature word data of each subscribed article calculated in the second step; a fourth step of classifying as related articles those subscribed articles for which the similarity calculated in the third step is higher than a predetermined threshold and the difference in issue date and time from the original article is equal to or greater than a predetermined time interval, and taking the feature word data calculated in the second step for each such subscribed article as the feature word data of each related article; a fifth step of acquiring the user's read articles based on the subscribed articles and the reference history, extracting one or more words from all of the read articles as feature words based on the predetermined extraction condition, calculating for each extracted feature word the weighting value of the feature word in each read article based on the predetermined calculation condition, and calculating the average of those values over all of the read articles, to obtain feature word data representing the user's preferences; a sixth step of calculating, based on the predetermined comparison condition, the similarity between the feature word data representing the user's preferences calculated in the fifth step and the feature word data of each related article classified in the fourth step; and a seventh step of preferentially presenting to the user the related articles with the highest similarity calculated in the sixth step.

Of the inventions disclosed in this application, the effects obtained by a representative one are briefly described as follows.

According to a representative embodiment of the present invention, by using each user's history of previously viewed articles, it is possible to accurately recommend, as articles related to an article in which the user is interested, related articles whose content does not substantially overlap and that match each user's preferences and interests.

A diagram illustrating an outline of a processing flow example of the related article extraction unit when a user views an original article in an embodiment of the present invention.
A diagram showing an outline of a configuration example of an information collection management system to which the related article recommendation method according to an embodiment of the present invention is applied.
A diagram showing an outline of a table configuration example of the database in an embodiment of the present invention.
A diagram illustrating an example of the process of extracting feature words from the text data of an article and calculating feature word data in an embodiment of the present invention.
A diagram illustrating an example of the process of calculating the similarity between the feature word data of the original article and the feature word data of each subscribed article in an embodiment of the present invention.
A diagram illustrating an example of the process of classifying subscribed articles into related articles and duplicate articles in an embodiment of the present invention.
A diagram illustrating an example of the process of calculating feature word data representing user preferences from read articles in an embodiment of the present invention.
A diagram illustrating an example of the process of calculating the similarity between the feature word data representing user preferences and the feature word data of each related article in an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. In all the drawings for describing the embodiments, the same parts are in principle denoted by the same reference numerals, and repeated description thereof is omitted.

An information collection management system to which the related article recommendation method according to an embodiment of the present invention is applied collects text information such as various websites, e-mails, and electronic documents (hereinafter sometimes collectively referred to as "articles") in one place and makes it available to each user. It accumulates the action history of each user's use of and reference to that information and reuses the accumulated history to recommend information that is valuable to each user, thereby providing a mechanism for putting the accumulated information to productive use.

In this embodiment, when the user views an article of interest such as a news item, related articles matching the user's interests and preferences are extracted from the subscribed articles collected according to each user's subscription settings, and are presented and recommended. This improves the chance of discovering unexpected related articles that suit the user's preferences and helps broaden the user's interests. In addition, when related articles are presented, duplicate articles whose content substantially overlaps are excluded from the related articles, so that the user can make use of the information efficiently.

[System configuration]
FIG. 2 is a diagram showing an outline of a configuration example of an information collection management system to which the related article recommendation method according to an embodiment of the present invention is applied. The information collection management system includes an information collection management server 100 and a database 200. The database 200 may be implemented on the information collection management server 100 or on a separate device such as another database server.

The information collection management server 100 uses the new article collection unit 110 to collect text data as new articles from, for example, a website on the Web server 310 or an in-house document server 320. New articles can be collected using, for example, RSS (RDF Site Summary) feeds. Because the new article collection unit 110 operates as an RSS reader, the collection targets are not limited to articles such as news on websites: electronic data such as in-house materials stored on the document server 320, e-mails, and the like can also be collected by converting them to RSS.
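As a rough illustration of the collection step, the following sketch parses a feed with Python's standard library and returns one record per item. It assumes an RSS 2.0 document without XML namespaces; an RDF-based RSS 1.0 feed would need namespace-aware lookups. The function name `parse_rss_items`, the record fields, and the sample feed are hypothetical, since the specification does not state how the new article collection unit 110 parses feeds.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_rss_items(rss_xml: str) -> list[dict]:
    """Return one dict per <item> of an RSS 2.0 document (illustrative only)."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "content": item.findtext("description", default=""),
            # RSS 2.0 dates follow RFC 822; None when the element is absent.
            "issued_at": parsedate_to_datetime(pub) if pub else None,
        })
    return items


# Hypothetical feed used only to exercise the function.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example</title>
<item><title>First</title><link>http://example.com/1</link>
<description>Body text</description>
<pubDate>Thu, 19 Mar 2009 09:00:00 +0900</pubDate></item>
</channel></rss>"""

items = parse_rss_items(SAMPLE_FEED)
```

Each returned record carries the text and the issue date and time, which are the two pieces of information the later classification step relies on.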

The RSS feed information described above is held, for each user of the information collection management system of this embodiment, as a subscription designation 206, which is information about the websites and the like that the user wants to subscribe to, and forms part of the user subscription information 203 in the database 200. New articles collected by the new article collection unit 110 based on the subscription designation 206 are stored as articles 202 in the article group 201 of the database 200.

The user subscription information 203 holds, as subscribed articles 204, information identifying those articles 202 that were collected based on the subscription designation 206 of the user in question. In other words, the article group 201 stores all the articles 202 collected based on the subscription designations 206 of all users, and this is equivalent to the user subscription information 203 storing, for each user, the subscribed articles 204 available for that user to view. The articles 202 and the subscribed articles 204 include not only those collected through a subscription designation 206 such as an RSS feed, but also those that a user registered for later reference, for example by directly clipping text data.

The user subscription information 203 in the database 200 further holds, as a reference history 205, the history of the user actually viewing the subscribed articles 204. Based on the subscribed articles 204 and the reference history 205, the read articles that the user has viewed in the past can be identified.

On receiving, via the client terminal 400, a user's request to view an article of interest (original article 401), the information collection management server 100 acquires the original article 401 from the subscribed articles 204 and presents it on the client terminal 400 via a Web server program or the like (not shown). The information collection management server 100 further uses the related article extraction unit 120 to determine the similarity between the content (feature words) of each subscribed article 204 and the content (feature words) of the original article 401, thereby extracting related articles 402 whose content is related to the original article 401, and presents them on the client terminal 400 in the same way.

At this time, by determining the similarity between the content (feature words) of all of the user's read articles taken together and the content (feature words) of each related article 402, the related articles 402 that best match the user's interests and preferences are presented as recommendations. Duplicate articles whose content substantially overlaps are excluded so that they are not presented as related articles 402.

The new article collection unit 110 and the related article extraction unit 120 are implemented as software programs that run on the information collection management server 100, for example as applications running on a Web server (not shown). As described in detail later, the related article extraction unit 120 comprises, for example, a feature word data calculation unit 121, a related article classification unit 122, and a similarity calculation unit 123, which together realize the functions of the related article extraction unit 120 described above.

[Table configuration]
FIG. 3 is a diagram showing an outline of a table configuration example of the database 200. The tables stored in the database 200 comprise, for example, user information 210, subscription information 220, a feed list 230, reaction information 240, and article data 250. An arrow between tables in the figure, for example A→B, indicates a relationship of A:B = 1:n (A has many Bs).

The user information 210 is a table that holds information about each user and has items such as user ID, password, and user name. It is used, for example, for authentication when a user logs in to the information collection management system. The feed list 230 is a table that holds basic information about the RSS feeds used to collect various information (articles) automatically and has items such as feed ID, site name, and URL (Uniform Resource Locator). As described above, in addition to RSS feeds provided by news sites and the like on the Web server 310, various electronic documents, e-mails, and the like stored on the document server 320 can be converted to RSS so that the new article collection unit 110 can automatically crawl and collect them.

The subscription information 220 is a table that holds information on which RSS feeds each user subscribes to and has items such as user ID, feed ID, subscription start date and time, and unread count. The subscription information 220 corresponds to the subscription designation 206 in FIG. 2. Each user can subscribe to multiple RSS feeds, and each RSS feed can be subscribed to by multiple users.

The article data 250 is a table that holds the content of collected articles, such as articles included in each RSS feed, articles clipped from websites by users, and electronic documents, and has items such as article ID, feed ID, issue date and time, and article content. The article data 250 corresponds to the articles 202 in FIG. 2, and the entries of the article data 250 identified by the feed ID values in the subscription information 220 correspond to the subscribed articles 204 in FIG. 2. The issue date and time item represents the date and time at which the article 202 identified by the article ID was issued or transmitted, and the article content item is the actual text data of that article 202.

The reaction information 240 is a table that holds information (action history) on what reactions each user has made to each subscribed article 204 and has items such as user ID, article ID, reference date and time, tag, memo, and highlight range. The article ID and reference date and time items correspond to the reference history 205 in FIG. 2. Besides viewing a subscribed article 204, possible reactions include, for example, attaching a classification tag that describes its content, adding a text memo, and highlighting an arbitrary range; these are held in the tag, memo, and highlight range items, respectively. The items of each table described above are examples, and the tables may have other items.
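The relationships among these tables can be sketched with a few in-memory records. The snippet below is only an illustration under simplifying assumptions: the class and field names are hypothetical, only the items needed to derive subscribed and read articles are modeled, and plain lists stand in for database tables.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Subscription:  # stands in for subscription information 220
    user_id: str
    feed_id: str


@dataclass(frozen=True)
class Article:  # stands in for article data 250
    article_id: str
    feed_id: str
    issued_at: datetime
    content: str


@dataclass(frozen=True)
class Reaction:  # stands in for reaction information 240
    user_id: str
    article_id: str
    referenced_at: datetime


def subscribed_articles(user_id, subscriptions, articles):
    """Articles whose feed the user subscribes to (subscribed articles 204)."""
    feeds = {s.feed_id for s in subscriptions if s.user_id == user_id}
    return [a for a in articles if a.feed_id in feeds]


def read_articles(user_id, subscriptions, articles, reactions):
    """Subscribed articles that appear in the user's reference history."""
    seen = {r.article_id for r in reactions if r.user_id == user_id}
    return [a for a in subscribed_articles(user_id, subscriptions, articles)
            if a.article_id in seen]
```

In a relational database the same result would typically come from joining the subscription, article, and reaction tables on feed ID and article ID.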

[Processing flow]
FIG. 1 is a diagram illustrating an outline of a processing flow example of the related article extraction unit 120 when the user views the original article 401. When the information collection management server 100 receives, via a Web server program or the like (not shown), a user's request from the client terminal 400 to view the original article 401, the related article extraction unit 120 starts the process of extracting related articles 402.

First, the feature word data calculation unit 121 extracts one or more words as feature words from the text data of the original article 401 by natural language processing. It then calculates, for each extracted feature word, a TF-IDF (Term Frequency-Inverse Document Frequency) value, described later, and takes the result as the feature word data of the original article 401 (step S101). Similarly, the feature word data calculation unit 121 extracts one or more words as feature words from the text data of every subscribed article 204 by natural language processing, calculates a TF-IDF value for each extracted feature word, and takes the result as the feature word data of each subscribed article 204 (step S102).

Next, the similarity calculation unit 123 maps the feature word data of the original article 401 calculated in step S101 and the feature word data of each subscribed article 204 calculated in step S102 into a vector space, as described later. It then obtains the inner product of the vector for the feature word data of the original article 401 and the vector for the feature word data of each subscribed article 204, and calculates the similarity between the two based on the value of the inner product (step S103).
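A minimal sketch of this comparison follows, representing each article's feature word data as a sparse vector (a dict from feature word to weighting value). The inner product comes from the description; the length normalization in `cosine_similarity` is an assumption added here so that long and short articles are comparable, and the function names are hypothetical.

```python
import math


def inner_product(a: dict[str, float], b: dict[str, float]) -> float:
    """Inner product of two sparse feature word vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b[t] for t, w in a.items() if t in b)


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Inner product normalized by vector lengths (an assumed refinement)."""
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norm if norm else 0.0
```

Only feature words shared by both articles contribute to the inner product, so articles with no feature words in common receive a similarity of zero.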

Next, the related article classification unit 122 classifies as related articles 402 those subscribed articles 204 for which the similarity calculated in step S103 is higher than a predetermined threshold and the difference in issue date and time from the original article 401 is equal to or greater than a predetermined time interval, and takes the feature word data calculated in step S102 for each such subscribed article 204 as the feature word data of each related article 402 (step S104). At this time, subscribed articles 204 for which the similarity calculated in step S103 is higher than the predetermined threshold and the difference in issue date and time from the original article 401 is smaller than the predetermined time interval are classified as duplicate articles 403, whose content is substantially the same (step S104).
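The classification rule of step S104 can be written as a small decision function. This is a sketch only: the function name, the string labels (including "unrelated" for articles at or below the threshold), and the parameter names are hypothetical, and the specification does not give concrete values for the threshold or the time interval.

```python
from datetime import datetime, timedelta


def classify(similarity: float, issued_at: datetime,
             original_issued_at: datetime,
             threshold: float, min_gap: timedelta) -> str:
    """Label a subscribed article relative to the original article."""
    if similarity <= threshold:
        return "unrelated"  # not similar enough to be either category
    gap = abs(issued_at - original_issued_at)
    # Similar and far apart in time: related. Similar and close: duplicate.
    return "related" if gap >= min_gap else "duplicate"
```

For example, with a threshold of 0.5 and an interval of one day, a highly similar article issued two hours after the original would be labeled a duplicate, while one issued three days later would be labeled related.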

Next, the related article extraction unit 120 acquires the user's read articles 207 based on the user's subscribed articles 204 and reference history 205. The feature word data calculation unit 121 then extracts one or more words as feature words from the text data of all the read articles 207 by natural language processing, calculates a TF-IDF value for each extracted feature word, and, for each feature word, calculates the average of its TF-IDF values over all the read articles 207, taking the result as the feature word data representing the user's preferences (step S105).
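The averaging in step S105 might look like the following sketch. It assumes that each read article already has a sparse vector of weighting values and that a feature word absent from an article counts as zero, so every average is taken over the total number of read articles; the specification leaves that detail open here, and the function name is hypothetical.

```python
from collections import defaultdict


def preference_profile(read_vectors: list[dict[str, float]]) -> dict[str, float]:
    """Average each feature word's weight over all read articles."""
    if not read_vectors:
        return {}
    totals: dict[str, float] = defaultdict(float)
    for vec in read_vectors:
        for term, weight in vec.items():
            totals[term] += weight
    n = len(read_vectors)
    # Absent words contribute zero, so the divisor is always n (assumption).
    return {term: total / n for term, total in totals.items()}
```

Under this assumption, a word that is heavily weighted in only one of many read articles ends up with a small average, while words recurring across the user's reading history dominate the profile.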

Next, as in step S103, the similarity calculation unit 123 maps the feature word data representing the user's preferences calculated in step S105 and the feature word data of each related article 402 classified in step S104 into a vector space. It then obtains the inner product of the vector for the feature word data representing the user's preferences and the vector for the feature word data of each related article 402, and calculates the similarity between the two based on the value of the inner product (step S106). Finally, the related articles 402 are sorted in descending order of the similarity calculated in step S106 and presented to the user, and the process ends (step S107).
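Steps S106 and S107 amount to scoring each related article against the preference vector and sorting. The sketch below uses the raw inner product as the score, which is one reading of "based on the value of the inner product"; the function names and the dict-based inputs are hypothetical.

```python
def dot(a: dict[str, float], b: dict[str, float]) -> float:
    """Inner product of two sparse feature word vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def rank_related(profile: dict[str, float],
                 related: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Return (article_id, score) pairs, highest preference match first."""
    scored = [(article_id, dot(profile, vec)) for article_id, vec in related.items()]
    # sorted() is stable, so ties keep their original relative order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because the set of related articles is fixed by step S104, this step only reorders them for the individual user and does not add or remove candidates.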

[Calculation of article feature word data]
FIG. 4 is a diagram illustrating an example of the processing by which the feature word data calculation unit 121 extracts feature words from the text data of an article and calculates feature word data. This processing is performed in the calculation of the feature word data of the original article 401 (step S101), the calculation of the feature word data of each subscribed article 204 (step S102), and the calculation of the feature word data representing user preferences (step S105), all described above.

First, compound nouns are extracted from the text data of the target article by morphological analysis. Morphological analysis is a standard part of natural language processing, and a variety of morphological analysis engines and software packages are available, any of which may be used. In the calculation of feature words representing user preferences (step S105), compound nouns are extracted from all the read articles 207 and merged to form the feature words, as described later.

Next, for each extracted compound noun, a TF-IDF value is calculated as a weighting value indicating the importance of the word in the article. Extracting the characteristic words of a text (the words regarded as important) by means of TF-IDF values is common practice. The related article recommendation method of this embodiment also uses this technique to calculate feature word data, but the calculation is not limited to it: any technique that can assign a numerical value (weighting value) to each word may be used.

Here, TF (Term Frequency) is the frequency with which a word (compound noun) appears in an article; the larger this value, the better the word is considered to represent the characteristics of the article. The TF value of a word t in an article D is expressed, for example, by the following equation (Equation 1), in which the frequency f of word t in article D is normalized using the number m of distinct words in article D and a logarithm.

Even a word with a large TF value is often a general word, not characteristic of any particular article, if it appears frequently in many articles. IDF (Inverse Document Frequency) is the reciprocal of the number of articles in which the target word appears; the larger this value, the fewer the articles in which the word appears, and the better the word is considered to represent the characteristics of a particular article. The IDF value of a word t is expressed, for example, by the following equation (Equation 2), in which the number Df of documents containing word t among all articles is normalized by the total number of articles N.

A word t for which both TF and IDF are large is considered to truly represent the characteristics of document D. The TF-IDF value of word t is therefore expressed by the following equation (Equation 3), which converts the product of TF and IDF to an integer.

This TF-IDF value is calculated for each feature word (compound noun) of the target article. Different articles contain different feature words, and even the same feature word appears with a different frequency in each article, so the TF-IDF values differ from article to article. In Equation 2, accuracy is expected to improve as more articles are covered, so the total number of articles N is taken to be the total number of all articles 202, including those subscribed to by other users. The number of documents Df in which word t appears can be obtained, for example, by a full-text search of all articles 202 using word t as the key.

In this embodiment, all compound nouns extracted from an article are treated as feature words so that none are missed; this ensures that the content of even a short article is reflected in its feature word data and that similarities can be compared properly. However, feature words with small TF-IDF values may be excluded based on a predetermined condition. Various formulas have been proposed for calculating TF and IDF values, and any suitable formula may be used in place of those above, depending on the required accuracy, processing time, and so on.
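The feature word data calculation described above can be sketched in Python as follows. The exact forms of Equations 1 to 3 are not reproduced in this text, so the formulas here are assumptions chosen only to match the prose (a log-normalized TF using the number of distinct words m, an IDF normalized by the total number of articles N, and an integer-valued product). The function names, the `scale` factor, and the pre-extracted list of compound nouns are likewise hypothetical.

```python
import math
from collections import Counter


def tf(freq, num_types):
    # Assumed form: log-scaled frequency normalized by the number of
    # distinct words m in the article (not the patent's exact Equation 1).
    return math.log2(freq + 1) / math.log2(num_types + 1)


def idf(total_docs, doc_freq):
    # Assumed form: log of N / Df, plus 1 so that a word appearing in
    # every article still gets a non-zero weight (not the exact Equation 2).
    return math.log2(total_docs / doc_freq) + 1


def feature_word_data(compound_nouns, doc_freq, total_docs, scale=100):
    """Return {feature word: integer TF-IDF value} for one article.

    compound_nouns: compound nouns already extracted by a morphological
                    analyzer (the extraction itself is outside this sketch).
    doc_freq:       {word: number of articles containing the word (Df)}.
    total_docs:     total number of articles N.
    scale:          hypothetical factor applied before rounding to an integer.
    """
    counts = Counter(compound_nouns)
    m = len(counts)
    return {
        word: round(scale * tf(f, m) * idf(total_docs, doc_freq[word]))
        for word, f in counts.items()
    }


nouns = ["stock market", "stock market", "interest rate"]
data = feature_word_data(nouns, {"stock market": 2, "interest rate": 8}, total_docs=8)
print(data)
```

In this toy input, "stock market" is both frequent in the article and rare across the collection, so it receives a much larger weight than "interest rate", which appears in every article.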

[Calculation of similarity between original article and subscribed articles]
FIG. 5 is a diagram illustrating an example of the processing (step S103) that calculates the similarity between the feature word data of the original article 401 calculated in step S101 and the feature word data of each subscribed article 204 calculated in step S102. To calculate the similarity, the similarity calculation unit 123 maps into a vector space the feature word data (TF-IDF values) of the original article 401 and of each subscribed article 204, which were calculated in steps S101 and S102 by the processing in the feature word data calculation unit 121 described above. It then applies the vector space method, in which the similarity is calculated from the inner product of the feature word data vector of the original article 401 and the feature word data vector of each subscribed article 204.

If merging all the feature words contained in the feature word data of the original article 401 and of each subscribed article 204 yields n feature words, then the vector V(d_o) of the feature word data d_o of the original article 401 and the vectors V(d_1), V(d_2), ... of the feature word data d_1, d_2, ... of the subscribed articles 204 are each expressed as n-dimensional vectors whose elements are the TF-IDF values of the n feature words. For feature word data that does not contain a given feature word, the vector element corresponding to that feature word is 0.
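The mapping into an n-dimensional vector space can be illustrated with a short Python sketch. The function name `to_vectors` and the dictionary representation of feature word data are assumptions made for illustration; the source does not prescribe a data structure.

```python
def to_vectors(feature_data):
    """Map feature word data onto a shared n-dimensional vector space.

    feature_data: list of {feature word: TF-IDF value} dictionaries, e.g. the
                  original article followed by each subscribed article.
    Returns the merged vocabulary (n feature words) and one vector per input,
    with 0 for any feature word that the input does not contain.
    """
    vocabulary = sorted(set().union(*feature_data))
    vectors = [[data.get(word, 0) for word in vocabulary] for data in feature_data]
    return vocabulary, vectors


original = {"interest rate": 120, "central bank": 80}
subscribed_1 = {"central bank": 60, "inflation": 40}
vocabulary, vectors = to_vectors([original, subscribed_1])
print(vocabulary)
print(vectors)
```

Sorting the merged vocabulary is only a convenience that fixes the order of the vector elements; any consistent order would serve.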

Here, the direction of a vector is considered to represent the characteristics of the corresponding article. The similarity between the feature word data of the original article 401 and that of each subscribed article 204 can therefore be expressed by how small the angle is between the vector V(d_o) and each of the vectors V(d_1), V(d_2), .... That is, if θ is the angle between two vectors, the closer cos θ is to 1, the higher the similarity. cos θ is obtained by dividing the inner product of the two vectors by the product of their magnitudes. The similarity sim(d_a, d_b) of feature word data d_a and d_b is therefore expressed in terms of the two vectors V(d_a) and V(d_b) by the following equation (Equation 4).

$$\mathrm{sim}(d_a, d_b) = \cos\theta = \frac{V(d_a) \cdot V(d_b)}{|V(d_a)|\,|V(d_b)|}$$

Using Equation 4, the similarity between the feature word data of the original article 401 and the feature word data of each subscribed article 204 is calculated; that is, sim(d_o, d_1), sim(d_o, d_2), ... are each calculated. Once the similarity to the feature word data of the original article 401 has been calculated for all subscribed articles 204, the subscribed articles 204 are sorted in descending order of similarity. The example in FIG. 5 shows the articles sorted into the order "subscribed article 3", "subscribed article 1", "subscribed article 2". In this embodiment, the similarity between the original article 401 and each subscribed article 204 is calculated using the vector space method, but the similarity may also be calculated by other methods.
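A minimal Python sketch of the similarity calculation and sorting in step S103 is shown below. It computes the cosine of the angle between two feature word vectors directly on sparse dictionaries, which is equivalent to using n-dimensional vectors with zeros for missing feature words. The function names and sample values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two {feature word: TF-IDF value} dicts."""
    dot = sum(weight * b.get(word, 0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an article with no feature words is similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(original, subscribed):
    """Return (article name, similarity) pairs in descending similarity order."""
    scored = [(name, cosine_similarity(original, data)) for name, data in subscribed.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


original = {"a": 3, "b": 4}
subscribed = {
    "subscribed article 1": {"a": 3, "b": 4},
    "subscribed article 2": {"c": 5},
    "subscribed article 3": {"a": 3},
}
for name, similarity in rank_by_similarity(original, subscribed):
    print(f"{name}: {similarity:.2f}")
```

Because TF-IDF values are non-negative, the similarity always falls between 0 and 1, which is what makes a fixed threshold meaningful in the classification step.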

[Classification of related articles and duplicate articles]
FIG. 6 is a diagram illustrating an example of the processing (step S104) that classifies the subscribed articles 204 into related articles 402 and duplicate articles 403, based on the similarity calculated in step S103 between the feature word data of the original article 401 and the feature word data of each subscribed article 204.

First, from the subscribed articles 204 sorted in descending order of similarity in step S103, the related article classification unit 122 extracts only those whose similarity is higher than a predetermined threshold. Next, for each extracted subscribed article 204, it determines whether the difference in issue date and time from the original article 401 is equal to or greater than a predetermined time interval. Subscribed articles 204 for which the difference is equal to or greater than the predetermined time interval are classified as related articles 402, and those for which it is smaller are classified as duplicate articles 403. As also described in Patent Document 2 and elsewhere, articles that are similar in content and close in issue date and time are likely to have substantially the same content. These articles are therefore classified as duplicate articles 403 and excluded from the related articles 402.

The predetermined threshold and time interval are preferably held in the information collection management server 100 by suitable means, such as the database 200 or a definition file, so that the settings can be changed as circumstances require. In this embodiment, the predetermined threshold is set to 0.2, as shown in FIG. 6. It has been confirmed experimentally that, with the similarity calculation by the vector space method in step S103 described above, a threshold of 0.2 allows the related articles 402 to be extracted with accuracy sufficient for practical use.
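The classification in step S104 can be sketched in Python as follows. The threshold of 0.2 comes from this embodiment; the 24-hour interval, the function name, and the tuple layout of the input are assumptions for illustration only, since the source leaves the time interval as a configurable setting.

```python
from datetime import datetime, timedelta


def classify(scored_articles, original_issued, threshold=0.2,
             min_interval=timedelta(hours=24)):
    """Split subscribed articles into related and duplicate articles.

    scored_articles: iterable of (name, similarity, issued datetime).
    threshold:       similarity must be strictly higher than this value.
    min_interval:    hypothetical setting; articles issued closer to the
                     original than this are treated as duplicates.
    """
    related, duplicates = [], []
    for name, similarity, issued in scored_articles:
        if similarity <= threshold:
            continue  # not similar enough to be either related or duplicate
        if abs(issued - original_issued) >= min_interval:
            related.append(name)
        else:
            duplicates.append(name)
    return related, duplicates


original_issued = datetime(2009, 3, 19, 9, 0)
articles = [
    ("same-day wire copy", 0.9, datetime(2009, 3, 19, 10, 0)),
    ("follow-up analysis", 0.5, datetime(2009, 3, 22, 9, 0)),
    ("unrelated story", 0.1, datetime(2009, 3, 25, 9, 0)),
]
print(classify(articles, original_issued))
```

Keeping `threshold` and `min_interval` as parameters mirrors the recommendation above that both values be adjustable rather than fixed in code.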

[Calculation of feature word data representing user preferences]
FIG. 7 is a diagram illustrating an example of the processing (step S105) that calculates feature word data representing user preferences from the read articles 207. First, the read articles 207 of the target user are acquired based on the target user's subscribed articles 204 and reference history 205, and feature word data is calculated for each of the read articles 207 by the processing in the feature word data calculation unit 121 described above.

Next, the average of the feature word data (TF-IDF values) of each feature word over all the read articles 207 is calculated. The feature words here are the result of merging the feature words extracted from all the read articles 207. For a read article 207 that does not contain a given feature word, the TF-IDF value of that feature word in that read article 207 is 0. In this embodiment, the average TF-IDF value is obtained simply by dividing the sum of the TF-IDF values of each feature word over all the read articles 207 by the number of read articles 207. The averaging method is not limited to this, however; for example, a weighted average may be calculated according to a predetermined condition.

The average TF-IDF values calculated here are data representing the characteristics of the content of the read articles 207 as a whole, that is, the preferences of the target user, and are used as the feature word data representing user preferences. Holding a reference history 205 for each user and making use of it in this way yields information that represents the user's preferences and interests.
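The simple averaging in step S105 can be sketched in Python as follows. The function name and the dictionary representation are assumptions; the sketch implements the plain arithmetic mean described above, not the weighted variant.

```python
def user_preference(read_articles):
    """Average the TF-IDF value of each feature word over all read articles.

    read_articles: list of {feature word: TF-IDF value}, one per read article.
    A feature word missing from an article contributes 0 for that article,
    so each sum is divided by the total number of read articles.
    """
    if not read_articles:
        return {}
    totals = {}
    for data in read_articles:
        for word, weight in data.items():
            totals[word] = totals.get(word, 0) + weight
    count = len(read_articles)
    return {word: total / count for word, total in totals.items()}


read = [
    {"interest rate": 30, "central bank": 60},
    {"interest rate": 90},
    {"inflation": 30},
]
print(user_preference(read))
```

Feature words that recur across many read articles keep a high average, while a word that appears in only one article is diluted by the zeros from the others.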

[Calculation of similarity between user preferences and related articles]
FIG. 8 is a diagram illustrating an example of the processing (step S106) that calculates the similarity between the feature word data representing user preferences calculated in step S105 and the feature word data of each related article 402 classified in step S104. The similarity is calculated in the same way as in the similarity calculation processing of step S103 shown in FIG. 5.

That is, the similarity calculation unit 123 maps into a vector space the feature word data representing user preferences calculated in step S105 and the feature word data of each related article 402 classified in step S104 (the feature word data calculated in step S102 for the corresponding subscribed article 204). It then applies the vector space method, in which the similarity is calculated from the inner product of the vector of the feature word data representing user preferences and the vector of the feature word data of each related article 402.

Once the similarity between the feature word data representing user preferences and the feature word data of each related article 402 has been calculated for all related articles 402, by the same procedure as in step S103, the related articles 402 are sorted in descending order of similarity. The example in FIG. 8 shows the articles sorted into the order "related article 3", "related article 1", "related article 2". Presenting the related articles 402 to the user in this order makes it possible to recommend related articles 402 that match the user's interests and preferences.

In this embodiment, related articles 402 that match the user's preferences are presented preferentially, and thereby recommended, by displaying the related articles 402 sorted in descending order of similarity. The means of preferential presentation is not limited to this, however; various methods can be used, such as highlighting by changing the text color or font. Also, although duplicate articles 403 are excluded from the related articles 402 in this embodiment, they may be made recognizable to the user when the related articles 402 are presented, for example by grouping the duplicate articles 403 together and listing only their titles.
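The final presentation (steps S106 and S107) can be sketched in Python as follows: related articles are sorted by their similarity to the user-preference feature word data, and duplicate articles are listed by title only. The function names, the output format, and the sample data are hypothetical, and plain text lines stand in for whatever display the client terminal actually uses.

```python
import math


def cosine(a, b):
    """Cosine similarity of two {feature word: weight} dictionaries."""
    dot = sum(weight * b.get(word, 0) for word, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def present(preference, related, duplicate_titles):
    """Build display lines: ranked related articles, then grouped duplicates.

    preference:       user-preference feature word data from step S105.
    related:          {title: feature word data} of the related articles.
    duplicate_titles: titles of duplicate articles, shown as one grouped line.
    """
    ranked = sorted(related, key=lambda title: cosine(preference, related[title]), reverse=True)
    lines = [f"{rank}. {title}" for rank, title in enumerate(ranked, start=1)]
    if duplicate_titles:
        lines.append("Duplicates of the original article: " + "; ".join(duplicate_titles))
    return lines


preference = {"rates": 40, "bonds": 20}
related = {
    "A": {"stocks": 50},
    "B": {"rates": 10, "bonds": 5},
    "C": {"rates": 10, "stocks": 10},
}
for line in present(preference, related, ["D"]):
    print(line)
```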

As described above, the related article recommendation method of this embodiment uses the reference history 205, held for each user, of the subscribed articles 204 that the user has referred to in the past. This makes it possible to recommend, for an original article that the user is interested in and is referring to, related articles 402 that match the preferences and interests of each individual user. This in turn increases the likelihood that the user will discover unexpected related articles 402 that match the user's preferences, and helps broaden the user's interests.

In addition, excluding duplicate articles 403, whose content substantially overlaps, from the related articles 402 when they are presented allows the user to make use of information efficiently. Furthermore, the more read articles 207 (reference history 205) are accumulated for each user through continued use, the more accurately related articles 402 matching the user's preferences are recommended, and the more effectively the user can make use of information.

The invention made by the present inventor has been described above in concrete terms based on an embodiment. It goes without saying, however, that the present invention is not limited to that embodiment and can be modified in various ways without departing from its gist.

The present invention can be used in a related article recommendation method and a related article recommendation program that automatically recommend other articles whose content is related to the content of a specific article.

DESCRIPTION OF SYMBOLS
100: Information collection management server, 110: New article collection unit, 120: Related article extraction unit, 121: Feature word data calculation unit, 122: Related article classification unit, 123: Similarity calculation unit,
200: Database, 201: Article group, 202: Article, 203: User subscription information, 204: Subscribed article, 205: Reference history, 206: Subscription designation, 207: Read article, 210: User information, 220: Subscription information, 230: Feed list, 240: Reaction information, 250: Article data,
310: Web server, 320: Document server,
400: Client terminal, 401: Original article, 402: Related article, 403: Duplicate article.

Claims

A related article recommendation method by which a computer system recommends to a user a related article whose content is related to an original article consisting of text data that the user is referring to,
The computer system holds, for each user, a plurality of subscribed articles that are referred to by the user, and a reference history for each subscribed article,
One or more words are extracted as feature words from the original article based on a predetermined extraction condition, and the importance level of the feature word in the original article is indicated based on a predetermined calculation condition for each extracted feature word A first step of calculating a weight value to be feature word data of the original article;
One or more words are extracted as feature words from each subscribed article based on the predetermined extraction condition, and for each extracted feature word, the feature word in each subscribed article is extracted based on the predetermined calculation condition. A second step of calculating the weighting value to be feature word data of each subscribed article;
A third step of calculating, based on a predetermined comparison condition, a similarity between the feature word data of the original article calculated in the first step and the feature word data of each subscribed article calculated in the second step;
The subscribed articles in which the similarity calculated in the third step is higher than a predetermined threshold and the difference in issue date and time from the original article is equal to or greater than a predetermined time interval are classified as the related articles, and in the second step A fourth step in which the feature word data of each subscribed article to be calculated is the feature word data of each related article;
The user's read articles are acquired based on the respective subscribed articles and the reference history, and one or more words are extracted as feature words from all the read articles based on the predetermined extraction condition, and extracted. For each of the feature words, the weight value of the feature word in each read article is calculated based on the predetermined calculation condition, and the average value of all the read articles is calculated to obtain the user's preference. A fifth step for representing feature word data;
A sixth step of calculating, based on the predetermined comparison condition, a similarity between the feature word data representing the user's preference calculated in the fifth step and the feature word data of each related article classified in the fourth step; and
A seventh step of preferentially presenting to the user the related articles for which the similarity calculated in the sixth step is higher.

The related article recommendation method according to claim 1, wherein
in the fourth step, the subscribed articles for which the similarity calculated in the third step is higher than the predetermined threshold and the difference in issue date and time from the original article is smaller than the predetermined time interval are classified as duplicate articles whose content substantially overlaps that of the original article, and
in the seventh step, the duplicate articles are presented to the user in a manner that allows the user to recognize them as duplicates.

The related article recommendation method according to claim 1 or 2, wherein
the weighting value is a TF-IDF value calculated from a TF value, calculated for the feature word with respect to the article containing the feature word, and an IDF value, calculated with respect to the subscribed articles of all users.

The related article recommendation method according to any one of claims 1 to 3, wherein
the predetermined comparison condition in the third step and the sixth step is that each set of feature word data for which the similarity is to be calculated is converted into a vector in a vector space, and the similarity is calculated based on the angle formed by the two vectors.

A related article recommendation program for causing a computer system to present and recommend to a user a related article whose content is related to an original article consisting of text data that the user is referring to,
The computer system holds, for each user, a plurality of subscribed articles that are referred to by the user, and a reference history for each subscribed article,
The related article recommendation program causes the computer system to execute: an eighth step of extracting one or more words as feature words from the original article based on a predetermined extraction condition, calculating, for each extracted feature word, a weighting value indicating the importance of the feature word in the original article based on a predetermined calculation condition, and using the result as feature word data of the original article;
One or more words are extracted as feature words from each subscribed article based on the predetermined extraction condition, and for each extracted feature word, the feature word in each subscribed article is extracted based on the predetermined calculation condition. A ninth step of calculating the weighting value to be feature word data of each subscribed article;
A tenth step of calculating, based on a predetermined comparison condition, a similarity between the feature word data of the original article calculated in the eighth step and the feature word data of each subscribed article calculated in the ninth step;
An eleventh step of classifying as the related articles the subscribed articles for which the similarity calculated in the tenth step is higher than a predetermined threshold and the difference in issue date and time from the original article is equal to or greater than a predetermined time interval, and using the feature word data calculated in the ninth step for each of those subscribed articles as the feature word data of each related article;
The user's read articles are acquired based on the respective subscribed articles and the reference history, and one or more words are extracted as feature words from all the read articles based on the predetermined extraction condition, and extracted. For each of the feature words, the weight value of the feature word in each read article is calculated based on the predetermined calculation condition, and the average value of all the read articles is calculated to obtain the user's preference. A twelfth step for representing feature word data;
A thirteenth step of calculating, based on the predetermined comparison condition, a similarity between the feature word data representing the user's preference calculated in the twelfth step and the feature word data of each related article classified in the eleventh step; and
A fourteenth step of preferentially presenting to the user the related articles for which the similarity calculated in the thirteenth step is higher.

The related article recommendation program according to claim 5, wherein
in the eleventh step, the subscribed articles for which the similarity calculated in the tenth step is higher than the predetermined threshold and the difference in issue date and time from the original article is smaller than the predetermined time interval are classified as duplicate articles whose content substantially overlaps that of the original article, and
in the fourteenth step, the duplicate articles are presented to the user in a manner that allows the user to recognize them as duplicates.

The related article recommendation program according to claim 5 or 6, wherein
the weighting value is a TF-IDF value calculated from a TF value, calculated for the feature word with respect to the article containing the feature word, and an IDF value, calculated with respect to the subscribed articles of all users.

The related article recommendation program according to any one of claims 5 to 7, wherein
the predetermined comparison condition in the tenth step and the thirteenth step is that each set of feature word data for which the similarity is to be calculated is converted into a vector in a vector space, and the similarity is calculated based on the angle formed by the two vectors.