JP7042720B2

JP7042720B2 - Information processing equipment, information processing methods, and programs

Info

Publication number: JP7042720B2
Application number: JP2018169495A
Authority: JP
Inventors: 俊平大倉
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2018-09-11
Filing date: 2018-09-11
Publication date: 2022-03-28
Anticipated expiration: 2038-09-11
Also published as: JP2020042545A

Description

The present invention relates to an information processing apparatus, an information processing method, and a program.

When content such as a news article is provided to a user via the Internet or the like, content related to that content may be provided along with it. In this connection, a technique is known that estimates the importance of content and preferentially displays content whose estimated importance is high (see, for example, Patent Document 1).

Japanese Unexamined Patent Publication No. 2017-59057

However, with the conventional technique, it was sometimes not possible to provide the user, together with target content that the user has viewed or may view in the future, with other content that is related to the target content in time series.

The present invention has been made in view of the above problem, and an object thereof is to provide an information processing apparatus, an information processing method, and a program capable of providing a user with content that is related in time series to target content.

One aspect of the present invention is an information processing apparatus including: an extraction unit that extracts keywords from a predetermined number of top-ranked contents having a high degree of similarity to a certain content of interest among a plurality of contents; and a selection unit that selects, from among the plurality of contents, one or more related contents that are related in time series to the content of interest, based on the keywords extracted by the extraction unit.

According to one aspect of the present invention, content that is related in time series to target content can be provided to a user.

FIG. 1 is a diagram showing an example of the information processing system 1 including the information processing apparatus 100 of the embodiment.

FIG. 2 is a diagram showing an example of content displayed on the screen of the first terminal device 10.

FIG. 3 is a diagram showing an example of content displayed on the screen of the first terminal device 10.

FIG. 4 is a diagram showing an example of the configuration of the information processing apparatus 100 in the embodiment.

FIG. 5 is a diagram showing an example of the content data 132.

FIG. 6 is a diagram showing an example of the user log data 134.

FIG. 7 is a flowchart showing the flow of a series of processes of the control unit 110 in the embodiment.

FIG. 8 is a diagram for explaining the related content selection process.

FIGS. 9 to 15 are diagrams for explaining the method of grouping related content.

FIG. 16 is a diagram for explaining the method of selecting representative related content.

FIG. 17 is a diagram showing an example of the hardware configuration of the information processing apparatus 100 of the embodiment.

Hereinafter, an information processing apparatus, an information processing method, and a program to which the present invention is applied will be described with reference to the drawings.

[Overview]
The information processing apparatus is realized by one or more processors. The information processing apparatus extracts keywords from a predetermined number of top-ranked contents having a high degree of similarity to one particular content (hereinafter referred to as the content of interest) among a plurality of contents, and, based on the extracted keywords, selects from among the plurality of contents one or more related contents that are related in time series to the content of interest. This makes it possible to provide the user with content that is related in time series to the target content.

The content in the present embodiment is, for example, an article posted on a blog, a website, or the like, and is content that includes text. The article may, for example, deal with constantly changing social events (current affairs) such as news, politics, the economy, and sports. For such articles, there is a first report on a given event, and when the event subsequently changes over time, articles conveying that change may be provided as follow-up reports. In the following description, as an example, the content is assumed to be a document such as a news article.

[Overall configuration]
FIG. 1 is a diagram showing an example of the information processing system 1 including the information processing apparatus 100 of the embodiment. The information processing system 1 in the embodiment includes, for example, one or more first terminal devices 10, one or more second terminal devices 20, and the information processing apparatus 100. These devices are connected to each other, for example, via a network NW.

Each device shown in FIG. 1 transmits and receives various information via the network NW. The network NW includes, for example, the Internet, a WAN (Wide Area Network), a LAN (Local Area Network), a provider terminal, a wireless communication network, a wireless base station, a dedicated line, and the like. Not every combination of the devices shown in FIG. 1 needs to be able to communicate with each other, and the network NW may partly include a local network.

The first terminal device 10 is a terminal device that includes an input device, a display device, a communication device, a storage device, and an arithmetic device, for example a mobile phone such as a smartphone, a tablet terminal, or any of various personal computers. The communication device includes a network card such as a NIC (Network Interface Card), a wireless communication module, and the like. On the first terminal device 10, a UA (User Agent) such as a web browser or an application program is started, and a request corresponding to the user's input is transmitted to the information processing apparatus 100. The first terminal device 10 on which the UA has been started also causes the display device to display various images based on information acquired from the information processing apparatus 100.

The second terminal device 20 is a terminal device that includes an input device, a display device, a communication device, a storage device, and an arithmetic device, for example a mobile phone such as a smartphone, a tablet terminal, or any of various personal computers. The communication device includes a network card such as a NIC, a wireless communication module, and the like. For example, an employee of a company or business operator such as a mass media organization, or an individual such as a journalist or blogger, operates the second terminal device 20 to upload content such as a news article to the information processing apparatus 100.

The information processing apparatus 100 may be, for example, a web server that provides a web page to the first terminal device 10 in response to a request from a web browser (for example, an HTTP (Hypertext Transfer Protocol) request). The web page includes content such as a news article. The information processing apparatus 100 may also be an application server that provides content to the first terminal device 10 in response to a request from an application program.

FIGS. 2 and 3 are diagrams showing an example of content displayed on the screen of the first terminal device 10. The example of FIG. 2 shows a web page on which contents are posted in list form. Such a web page displays a representative image included in each content, a snippet (summary) of each content, a hyperlink LK to another web page carrying the detailed information of each content, and the like. For example, when the hyperlink LK_1 of the topmost content CT_1 is selected on the web page illustrated in FIG. 2, the screen transitions to the web page illustrated in FIG. 3. This web page displays, for example, the content CT_1 selected by the user together with the images, snippets, hyperlinks LK, and the like of other contents that are related in time series to the content CT_1. As a result, when the user views a certain content CT_1, for example, other contents CT_X, CT_Y, and CT_Z that dealt with the topic handled by the content CT_1 at a past time before the viewing time or at a future time after the viewing time can be provided to that user. The details of this method of providing content are described below.

[Configuration of the information processing apparatus]
FIG. 4 is a diagram showing an example of the configuration of the information processing apparatus 100 in the embodiment. As illustrated, the information processing apparatus 100 includes, for example, a communication unit 102, a control unit 110, and a storage unit 130.

The communication unit 102 includes, for example, a communication interface such as a NIC. The communication unit 102 communicates with the first terminal device 10, the second terminal device 20, and the like via the network NW. For example, the communication unit 102 may communicate with the first terminal device 10 and receive an HTTP request or the like. The communication unit 102 may also communicate with the second terminal device 20 and receive content from the second terminal device 20. Upon receiving content, the communication unit 102 stores the received content in the storage unit 130 as the content data 132 described later.

The control unit 110 includes, for example, a preprocessing unit 112, an extraction unit 114, a selection unit 116, a classification unit 118, and a providing unit 120. These components are realized, for example, by a processor (or processor circuit) such as a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit) executing a program (software) stored in the storage unit 130. Some or all of the components of the control unit 110 may be realized by hardware (a circuit unit, including circuitry) such as an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), or an FPGA (Field-Programmable Gate Array), or may be realized by cooperation of software and hardware. The program referred to by the processor may be stored in the storage unit 130 in advance, or may be stored in a removable storage medium such as a DVD or a CD-ROM and installed in the storage unit 130 from the storage medium when the storage medium is mounted in a drive device of the information processing apparatus 100.

The storage unit 130 is realized by, for example, a storage device such as an HDD (Hard Disc Drive), a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), a ROM (Read Only Memory), or a RAM (Random Access Memory). In addition to various programs such as firmware and application programs, the storage unit 130 stores the content data 132, the user log data 134, the keyword dictionary 136, and the like.

The content data 132 is, for example, data that includes a plurality of articles (for example, tens of thousands to hundreds of thousands of articles) as contents. FIG. 5 is a diagram showing an example of the content data 132. When the content is a document such as a news article, the content data 132 may be, as in the illustrated example, data in which an article ID identifying each article is associated with the submission time of that article, an article vector (content vector) obtained by vectorizing that article, and the like.

The submission time may be, for example, the time at which the article was transmitted from the second terminal device 20 to the information processing apparatus 100, or the time at which the article was received by the information processing apparatus 100.

The article vector is one kind of content vector for the case where the content is an article, and is a vector generated from the article by using a technique called distributed representation, such as word2vec or GloVe. Distributed representation is a technique for vectorizing a word or phrase based on its co-occurrence with the words or phrases that appear before and after it. For example, based on a corpus prepared in advance that contains a plurality of words and phrases, the technique obtains the probability of occurrence of the words that appear before and after a single word of interest in a given context, and generates a multidimensional vector whose element values are those probabilities. Specifically, when the article vector is denoted by V, it can be expressed as V = [e1, e2, e3, ...]. The process of generating the article vector from the article may be performed by the preprocessing unit 112, or by a device other than the information processing apparatus 100.

The user log data 134 is data that includes the action histories of a plurality of users. FIG. 6 is a diagram showing an example of the user log data 134. As in the illustrated example, the user log data 134 is data in which a user ID identifying each user is associated, as an action history, with the URL (Uniform Resource Locator) of the web page carrying an article viewed by that user, the title of that article, and the like.

The keyword dictionary 136 is a dictionary used when a process of extracting feature words from content (hereinafter referred to as the feature word extraction process) is performed using morphological analysis, chunking, or the like. Morphological analysis is a technique that divides a document into morphemes and analyzes them. In language processing, chunking is a process that rejoins words split by morphological analysis or the like into semantic units as needed. Specifically, it groups word-level units into clause-level units, or joins multiple words that together form a distinct phrase. Character strings representing feature words are registered in the keyword dictionary 136, and these character strings may be nouns such as organization names, person names, place names, and unique object names. A feature word (noun) may be a single word or a phrase consisting of a plurality of words.

[Processing flow]
Hereinafter, the flow of a series of processes of the control unit 110 in the embodiment will be described with reference to a flowchart. FIG. 7 is a flowchart showing the flow of a series of processes of the control unit 110 in the embodiment. The processing of this flowchart may be repeated, for example, at a predetermined cycle.

First, the preprocessing unit 112 determines the content of interest from among the plurality of contents included in the content data 132 (S100). For example, the preprocessing unit 112 may determine, as the content of interest, the article with the latest submission time (the most recently submitted new article) among the plurality of articles included in the content data 132. Alternatively, when an HTTP request or the like is received from the first terminal device 10 by the communication unit 102, the preprocessing unit 112 may refer to the user log data 134, identify, from among the plurality of contents included in the content data 132, content that the user of the first terminal device 10 that sent the request viewed in the past (for example, the content the user viewed last), and determine the identified content as the content of interest.
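As a rough illustration of step S100, the two ways of choosing the content of interest could be sketched as follows. The `Article` record and the function name are hypothetical and only mirror the fields described for the content data 132. Falling back to the newest article when the last-viewed ID is unknown is an assumption of this sketch, not something the description specifies.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Article:
    """Hypothetical record mirroring one row of the content data 132."""
    article_id: str
    submitted_at: datetime        # submission time
    vector: Tuple[float, ...]     # article vector (content vector)


def pick_content_of_interest(
    articles: Sequence[Article],
    last_viewed_id: Optional[str] = None,
) -> Article:
    """S100: choose the content of interest.

    If the requesting user's last-viewed article is known and present,
    use it; otherwise use the most recently submitted article.
    (The fallback is an assumption of this sketch.)
    """
    if last_viewed_id is not None:
        for article in articles:
            if article.article_id == last_viewed_id:
                return article
    return max(articles, key=lambda a: a.submitted_at)
```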

Next, the preprocessing unit 112 derives the degree of similarity between the determined content of interest and each of the plurality of contents included in the content data 132 (S102). For example, the preprocessing unit 112 derives, as the similarity between contents, the cosine similarity between the content vector corresponding to the content of interest and each content vector corresponding to each of the plurality of contents.
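The cosine similarity used in step S102 can be sketched in a few lines. The function names are illustrative, and returning 0.0 for a zero vector is an assumption of this sketch rather than something stated in the description.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two content vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_u * norm_v)


def similarities_to(
    target: Sequence[float],
    vectors: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """S102: similarity of the content of interest to every stored content."""
    return {cid: cosine_similarity(target, vec) for cid, vec in vectors.items()}
```

Because cosine similarity depends only on direction, two articles whose vectors point the same way score 1.0 regardless of vector length.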

Next, the extraction unit 114 selects, from among the plurality of contents for which the preprocessing unit 112 has derived the similarity to the content of interest, the top predetermined number N of contents with the highest similarity to the content of interest, and extracts keywords from the set of the N selected contents (S104). For example, the extraction unit 114 performs the feature word extraction process on the content of interest and on each of the N contents, and extracts character strings (nouns) registered in the keyword dictionary 136 as first keywords. The extraction unit 114 further performs a named entity extraction process on the content of interest and on each of the N contents, and extracts, as second keywords, character strings classified into predetermined named entity classes such as organization name, person name, place name, date expression, time expression, monetary expression, percentage expression, and unique object name.
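A minimal sketch of the first half of step S104 follows. It assumes the similarity map does not contain the content of interest itself, and it replaces morphological analysis and chunking with plain substring matching against the dictionary. That is a deliberate simplification for illustration, not the feature word extraction process described above. Named entity extraction (the second keywords) is omitted, and all function names are hypothetical.

```python
from typing import Dict, List, Set


def top_n_similar(similarity: Dict[str, float], n: int) -> List[str]:
    """IDs of the N contents most similar to the content of interest."""
    ranked = sorted(similarity, key=lambda cid: similarity[cid], reverse=True)
    return ranked[:n]


def extract_first_keywords(text: str, keyword_dictionary: Set[str]) -> Set[str]:
    """Dictionary entries that occur in the text.

    Stand-in for the feature word extraction process: a real system
    would tokenize with a morphological analyzer before the lookup.
    """
    return {kw for kw in keyword_dictionary if kw in text}
```

For example, `top_n_similar({"a": 0.9, "b": 0.2, "c": 0.7}, 2)` keeps the two closest contents, `"a"` and `"c"`, in that order.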

Next, using the keywords extracted by the extraction unit 114 (the first keywords and the second keywords) as conditions, the selection unit 116 selects, from among the plurality of contents included in the content data 132, content that is related in time series to the content of interest (hereinafter referred to as related content) (S106). Related content is, for example, content that deals with the same event as the content of interest: if the content of interest deals with the event as of a certain first time, the related content deals with the event as of a second time before or after the first time. Specifically, when the content of interest is a news article about "Typhoon No. 3", which occurred on August 1, 2020, news articles about "Typhoon No. 3" as of August 2, 2020 or later are selected as related content.

For example, among the plurality of first keywords that the extraction unit 114 extracted from both the content of interest and the N contents through the feature word extraction process, the selection unit 116 takes the one or more first keywords whose accuracy is equal to or higher than a threshold, and selects, from among the plurality of contents, content that includes at least a predetermined ratio of those first keywords as provisional related content. The accuracy is an index value representing how likely it is that a first keyword is a keyword that aptly describes the event of the content of interest, and may be represented, for example, by the frequency (number of occurrences) of the keyword in the content of interest and the N contents. In this case, the accuracy becomes larger as the first keyword appears more often in the content of interest and the N contents.
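The frequency-based measure described above could be computed as in the following sketch. It assumes each content has already been reduced to a list of first-keyword occurrences, and the threshold value used in the example is made up for illustration.

```python
from collections import Counter
from typing import Iterable, Set


def keywords_above_threshold(
    occurrences_per_content: Iterable[Iterable[str]],
    threshold: int,
) -> Set[str]:
    """First keywords whose total occurrence count meets the threshold.

    The count runs over the content of interest and the N similar
    contents together, as one occurrence list per content.
    """
    counts = Counter(
        keyword
        for occurrences in occurrences_per_content
        for keyword in occurrences
    )
    return {keyword for keyword, count in counts.items() if count >= threshold}


# Hypothetical occurrence lists for the content of interest and two similar contents.
kept = keywords_above_threshold(
    [["Tokyo", "fireworks", "fireworks"], ["fireworks", "Tokyo"], ["baseball"]],
    threshold=2,
)
```

Here "Tokyo" (2 occurrences) and "fireworks" (3 occurrences) are kept, while "baseball" (1 occurrence) is dropped.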

After selecting, from among the plurality of contents, the contents that include the first keywords whose accuracy is equal to or higher than the threshold as one or more provisional related contents, the selection unit 116 selects, from among those provisional related contents, the content that includes the second keywords extracted from the content of interest by the extraction unit 114 through the named entity extraction process as the related content.

FIG. 8 is a diagram for explaining the related content selection process. In the figure, CT_1 represents the content of interest, and CT_2 to CT_4 each represent a content that is a candidate for selection as related content (the plurality of contents included in the content data 132). For example, from the content of interest CT_1, the character strings "Tokyo", "fireworks", "Minato Ward", "festival", and "○○ Park" are extracted as first keywords whose accuracy is equal to or higher than the threshold, and, as second keywords, the character string "Tokyo Metropolis" classified into the class "place name" and the character string "fireworks display" classified into the class "unique object name" are extracted. From the content CT_2, the character strings "Tokyo", "fireworks", "Minato Ward", "festival", and "○○ Park" are extracted as first keywords, and, as second keywords, the character string "Tokyo Metropolis" classified into the class "place name" and the character string "fireworks display" classified into the class "unique object name" are extracted. From the content CT_3, the character strings "Tokyo", "fireworks", "Shinjuku Ward", "○○ Stadium", and "watching baseball" are extracted as first keywords, and, as second keywords, the character string "Tokyo Metropolis" classified into the class "place name" and the character string "fireworks display" classified into the class "unique object name" are extracted. From the content CT_4, the character strings "Kanagawa", "fireworks", "Yokohama City", "festival", and "△△ Park" are extracted as first keywords, and, as second keywords, the character string "Kanagawa Prefecture" classified into the class "place name" and the character string "fireworks display" classified into the class "unique object name" are extracted.

For example, when the predetermined ratio is 80%, the content CT_2 includes 100% of the first keywords whose accuracy is equal to or higher than the threshold and also includes all of the second keywords, so the selection unit 116 selects the content CT_2 as related content of the content of interest CT_1. The content CT_3, on the other hand, includes all of the second keywords but includes only 40% of the first keywords whose accuracy is equal to or higher than the threshold, which is below the predetermined ratio, so the selection unit 116 does not select the content CT_3 as related content of the content of interest CT_1. The content CT_4 lacks some of the second keywords and includes only 40% of the first keywords whose accuracy is equal to or higher than the threshold, which is below the predetermined ratio, so the selection unit 116 does not select the content CT_4 as related content of the content of interest CT_1 either.
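The two-stage selection in the FIG. 8 example can be reproduced with the sketch below. The dictionary layout and the function names are hypothetical, the English keyword strings are translations of the ones in the figure, and requiring every second keyword to match is this sketch's reading of the example rather than a stated rule.

```python
from typing import Dict, List, Set

Keywords = Dict[str, Set[str]]  # keys: "first" and "second"


def first_keyword_ratio(target_first: Set[str], candidate_first: Set[str]) -> float:
    """Share of the target's first keywords that also appear in the candidate."""
    if not target_first:
        return 0.0
    return len(target_first & candidate_first) / len(target_first)


def select_related(
    target: Keywords,
    candidates: Dict[str, Keywords],
    min_ratio: float,
) -> List[str]:
    """S106: filter by first keywords, then by second keywords."""
    related = []
    for content_id, kw in candidates.items():
        # Stage 1: provisional related content by first-keyword coverage.
        if first_keyword_ratio(target["first"], kw["first"]) < min_ratio:
            continue
        # Stage 2: keep only candidates that contain every second keyword.
        if target["second"] <= kw["second"]:
            related.append(content_id)
    return related


ct_1: Keywords = {
    "first": {"Tokyo", "fireworks", "Minato Ward", "festival", "○○ Park"},
    "second": {"Tokyo Metropolis", "fireworks display"},
}
candidates: Dict[str, Keywords] = {
    "CT_2": {
        "first": {"Tokyo", "fireworks", "Minato Ward", "festival", "○○ Park"},
        "second": {"Tokyo Metropolis", "fireworks display"},
    },
    "CT_3": {
        "first": {"Tokyo", "fireworks", "Shinjuku Ward", "○○ Stadium", "watching baseball"},
        "second": {"Tokyo Metropolis", "fireworks display"},
    },
    "CT_4": {
        "first": {"Kanagawa", "fireworks", "Yokohama City", "festival", "△△ Park"},
        "second": {"Kanagawa Prefecture", "fireworks display"},
    },
}
related = select_related(ct_1, candidates, min_ratio=0.8)
```

With an 80% ratio, only CT_2 survives: CT_3 and CT_4 each share two of the five first keywords (40%) and are rejected at the first stage.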

When selecting related content from among the plurality of contents included in the content data 132, the selection unit 116 may exclude from the candidates any content whose similarity to the content of interest (for example, the cosine similarity) is below a threshold.

Returning to the description of FIG. 7, the classification unit 118 next determines whether or not the number of related contents selected by the selection unit 116 exceeds a predetermined number (S108). The predetermined number is, for example, the number of related contents that can be displayed in the remaining area of a web page when the content of interest is displayed on that web page. The predetermined number may be treated as a hyperparameter that a web page designer or the like can set freely.

提供部１２０は、分類部１１８によって関連コンテンツの数が所定数以下であると判定された場合、通信部１０２を制御して、関連コンテンツを第１端末装置１０に提供（送信）する（Ｓ１１０）。例えば、提供部１２０は、図３に例示するようなウェブページに関連コンテンツを掲載することで、関連コンテンツを第１端末装置１０に提供してよい。 When the classification unit 118 determines that the number of related contents is equal to or less than the predetermined number, the providing unit 120 controls the communication unit 102 to provide (transmit) the related contents to the first terminal device 10 (S110). For example, the providing unit 120 may provide the related contents to the first terminal device 10 by posting them on a web page such as the one illustrated in FIG. 3.

一方、分類部１１８は、関連コンテンツの数が所定数を超えると判定した場合、着目コンテンツと時系列に関連している複数の関連コンテンツの互いの相対的な内容の変化に基づいて、複数の関連コンテンツのそれぞれを複数のグループ（クラスタ）のいずれかに分類する（Ｓ１１２）。 On the other hand, when the classification unit 118 determines that the number of related contents exceeds the predetermined number, it classifies each of the plurality of related contents, which are related in time series to the content of interest, into one of a plurality of groups (clusters) based on how their contents change relative to one another (S112).

図９から図１５は、関連コンテンツのグループ化の方法を説明するための図である。図中Ｖ１からＶ９は、関連コンテンツのコンテンツベクトルを表しており、縦軸は、関連コンテンツの内容を表し、横軸は、コンテンツがアップロードされた時刻（例えば入稿時刻）を表している。コンテンツの内容とは、例えば、コンテンツベクトルに含まれる複数の要素のことである。図示の例では、説明を簡略化するために、コンテンツの内容を一次元で表しているが、多次元であってよい。 FIGS. 9 to 15 are diagrams for explaining a method of grouping related contents. In the figures, V1 to V9 represent the content vectors of the related contents, the vertical axis represents the content of the related contents, and the horizontal axis represents the time at which each content was uploaded (for example, the submission time). The content of a content is, for example, the plurality of elements included in its content vector. In the illustrated example, the content is represented in one dimension to simplify the description, but it may be multidimensional.

まず、分類部１１８は、多次元空間において広がりをもって分布しているコンテンツベクトルＶ１からＶ９の重心Ｇ_Ｒを導出する。図９の例では、コンテンツベクトルＶ１からＶ３と、Ｖ４からＶ６と、Ｖ７からＶ９とがそれぞれ同じ軸上に分布しており、コンテンツベクトルＶ４からＶ６から見て、Ｖ１からＶ３と、Ｖ７からＶ９とが互いに等距離に分布している。そのため、図１０に例示するように、重心Ｇ_Ｒは、Ｖ４からＶ６を通る軸上に現れる。 First, the classification unit 118 derives the center of gravity G_R of the content vectors V1 to V9, which are spread out in the multidimensional space. In the example of FIG. 9, the content vectors V1 to V3, V4 to V6, and V7 to V9 each lie on their own common axis, and, seen from the content vectors V4 to V6, V1 to V3 and V7 to V9 are at equal distances. Therefore, as illustrated in FIG. 10, the center of gravity G_R appears on the axis passing through V4 to V6.

分類部１１８は、重心Ｇ_Ｒを導出すると、時間軸方向に離散的に分布するコンテンツベクトルＶ１からＶ９のどこでグループを分離させるのかを決めるために、グループの境界となる境界線をコンテンツベクトルの間に設定する。図１１に示すように、例えば、グループの境界となる境界線は、時刻ｔ_１－２、ｔ_２－３、ｔ_３－４、ｔ_４－５、ｔ_５－６、ｔ_６－７、ｔ_７－８、ｔ_８－９のいずれか一か所または複数か所に設けられる。時刻ｔ_１－２は、コンテンツベクトルＶ１に対応したアップロード時刻ｔ_１とコンテンツベクトルＶ２に対応したアップロード時刻ｔ_２との間の時刻であり、時刻ｔ_２－３は、コンテンツベクトルＶ２に対応したアップロード時刻ｔ_２とコンテンツベクトルＶ３に対応したアップロード時刻ｔ_３との間の時刻である。他の時刻ｔ_３－４、ｔ_４－５、ｔ_５－６、ｔ_６－７、ｔ_７－８、ｔ_８－９についても同様である。なお、縦軸を便宜上一次元としているため、グループの境界を一次元の境界線として説明しているが、上述したように、縦軸が多次元である場合、グループの境界も多次元空間（例えば平面など）であってよい。 After deriving the center of gravity G_R, the classification unit 118 sets boundary lines between the content vectors in order to decide where to separate the content vectors V1 to V9, which are discretely distributed along the time axis, into groups. As shown in FIG. 11, for example, a boundary line between groups is provided at one or more of the times t_1-2, t_2-3, t_3-4, t_4-5, t_5-6, t_6-7, t_7-8, and t_8-9. The time t_1-2 is a time between the upload time t_1 corresponding to the content vector V1 and the upload time t_2 corresponding to the content vector V2, and the time t_2-3 is a time between the upload time t_2 corresponding to the content vector V2 and the upload time t_3 corresponding to the content vector V3. The same applies to the other times t_3-4, t_4-5, t_5-6, t_6-7, t_7-8, and t_8-9. Because the vertical axis is one-dimensional for convenience, the group boundary is described as a one-dimensional boundary line; however, as described above, when the vertical axis is multidimensional, the group boundary may also be a multidimensional space (for example, a plane).

例えば、分類部１１８は、境界線の候補となる複数の時刻ｔ_１－２、ｔ_２－３、ｔ_３－４、ｔ_４－５、ｔ_５－６、ｔ_６－７、ｔ_７－８、ｔ_８－９の中から、グループ内のコンテンツベクトルを重心で近似するときの誤差が最も小さくなる時刻に境界線を設定することで、コンテンツベクトルＶ１からＶ９が分布する多次元空間を複数のグループに分離する。 For example, from among the plurality of candidate boundary times t_1-2, t_2-3, t_3-4, t_4-5, t_5-6, t_6-7, t_7-8, and t_8-9, the classification unit 118 sets a boundary line at the time that minimizes the error incurred when the content vectors in each group are approximated by the center of gravity of that group, thereby separating the multidimensional space in which the content vectors V1 to V9 are distributed into a plurality of groups.

図１２の例では、時刻ｔ_１－２を境界線としており、コンテンツベクトルＶ１からＶ９が分布する多次元空間は、時刻ｔ_１－２以前にアップロードされたコンテンツが含まれる第１グループと、時刻ｔ_１－２以降にアップロードされたコンテンツが含まれる第２グループとに分離される。この場合、コンテンツベクトルＶ１は、第１グループに分類され、コンテンツベクトルＶ２からＶ９は、第２グループに分類される。 In the example of FIG. 12, the time t_1-2 is used as the boundary line, and the multidimensional space in which the content vectors V1 to V9 are distributed is separated into a first group containing the contents uploaded before the time t_1-2 and a second group containing the contents uploaded after the time t_1-2. In this case, the content vector V1 is classified into the first group, and the content vectors V2 to V9 are classified into the second group.

分類部１１８は、コンテンツベクトルＶ１からＶ９を第１グループまたは第２グループのいずれかに分類すると、グループごとに、コンテンツベクトルの重心を導出し、グループごとに、重心に対するコンテンツベクトルの誤差（例えば最小二乗誤差）を導出する。 After classifying the content vectors V1 to V9 into either the first group or the second group, the classification unit 118 derives, for each group, the center of gravity of the content vectors and the error of the content vectors with respect to that center of gravity (for example, the least-squares error).
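The per-group center of gravity and error can be computed as in the sketch below. It assumes that the error is the sum of squared distances from each content vector to the center of gravity of its group, which is one reading of the least-squares error mentioned above; the function names are hypothetical.

```python
def center_of_gravity(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def squared_error(vectors):
    """Sum of squared distances from each vector to the group's center of gravity."""
    g = center_of_gravity(vectors)
    return sum(
        sum((x - gx) ** 2 for x, gx in zip(vector, g))
        for vector in vectors
    )


# A group holding a single vector coincides with its own center of gravity,
# so its error is zero.
print(squared_error([[1.0, 2.0]]))
print(center_of_gravity([[0.0, 0.0], [2.0, 4.0]]))
```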

図１２の例では、第１グループの重心Ｇ１は、コンテンツベクトルＶ１と同じ位置であることから、重心Ｇ１に対するコンテンツベクトルＶ１の誤差はゼロとなる。一方、第２グループの重心Ｇ２は、コンテンツベクトルＶ２からＶ９の重心であるため、各コンテンツベクトルに重心Ｇ２との誤差が生じる。 In the example of FIG. 12, since the center of gravity G1 of the first group is at the same position as the content vector V1, the error of the content vector V1 with respect to the center of gravity G1 is zero. On the other hand, since the center of gravity G2 of the second group is the center of gravity of the content vectors V2 to V9, an error occurs in each content vector from the center of gravity G2.

次に、分類部１１８は、図１３から図１５の例のように、時刻ｔ_１－２と異なる時刻に境界線を設定して、多次元空間を複数のグループに分離し、グループごとに重心に対するコンテンツベクトルの誤差を導出することを繰り返す。図１４に例示するように、時刻ｔ_３－４を境界線としたときが、第１グループと第２グループの誤差が最も小さくなるため、分類部１１８は、時刻ｔ_３－４を境界線に設定し、時刻ｔ_３－４以降について境界線を探索する。このように、分類部１１８は、コンテンツベクトルを重心で近似するときの誤差が最も小さくなる境界線を探索していき、コンテンツベクトルＶ１からＶ９が分布する多次元空間を複数のグループに分離する。 Next, as in the examples of FIGS. 13 to 15, the classification unit 118 repeatedly sets the boundary line at a time other than the time t_1-2, separates the multidimensional space into a plurality of groups, and derives, for each group, the error of the content vectors with respect to the center of gravity. As illustrated in FIG. 14, the error of the first group and the second group is smallest when the time t_3-4 is used as the boundary line, so the classification unit 118 sets the boundary line at the time t_3-4 and then searches for a further boundary line after the time t_3-4. In this way, the classification unit 118 keeps searching for the boundary line that minimizes the error incurred when the content vectors are approximated by the center of gravity, and separates the multidimensional space in which the content vectors V1 to V9 are distributed into a plurality of groups.
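The boundary search described above can be sketched as a greedy procedure over time-ordered content vectors. This is a minimal sketch under stated assumptions: the error is taken to be the sum of squared distances to the group's center of gravity, the search fixes the best boundary and then continues only in the part after it, and the names `sse`, `best_boundary`, and `segment` as well as the sample values are hypothetical.

```python
def sse(vectors):
    """Sum of squared distances from each vector to the center of gravity of the list."""
    count = len(vectors)
    g = [sum(column) / count for column in zip(*vectors)]
    return sum(sum((x - gx) ** 2 for x, gx in zip(v, g)) for v in vectors)


def best_boundary(vectors):
    """Index i for which splitting into vectors[:i] and vectors[i:] gives the smallest total error."""
    return min(
        range(1, len(vectors)),
        key=lambda i: sse(vectors[:i]) + sse(vectors[i:]),
    )


def segment(vectors, num_groups):
    """Greedy segmentation: fix the best boundary, then search again after it."""
    groups = []
    rest = list(vectors)  # must be sorted by upload time
    while len(groups) < num_groups - 1 and len(rest) > 1:
        i = best_boundary(rest)
        groups.append(rest[:i])
        rest = rest[i:]
    groups.append(rest)
    return groups


# One-dimensional stand-ins for V1 to V9, in upload order (illustrative values).
vectors = [[0.0]] * 3 + [[1.0]] * 3 + [[2.0]] * 3
print(best_boundary(vectors))           # boundary between V3 and V4
print([len(g) for g in segment(vectors, 3)])
```

With these values the first boundary falls between V3 and V4 and the second between V6 and V7. Because the groups are contiguous in time, the procedure differs from ordinary clustering, which ignores the order of the data.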

図７の説明に戻り、提供部１２０は、分類部１１８によって複数の関連コンテンツが複数のグループに分類された場合、各グループから代表となる一つの関連コンテンツ（以下、代表関連コンテンツと称する）を選択する（Ｓ１１４）。 Returning to the description of FIG. 7, when a plurality of related contents are classified into a plurality of groups by the classification unit 118, the providing unit 120 selects one related content (hereinafter referred to as representative related content) as a representative from each group. Select (S114).

図１６は、代表関連コンテンツの選択方法を説明するための図である。例えば、分類部１１８が、時刻ｔ_３－４と時刻ｔ_６－７とのそれぞれに境界線を設定して、コンテンツベクトルＶ１からＶ９が分布する多次元空間を３つのグループに分離した場合、提供部１２０は、各グループに分類されたコンテンツベクトルから一つのコンテンツベクトルを選択することで、代表関連コンテンツを選択する。図示の例では、第２グループに分類されたコンテンツベクトルＶ５が着目コンテンツのコンテンツベクトルとしている。例えば、提供部１２０は、着目コンテンツとの話しの繋がりを重視する場合、図示の例のように、各グループにおいて、着目コンテンツを含む第２グループに時間的に最も近い関連コンテンツを、代表関連コンテンツとして選択してよい。この場合、図示の例のように、第１グループからは、コンテンツベクトルＶ３に対応する関連コンテンツが代表関連コンテンツとして選択され、第３グループからは、コンテンツベクトルＶ７に対応する関連コンテンツが代表関連コンテンツとして選択される。 FIG. 16 is a diagram for explaining a method of selecting representative related contents. For example, when the classification unit 118 sets boundary lines at the time t_3-4 and the time t_6-7 and separates the multidimensional space in which the content vectors V1 to V9 are distributed into three groups, the providing unit 120 selects the representative related content of each group by selecting one content vector from the content vectors classified into that group. In the illustrated example, the content vector V5 classified into the second group is the content vector of the content of interest. For example, when continuity of the story with the content of interest is emphasized, the providing unit 120 may select, in each group, the related content that is closest in time to the second group containing the content of interest as the representative related content, as in the illustrated example. In this case, the related content corresponding to the content vector V3 is selected as the representative related content from the first group, and the related content corresponding to the content vector V7 is selected as the representative related content from the third group.

また、提供部１２０は、ニュース記事などで第一報を重視する場合、各グループにおいて最も時刻が早い関連コンテンツ（図１６の例では、コンテンツベクトルＶ１に対応する関連コンテンツやコンテンツベクトルＶ７に対応する関連コンテンツ）を代表関連コンテンツとして選択してもよい。 Further, when the first report is emphasized in a news article or the like, the providing unit 120 corresponds to the related content having the earliest time in each group (in the example of FIG. 16, the related content corresponding to the content vector V1 or the content vector V7). Related content) may be selected as the representative related content.
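The two selection policies for representative related contents can be sketched as follows. The function name, the policy labels "nearest" and "earliest", and the decision to skip the group that contains the content of interest are assumptions made for the illustration; the identifiers V1 to V9 follow the example of FIG. 16.

```python
def pick_representatives(groups, target_group, policy="nearest"):
    """Pick one representative per group, skipping the group of the content of interest.

    groups: groups in time order, each a time-ordered list of content ids.
    target_group: index of the group that contains the content of interest.
    policy: "nearest" picks the item closest in time to the target group,
            "earliest" picks the first item of each group.
    """
    representatives = []
    for index, group in enumerate(groups):
        if index == target_group:
            continue  # assumption: the content of interest itself is shown separately
        if policy == "earliest":
            representatives.append(group[0])
        elif index < target_group:
            representatives.append(group[-1])  # latest item of an earlier group
        else:
            representatives.append(group[0])   # earliest item of a later group
    return representatives


groups = [["V1", "V2", "V3"], ["V4", "V5", "V6"], ["V7", "V8", "V9"]]
print(pick_representatives(groups, 1, "nearest"))
print(pick_representatives(groups, 1, "earliest"))
```

The "nearest" policy yields V3 and V7, and the "earliest" policy yields V1 and V7, which matches the two cases described for FIG. 16.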

なお、図示の例では、境界線を２か所に設定しているがこれに限られない。例えば、着目コンテンツとともに表示させる関連コンテンツの数を２つとした場合、グループ数を２つとするために、ある一つの時刻に境界線を設定してよい。また、着目コンテンツとともに表示させる関連コンテンツの数を４つ以上とした場合、グループ数を４つ以上とするために、３つ以上の時刻に境界線を設定してよい。 In the illustrated example, the boundary line is set at two places, but the boundary line is not limited to this. For example, when the number of related contents to be displayed together with the content of interest is two, a boundary line may be set at a certain time in order to set the number of groups to two. Further, when the number of related contents to be displayed together with the content of interest is four or more, a boundary line may be set at three or more times in order to set the number of groups to four or more.

次に、提供部１２０は、Ｓ１１０の処理として、各グループから選択した代表関連コンテンツを第１端末装置１０に提供する。これによって、本フローチャートの処理が終了する。 Next, the providing unit 120 provides the representative-related content selected from each group to the first terminal device 10 as the process of S110. This ends the processing of this flowchart.

以上説明した実施形態によれば、複数のコンテンツのうち、ある着目コンテンツとの類似度が大きい上位所定数Ｎのコンテンツからキーワードを抽出する抽出部１１４と、抽出部１１４により抽出されたキーワードを条件にして、複数のコンテンツの中から、着目コンテンツと時系列に関連する一以上の関連コンテンツを選択する選択部１１６とを備えるため、着目コンテンツと時系列に関連する関連コンテンツをユーザに提供することができる。 The embodiment described above includes the extraction unit 114, which extracts keywords from the top predetermined number N of contents having the highest similarity to a certain content of interest among a plurality of contents, and the selection unit 116, which uses the keywords extracted by the extraction unit 114 as a condition to select, from the plurality of contents, one or more related contents that are related in time series to the content of interest. Related content that is related in time series to the content of interest can therefore be provided to the user.

例えば、コンテンツがニュース記事である場合、一つの関連した出来事の中で、その出来事に変化が生じるたびに、出来事に関する新たな記事が入稿され得る。このように次々と入稿される新しい記事がコンテンツとしてユーザに提供される場合、ユーザは、興味のある記事に対してジャンル（話題）が関連する記事よりも、その興味のある記事の続報として入稿された記事や、興味のある記事が続報記事であるときの第一報に相当する記事の方がより興味関心が高い場合がある。 For example, if the content is a news article, a new article about the event may be submitted each time the event changes within one related event. When new articles that are submitted one after another are provided to the user as content in this way, the user is more likely to follow the article of interest than the article related to the genre (topic) of the article of interest. In some cases, the submitted article or the article corresponding to the first report when the article of interest is a follow-up article is more interesting.

例えば、興味のある記事に対してジャンルが関連する記事をコンテンツとしてユーザに提供する場合、ＴＦ－ＩＤＦなどで記事同士の類似度を求め、その類似度に応じて記事をユーザに提供するのか否かを決定することが考えられ得る。この場合、記事のジャンルとしては近いものの、記事同士に時間的な関連性がない場合がある。例えば、「豪雨」を報道する二つの記事が存在する場合、一方の記事が「東京」で発生した「豪雨」を報道する記事であり、他方の記事が「九州」で発生した「豪雨」を報道する記事であった場合、記事のジャンルは互いに「豪雨」という自然災害で共通するものの、発生場所が互い異なるため、一方の記事を他方の記事の続報記事として関連付けることは適切でない。 For example, when providing an article related to a genre to a user as content for an article of interest, whether or not to obtain the similarity between the articles by TF-IDF or the like and provide the article to the user according to the similarity. It can be considered to decide. In this case, although the genres of the articles are similar, the articles may not be temporally related to each other. For example, if there are two articles reporting "heavy rain", one article reports "heavy rain" that occurred in "Tokyo" and the other article reports "heavy rain" that occurred in "Kyushu". In the case of articles to be reported, although the genres of the articles are common to each other due to the natural disaster of "heavy rain", it is not appropriate to associate one article as a follow-up article of the other article because the place of occurrence is different from each other.

これに対して、本実施形態では、記事同士の互いの話題の近さというよりも、主要な固有名詞をキーワードとし、そのキーワードの共通性に応じて、着目コンテンツに対して他のコンテンツが関連するのか否かを決定するため、着目コンテンツと時系列に関連するコンテンツを適切に選択することができる。 In contrast, the present embodiment uses major proper nouns as keywords, rather than the closeness of the topics of the articles, and determines whether other content is related to the content of interest according to how many of those keywords they share. Content that is related in time series to the content of interest can therefore be selected appropriately.

また、上述した実施形態によれば、選択部１１６により選択された複数の関連コンテンツの相対的な内容の変化に基づいて、複数の関連コンテンツのそれぞれを複数のグループのいずれかに分類する分類部１１８を更に備えることで、時間軸方向に離散的に分布する複数の関連コンテンツのうち、時間が移り変わっても内容の変化が小さい関連コンテンツを一つのグループに纏めることができる。この結果、各グループから代表的な関連コンテンツを残し、その他の関連コンテンツを間引くことができる。 Further, according to the above-described embodiment, the classification unit that classifies each of the plurality of related contents into one of the plurality of groups based on the relative change in the contents of the plurality of related contents selected by the selection unit 116. By further providing 118, among a plurality of related contents discretely distributed in the time axis direction, related contents whose contents change little even if the time changes can be grouped into one group. As a result, representative related contents can be left from each group, and other related contents can be thinned out.

一般的に、データの階層クラスタリングは、全てのデータに順序がなく、任意のデータからクラスタリングが行われる。これに対して、本実施形態では、時間方向に離散的に分布する複数の関連コンテンツを、その時間方向に関してクラスタリングするため、内容の変化を残しつつ関連コンテンツを間引くことができる。この結果、内容が重複したコンテンツがユーザに提供されてしまうのを抑制することができる。 In general, hierarchical clustering treats the data as unordered, and clustering may start from any data point. In contrast, the present embodiment clusters a plurality of related contents that are discretely distributed in the time direction along that time direction, so the related contents can be thinned out while the changes in content are preserved. As a result, it is possible to prevent content with overlapping substance from being provided to the user.

＜実施形態の変形例＞ <Modified example of the embodiment>
以下、上述した実施形態の変形例について説明する。上述した実施形態では、分類部１１８が、着目コンテンツと時系列に関連している複数の関連コンテンツの互いの相対的な内容の変化に基づいて、複数の関連コンテンツのそれぞれを複数のグループのいずれかに分類するものとして説明したがこれに限られない。 Hereinafter, a modified example of the above-described embodiment will be described. In the above-described embodiment, the classification unit 118 was described as classifying each of the plurality of related contents, which are related in time series to the content of interest, into one of a plurality of groups based on how their contents change relative to one another, but the classification is not limited to this.

例えば、分類部１１８は、時間的に等間隔となるように、複数の関連コンテンツをグループに分類してもよいし、各グループに含まれる関連コンテンツの数が均等となるように、複数の関連コンテンツをグループに分類してもよい。 For example, the classification unit 118 may classify a plurality of related contents into groups so that they are evenly spaced in time, or a plurality of related contents so that the number of related contents contained in each group is equal. Content may be grouped.
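The two alternative groupings in this modified example can be sketched as below. The representation of each related content as an `(id, upload_time)` pair, the function names, and the handling of remainders and of items on the last interval edge are assumptions made for the illustration.

```python
def group_by_equal_count(items, num_groups):
    """Split time-sorted (id, upload_time) pairs into groups of nearly equal size."""
    ordered = sorted(items, key=lambda item: item[1])
    size, extra = divmod(len(ordered), num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        end = start + size + (1 if g < extra else 0)  # spread any remainder over the first groups
        groups.append(ordered[start:end])
        start = end
    return groups


def group_by_equal_interval(items, num_groups):
    """Split (id, upload_time) pairs into groups that cover equal time spans."""
    times = [t for _, t in items]
    lo, hi = min(times), max(times)
    width = (hi - lo) / num_groups or 1  # avoid a zero width when all times coincide
    groups = [[] for _ in range(num_groups)]
    for item in sorted(items, key=lambda item: item[1]):
        index = min(int((item[1] - lo) / width), num_groups - 1)
        groups[index].append(item)
    return groups


items = [("a", 0), ("b", 1), ("c", 2), ("d", 9)]
print(group_by_equal_count(items, 2))
print(group_by_equal_interval(items, 2))
```

For the same four items, the equal-count grouping puts two items in each group, while the equal-interval grouping puts the three early items in one group and the late item alone in the other.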

また、上述した実施形態では、前処理部１１２が、複数のコンテンツのそれぞれと、着目コンテンツとの類似度を導出し、抽出部１１４が、前処理部１１２によって着目コンテンツとの類似度が導出された複数のコンテンツの中から、着目コンテンツとの類似度が大きい上位所定数Ｎのコンテンツを選択し、選択したＮ個のコンテンツの集合からキーワードを抽出し、選択部１１６が、抽出部１１４によって抽出されたキーワードを条件にして、コンテンツデータ１３２に含まれる複数のコンテンツの中から、着目コンテンツと時系列に関連する関連コンテンツを選択した上で、分類部１１８が、複数の関連コンテンツをグループに分類するものとして説明したがこれに限られない。 Further, in the above-described embodiment, the processing was described as follows: the preprocessing unit 112 derives the similarity between each of the plurality of contents and the content of interest; the extraction unit 114 selects, from the plurality of contents for which the similarity was derived, the top predetermined number N of contents having the highest similarity to the content of interest and extracts keywords from the set of the selected N contents; the selection unit 116 uses the keywords extracted by the extraction unit 114 as a condition to select, from the plurality of contents included in the content data 132, related contents that are related in time series to the content of interest; and the classification unit 118 then classifies the plurality of related contents into groups. However, the processing is not limited to this.

例えば、分類部１１８は、前処理部１１２によって着目コンテンツとの類似度が導出された複数のコンテンツの中から、着目コンテンツとの類似度が大きい上位所定数Ｘのコンテンツを関連コンテンツとして扱い、その関連コンテンツをグループに分類してもよい。所定数Ｘは、上述した所定数Ｎと同じであってもよいし、異なっていてもよい。 For example, the classification unit 118 treats the content of the upper predetermined number X having a high degree of similarity with the content of interest as the related content from among the plurality of contents whose similarity with the content of interest is derived by the preprocessing unit 112. Related content may be grouped. The predetermined number X may be the same as or different from the predetermined number N described above.

また、分類部１１８は、ユーザに提供する対象のコンテンツのジャンルが事前に決まっている場合、そのジャンルに該当するコンテンツのみをグループに分類してもよい。 Further, when the genre of the content to be provided to the user is determined in advance, the classification unit 118 may classify only the content corresponding to the genre into a group.

＜ハードウェア構成＞ <Hardware configuration>
上述した実施形態の情報処理装置１００は、例えば、図１７に示すようなハードウェア構成により実現される。図１７は、実施形態の情報処理装置１００のハードウェア構成の一例を示す図である。 The information processing apparatus 100 of the above-described embodiment is realized by, for example, a hardware configuration as shown in FIG. 17. FIG. 17 is a diagram showing an example of the hardware configuration of the information processing apparatus 100 of the embodiment.

情報処理装置１００は、ＮＩＣ１００－１、ＣＰＵ１００－２、ＲＡＭ１００－３、ＲＯＭ１００－４、フラッシュメモリやＨＤＤなどの二次記憶装置１００－５、およびドライブ装置１００－６が、内部バスあるいは専用通信線によって相互に接続された構成となっている。ドライブ装置１００－６には、光ディスクなどの可搬型記憶媒体が装着される。二次記憶装置１００－５、またはドライブ装置１００－６に装着された可搬型記憶媒体に格納されたプログラムがＤＭＡコントローラ（不図示）などによってＲＡＭ１００－３に展開され、ＣＰＵ１００－２によって実行されることで、制御部１１０が実現される。制御部１１０が参照するプログラムは、ネットワークＮＷを介して他の装置からダウンロードされてもよい。 The information processing device 100 includes NIC100-1, CPU100-2, RAM100-3, ROM100-4, a secondary storage device 100-5 such as a flash memory or an HDD, and a drive device 100-6 on an internal bus or a dedicated communication line. It is configured to be interconnected by. A portable storage medium such as an optical disk is mounted on the drive device 100-6. A program stored in a portable storage medium mounted on the secondary storage device 100-5 or the drive device 100-6 is expanded in the RAM 100-3 by a DMA controller (not shown) or the like, and executed by the CPU 100-2. As a result, the control unit 110 is realized. The program referred to by the control unit 110 may be downloaded from another device via the network NW.

以上、本発明を実施するための形態について実施形態を用いて説明したが、本発明はこうした実施形態に何ら限定されるものではなく、本発明の要旨を逸脱しない範囲内において種々の変形及び置換を加えることができる。 Although the embodiments for carrying out the present invention have been described above using the embodiments, the present invention is not limited to these embodiments, and various modifications and substitutions are made without departing from the gist of the present invention. Can be added.

１…情報処理システム、１０…第１端末装置、２０…第２端末装置、１００…情報処理装置、１０２…通信部、１１０…制御部、１１２…前処理部、１１４…抽出部、１１６…選択部、１１８…分類部、１２０…提供部、１３０…記憶部 1 ... Information processing system, 10 ... First terminal device, 20 ... Second terminal device, 100 ... Information processing apparatus, 102 ... Communication unit, 110 ... Control unit, 112 ... Preprocessing unit, 114 ... Extraction unit, 116 ... Selection unit, 118 ... Classification unit, 120 ... Providing unit, 130 ... Storage unit

Claims

An information processing apparatus comprising:
an extraction unit that extracts keywords from a predetermined number of top-ranked contents having the highest similarity to a certain content of interest among a plurality of contents; and
a selection unit that selects, from the plurality of contents and based on the keywords extracted by the extraction unit, one or more related contents related in time series to the content of interest, wherein
the extraction unit extracts a first keyword representing a feature word and a second keyword representing a named entity from each of the content of interest and the predetermined number of top-ranked contents, and
the selection unit selects, as the related content, content among the predetermined number of top-ranked contents that shares the first keyword and the second keyword with the content of interest and that is associated with a second time before or after a first time associated with the content of interest.

The information processing apparatus according to claim 1, wherein
the selection unit selects, as the related content, content from the plurality of contents that contains the second keyword and contains a predetermined ratio or more of a plurality of the first keywords.

The information processing apparatus according to claim 1 or 2, further comprising
a preprocessing unit that excludes, from the plurality of contents, content whose similarity to the content of interest is less than a threshold value.

The information processing apparatus according to any one of claims 1 to 3, further comprising
a classification unit that, when a plurality of related contents are selected by the selection unit, classifies the plurality of related contents into a plurality of groups based on relative changes among content vectors obtained by vectorizing each of the plurality of related contents.

The information processing apparatus according to claim 4, wherein
the classification unit
distributes the content vectors corresponding to the respective related contents discretely in a feature space,
derives the center of gravity of the plurality of content vectors distributed discretely in the feature space, and
classifies the plurality of related contents into a plurality of groups based on the derived center of gravity.

The information processing apparatus according to claim 5, wherein
the classification unit
derives, for each group, the center of gravity of the content vectors corresponding to the related contents included in that group, and
classifies the plurality of related contents into a plurality of groups so that the error with respect to the center of gravity derived for each group is minimized.

The information processing apparatus according to any one of claims 4 to 6, further comprising:
a communication unit that communicates with a terminal device usable by a user; and
a providing unit that selects at least one related content from each of the plurality of groups and controls the communication unit to provide the related content selected from each of the groups to the terminal device.

An information processing method in which a computer:
extracts keywords from a predetermined number of top-ranked contents having the highest similarity to a certain content of interest among a plurality of contents;
selects, from the plurality of contents and based on the extracted keywords, one or more related contents related in time series to the content of interest;
extracts a first keyword representing a feature word and a second keyword representing a named entity from each of the content of interest and the predetermined number of top-ranked contents; and
selects, as the related content, content among the predetermined number of top-ranked contents that shares the first keyword and the second keyword with the content of interest and that is associated with a second time before or after a first time associated with the content of interest.

A program that causes a computer to execute:
a process of extracting keywords from a predetermined number of top-ranked contents having the highest similarity to a certain content of interest among a plurality of contents;
a process of selecting, from the plurality of contents and based on the extracted keywords, one or more related contents related in time series to the content of interest;
a process of extracting a first keyword representing a feature word and a second keyword representing a named entity from each of the content of interest and the predetermined number of top-ranked contents; and
a process of selecting, as the related content, content among the predetermined number of top-ranked contents that shares the first keyword and the second keyword with the content of interest and that is associated with a second time before or after a first time associated with the content of interest.