JP7088795B2

JP7088795B2 - Information processing equipment, information processing methods, and programs

Info

Publication number: JP7088795B2
Application number: JP2018174911A
Authority: JP
Inventors: 一騎山内; 琢郎森; 伸次池宮
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2018-09-19
Filing date: 2018-09-19
Publication date: 2022-06-21
Anticipated expiration: 2038-09-19
Also published as: JP2020046942A

Description

The present invention relates to an information processing apparatus, an information processing method, and a program.

In fields such as natural language processing, distributed representations of words and sentences obtained with neural networks are used for machine learning and similar purposes. Separately, a technique is known for deriving, as a score, the degree of relevance between a plurality of queries (see, for example, Patent Document 1).

Patent Document 1: Japanese Unexamined Patent Publication No. 2017-228114

However, with conventional techniques, the distributed representation of a query entered as a word or sentence sometimes fails to capture the characteristics of that query adequately, so the representation is not accurate enough. As a result, language processing that relies on the distributed representations of queries may also be insufficiently accurate.

The present invention has been made in view of the above problem, and its object is to provide an information processing apparatus, an information processing method, and a program capable of improving the accuracy of distributed representations of queries.

One aspect of the present invention is an information processing apparatus including: an acquisition unit that acquires a plurality of queries input to search for information; a generation unit that generates a distributed representation of each of the plurality of queries acquired by the acquisition unit; and a selection unit that selects, from among the plurality of distributed representations generated by the generation unit, distributed representations related to each other based on the degree of overlap between the users who input each of the plurality of queries.

According to one aspect of the present invention, the accuracy of distributed representations of queries can be improved.

A diagram showing an example of the information processing system 1 including the information processing apparatus 100 of the first embodiment.
A diagram showing an example of the configuration of the information processing apparatus 100 in the first embodiment.
A flowchart showing the flow of a series of processes of the control unit 110 in the first embodiment.
A diagram schematically showing how word vectors distributed in the neighborhood of the reference word vector are selected.
A diagram schematically showing how, for each query from which one of the N word vectors was generated, a score of its relevance to the query from which the reference word vector was generated is derived.
A diagram schematically showing the result of repeating, a predetermined number of times k, the process of selecting word vectors related to a reference word vector.
A diagram showing an example of the correct answer rate of the present method.
A diagram showing an example of the hardware configuration of the information processing apparatus 100 of the embodiment.

Hereinafter, an information processing apparatus, an information processing method, and a program to which the present invention is applied will be described with reference to the drawings.

[Overview]
The information processing apparatus is implemented by one or more processors. It acquires a plurality of queries input to search for information and generates a distributed representation of each acquired query. The distributed representation of a query is information that expresses the query in terms of a plurality of features and is represented, for example, by a multidimensional vector whose elements are those features.

Having generated the distributed representations of the plurality of queries, the information processing apparatus selects, from among them, distributed representations that are related to each other, based on a predetermined index value concerning the users who input the queries from which the representations were generated. The predetermined index value concerning users may be, for example, the degree of overlap between the users who input each of the queries, or the degree of similarity between those users.

Instead of, or in addition to, the predetermined index value concerning users, the information processing apparatus may select mutually related distributed representations from among the plurality of distributed representations based on a predetermined index value concerning each of the queries. The predetermined index value concerning a query may be, for example, the time at which the query was input by a user.

In this way, the information processing apparatus selects mutually related queries from among the distributed representations of a plurality of queries based on index values such as the degree of user overlap, the degree of user similarity, and the query input time, and can thereby improve the accuracy of the distributed representations of queries.

<First Embodiment>
[Overall configuration]
FIG. 1 is a diagram showing an example of an information processing system 1 including the information processing apparatus 100 of the first embodiment. The information processing system 1 in the first embodiment includes, for example, one or more terminal devices 10, an information providing device 20, and the information processing apparatus 100. These devices are connected to each other via, for example, a network NW.

Each device shown in FIG. 1 transmits and receives various information via the network NW. The network NW includes, for example, the Internet, a WAN (Wide Area Network), a LAN (Local Area Network), a provider terminal, a wireless communication network, a wireless base station, and a dedicated line. Not every combination of the devices shown in FIG. 1 needs to be able to communicate with each other, and the network NW may partly include a local network.

The terminal device 10 is a terminal device that includes an input device, a display device, a communication device, a storage device, and an arithmetic device, for example a mobile phone such as a smartphone, a tablet terminal, or any of various personal computers. The communication device includes a network card such as a NIC (Network Interface Card), a wireless communication module, and the like. On the terminal device 10, a UA (User Agent) such as a web browser or an application program is launched and transmits a request corresponding to the user's input to the information providing device 20. The terminal device 10 on which the UA has been launched also displays various images on the display device based on information acquired from the information providing device 20.

The information providing device 20 may be, for example, a web server that provides web pages to the terminal device 10 in response to requests from a web browser (for example, HTTP (Hypertext Transfer Protocol) requests or queries). A web page contains content. The content may be, for example, document data published on a blog or website, or still image data, moving image data, audio data, or the like. The information providing device 20 may also be an application server that provides content such as images and audio to the terminal device 10 in response to requests from an application program. For example, the information providing device 20 provides users of the terminal device 10, via a website or an application, with a service for searching a database for desired content, such as document search or image search (hereinafter referred to as a search service). When a user enters a query into the terminal device 10 to search the database for desired content using the search service, the information providing device 20 receives the query from that terminal device 10 and transmits a web page or the like containing content corresponding to the received query to the terminal device 10.

The information processing apparatus 100 acquires, from the information providing device 20, a plurality of queries input during use of the search service, and generates a distributed representation of each acquired query. The information processing apparatus 100 then selects mutually related distributed representations from among the generated representations and transmits information on the selection result to the information providing device 20. In response, the information providing device 20 improves the accuracy of the search service based on, for example, the information on the distributed representations received from the information processing apparatus 100.

[Configuration of the information processing apparatus]
FIG. 2 is a diagram showing an example of the configuration of the information processing apparatus 100 in the first embodiment. As illustrated, the information processing apparatus 100 includes, for example, a communication unit 102, a control unit 110, and a storage unit 130.

The communication unit 102 includes, for example, a communication interface such as a NIC. The communication unit 102 communicates with the information providing device 20 and other devices via the network NW.

The control unit 110 includes, for example, an acquisition unit 112, a generation unit 114, a selection unit 116, and an output control unit 118. These components are implemented, for example, by a processor (or processor circuit) such as a CPU (Central Processing Unit) or GPU (Graphics Processing Unit) executing a program (software) stored in the storage unit 130. Some or all of the components of the control unit 110 may instead be implemented by hardware (circuitry) such as an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), or an FPGA (Field-Programmable Gate Array), or by software and hardware working together. The program referenced by the processor may be stored in the storage unit 130 in advance, or it may be stored on a removable storage medium such as a DVD or CD-ROM and installed from the storage medium into the storage unit 130 when the medium is loaded into a drive device of the information processing apparatus 100.

The storage unit 130 is implemented by a storage device such as an HDD (Hard Disc Drive), flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), ROM (Read Only Memory), or RAM (Random Access Memory). In addition to various programs such as firmware and application programs, the storage unit 130 stores search query information 132, action history information 134, generator information 136, and the like.

The search query information 132 is, for example, information in which each query input during use of the search service is associated with its input date and time, identification information of the user who input it (hereinafter, user ID), and the like.

The action history information 134 is, for example, information containing the action history of each user who has used the search service. It associates each user's user ID with the date and time at which the user used the search service (search date and time), the queries the user entered when searching, the URLs (Uniform Resource Locators) of web pages the user viewed, and the like.

The generator information 136 is information (a program or data structure) that defines a generator for producing a distributed representation from a query. The generator is implemented by a neural network trained in advance so that, when a single word or a phrase consisting of several words representing a query is input, it outputs a multidimensional vector with a predetermined number of elements (dimensions) as the distributed representation (hereinafter referred to as a word vector).

The generator information 136 includes, for example, connection information describing how the neurons (units) in the input layer, the one or more hidden layers (intermediate layers), and the output layer of each neural network are connected to one another, and weight information describing the connection coefficients applied to the data passed between connected neurons. The connection information includes, for example, the number of neurons in each layer, information specifying the type of neuron each neuron connects to, the activation function that implements each neuron, and gates provided between neurons in the hidden layers. The activation function may be, for example, a function that switches its behavior according to the sign of its input (such as the ReLU or ELU function), a sigmoid function, a step function, a hyperbolic tangent function, or an identity function. A gate selectively passes or weights the data transmitted between neurons according to, for example, the value returned by the activation function (for example, 1 or 0). The connection coefficients are parameters of the activation function and include, for example, the weight applied to output data when data is output from a neuron in one hidden layer to a neuron in a deeper layer. The connection coefficients may also include a bias component specific to each layer.
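As an illustration of the activation functions and connection coefficients named above, the following is a minimal sketch of a single neuron. The function names, the `alpha` parameter of ELU, and the convention that the step function returns 1 at zero are assumptions made for this example; the description does not specify any of them.

```python
import math


def relu(x):
    # Passes positive inputs through unchanged and maps the rest to zero.
    return x if x > 0 else 0.0


def elu(x, alpha=1.0):
    # Like ReLU for positive inputs; smooth negative branch (alpha is assumed).
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def step(x):
    # Assumed convention: 1 for non-negative inputs, 0 otherwise.
    return 1.0 if x >= 0 else 0.0


def identity(x):
    return x


def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(total)


# The hyperbolic tangent is available directly as math.tanh.
example = neuron_output([1.0, 2.0], [0.5, -1.0], 0.25, math.tanh)
```

Here the weights and the bias play the role of the connection coefficients, and swapping the `activation` argument changes the neuron type.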

The generator defined by the generator information 136 is implemented by a neural network such as Skip-Gram, which selects a reference word or phrase from among the words or phrases in a corpus and predicts, from the selected word or phrase, the words or phrases that appear before and after it, or Continuous Bag-of-Words (CBOW), which focuses on a context in the corpus and predicts a word or phrase in that context from the words or phrases that appear before and after it. Such neural networks are used in models such as fastText and word2vec. Unlike word2vec, fastText takes into account, during training, groupings of the subwords of a word or phrase (the characters that make up the word or phrase), so that words or phrases whose spelling varies, for example because of different inflected forms, are treated as the same word or phrase. The corpus may be, for example, documents from an encyclopedia that aggregates information published on the Internet, or the set of queries contained in the search query information 132 (a so-called query log).
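The difference between the two prediction tasks can be seen in the training examples each one consumes. The sketch below only builds those examples from a tokenized query; it does not train a network. The function names and the `window` parameter are hypothetical and chosen for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-Gram: from each center word, predict each surrounding word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: from the surrounding words, predict the word in the middle."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


query_tokens = ["cheap", "flight", "tokyo"]
pairs = skipgram_pairs(query_tokens, window=1)
examples = cbow_examples(query_tokens, window=1)
```

Skip-Gram yields one (input, output) pair per neighboring word, whereas CBOW yields one example per position with the whole context as input.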

The generator information 136 may define a plurality of generators, for example: a fastText model that uses encyclopedia documents as its corpus and generates word vectors from queries (hereinafter referred to as the first generator); a fastText model that uses a set of queries as its corpus and generates word vectors from queries (hereinafter referred to as the second generator); and a word2vec model that uses a set of queries as its corpus and generates word vectors from queries (hereinafter referred to as the third generator).

[Processing flow]
The flow of a series of processes of the control unit 110 in the first embodiment is described below with reference to a flowchart. FIG. 3 is a flowchart showing the flow of a series of processes of the control unit 110 in the first embodiment. The processing of this flowchart may be repeated, for example, at a predetermined cycle.

First, the acquisition unit 112 acquires, from the information providing device 20 via the communication unit 102, a plurality of queries input during use of the search service, the input date and time of each query, the user ID of the user who input each query, and the like (S100), and stores the acquired information in the storage unit 130 as the search query information 132.

Next, the generation unit 114 refers to the search query information 132 and the generator information 136 stored in the storage unit 130 and inputs each of the queries contained in the search query information 132 into the generator defined by the generator information 136, causing the generator to produce a word vector for each query (S102).

For example, when the generator information 136 defines three generators (the first, second, and third generators), the generation unit 114 may input each query into each of the three generators and thereby generate three different word vectors for each query.

Next, the selection unit 116 selects, from among the word vectors generated by the generation unit 114, a word vector to serve as a reference (hereinafter referred to as the reference word vector) (S104). The reference word vector is an example of a "first distributed representation".

Next, the selection unit 116 selects a predetermined number N of word vectors distributed in the neighborhood of the selected reference word vector (S106). Here, "neighborhood" means that, in the multidimensional space spanned by a basis in which each element of the word vector corresponds to a basis vector, the distance to the reference word vector is relatively small.

For example, the selection unit 116 derives the similarity (for example, cosine similarity) between the reference word vector and each of the other word vectors and converts the derived similarity into a distance, thereby selecting the N word vectors closest to the reference word vector. These top N word vectors closest to the reference word vector are an example of a "second distributed representation".
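The neighborhood selection in S106 can be sketched as follows. The conversion `distance = 1 - cosine similarity` is one common choice and is an assumption here, since the description only says that the similarity is converted into a distance. The function names and the dictionary layout (query string mapped to vector) are likewise hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Treat a zero vector as unrelated to everything.
    return dot / (norm_u * norm_v)


def nearest_neighbors(reference_query, vectors, n):
    """Return the n queries whose vectors are closest to the reference."""
    ref = vectors[reference_query]
    scored = []
    for query, vec in vectors.items():
        if query == reference_query:
            continue
        # Assumed conversion from similarity to distance.
        distance = 1.0 - cosine_similarity(ref, vec)
        scored.append((distance, query))
    scored.sort()  # Smallest distance first; ties broken by query string.
    return [query for _, query in scored[:n]]


word_vectors = {
    "a": [1.0, 0.0],
    "b": [1.0, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
top2 = nearest_neighbors("a", word_vectors, 2)
```

A linear scan like this is enough to show the idea; for a large query log an approximate nearest-neighbor index would be the more practical choice.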

FIG. 4 is a diagram schematically showing how word vectors distributed in the neighborhood of the reference word vector are selected. In the figure, V_R denotes the reference word vector. For example, when N = 9, the selection unit 116 selects the nine word vectors V_1 to V_9 closest to the reference word vector V_R in the multidimensional space, as in the illustrated example.

Next, the selection unit 116 derives the degree of relevance between the query from which each of the selected N word vectors was generated and the query from which the reference word vector was generated (S108). For example, the selection unit 116 derives the degree of relevance between queries based on formula (1).

Score(A, B) in formula (1) is a score representing the degree of relevance between queries A and B. One of query A and query B is assumed to be the query from which the reference word vector was generated. As shown in formula (1), Score(A, B) is derived by dividing a numerator, namely the number of users who input both query A and query B (A_user ∩ B_user) divided by the number of users who input query A (A_user), by a denominator, namely the number of users who input query B (B_user) divided by the total number of users (ALL_user).
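Formula (1) itself does not appear in this text. Written out from the verbal description above, it would presumably take the following form, where |X| denotes the number of users in the set X (the bar notation is an assumption made here for readability):

```latex
\[
\mathrm{Score}(A, B) =
  \frac{\dfrac{|A_{\mathrm{user}} \cap B_{\mathrm{user}}|}{|A_{\mathrm{user}}|}}
       {\dfrac{|B_{\mathrm{user}}|}{|ALL_{\mathrm{user}}|}}
\]
```

In this form the numerator is the fraction of query A's users who also input query B, and the denominator is the fraction of all users who input query B. The ratio has the same shape as the lift measure used in association-rule mining: a value above 1 means that users of query A input query B more often than users in general do.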

Because Score(A, B) is based on A_user ∩ B_user, it is also an index value representing the degree to which the users who input query A and the users who input query B overlap (the degree of overlap). The greater the overlap between the users who input two different queries, the larger Score(A, B), that is, the degree of relevance between query A and query B; the smaller the overlap, the smaller Score(A, B).
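The score described above can be computed directly from sets of user IDs. The sketch below follows the verbal definition of Score(A, B); the function name, the use of Python sets, and the choice to return 0.0 when any set is empty are assumptions made for this example.

```python
def relevance_score(users_a, users_b, all_users):
    """Score(A, B) computed from sets of user IDs.

    users_a: users who input query A
    users_b: users who input query B
    all_users: every user in the log
    """
    if not users_a or not users_b or not all_users:
        return 0.0  # Assumed handling of empty sets; not stated in the text.
    overlap_ratio = len(users_a & users_b) / len(users_a)  # numerator
    base_rate = len(users_b) / len(all_users)              # denominator
    return overlap_ratio / base_rate


users_a = {1, 2, 3, 4}
users_b = {3, 4, 5}
all_users = set(range(1, 11))
score = relevance_score(users_a, users_b, all_users)
```

With these sets, half of query A's users also input query B, while only three in ten of all users did, so the score is 0.5 / 0.3.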

FIG. 5 is a diagram schematically showing how, for each query from which one of the N word vectors was generated, a score of its relevance to the query from which the reference word vector was generated is derived. When N = 9, as in the example of FIG. 4 described above, the relevance scores S1 to S9 between the queries from which the nine word vectors V_1 to V_9 were generated and the query from which the reference word vector V_R was generated are derived based on the degree of overlap between the users who input the two queries in each pair.

Having derived, for each query from which one of the N word vectors was generated, the score Score of its relevance to the query from which the reference word vector was generated, the selection unit 116 deletes from the N word vectors those whose relevance score with respect to the query of the reference word vector is below a threshold (S110), and selects those whose score is at or above the threshold as the word vectors related to the reference word vector.
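The thresholding in S110 amounts to a filter over the N neighbors. In this sketch the scores are assumed to have been computed already and are passed in as a dictionary; the function name and the default of 0.0 for a missing score are hypothetical.

```python
def select_related(neighbors, scores, threshold):
    """Keep neighbors whose relevance score is at or above the threshold."""
    return [q for q in neighbors if scores.get(q, 0.0) >= threshold]


neighbors = ["b", "c", "d"]
scores = {"b": 1.7, "c": 0.4, "d": 1.0}
related = select_related(neighbors, scores, threshold=1.0)
```

The comparison is inclusive, matching the description: vectors below the threshold are deleted and those at or above it are kept.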

In this way, by retaining only those word vectors that satisfy not only the condition that the similarity between word vectors is high but also the condition that the overlap between the users who input the source queries is high (that is, the relevance between the queries is high), the accuracy of the distributed representations of queries can be improved.

Next, the output control unit 118 controls the communication unit 102 to transmit to the information providing device 20, as the word vectors related to the reference word vector, those of the N word vectors selected by the selection unit 116 that remain without having been deleted on the basis of query relevance, that is, the word vectors whose relevance score with respect to the query of the reference word vector is at or above the threshold (S112).

Instead of, or in addition to, the word vectors, the output control unit 118 may transmit to the information providing device 20 the queries from which the word vectors to be transmitted were generated. On receiving word vectors or queries from the information processing apparatus 100, the information providing device 20 stores the received information in its own storage device, such as an HDD. Then, when the information providing device 20 receives a search query from a terminal device 10 to which it provides the search service via a web page or the like, it extracts, from the word vectors or queries received from the information processing apparatus 100, those that are highly relevant to the query received from the terminal device 10, and transmits the extracted information to the terminal device 10 as, for example, recommended search queries. This allows the user to use the search service to find content matching their interests.

Next, the control unit 110 determines whether the processing of S104 to S112 has been repeated a predetermined number of times k (S114). If it determines that the processing has not yet been repeated k times, it returns to S104, selects as a new reference word vector a word vector different from the previously selected reference word vector, and, from the top N word vectors similar to the new reference word vector, selects those whose inter-query relevance score Score(A, B) is equal to or greater than the threshold as word vectors related to the new reference word vector. If the control unit 110 determines that the processing of S104 to S112 has been repeated k times, it ends the processing of this flowchart.
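The loop described above can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function names (`cosine`, `user_overlap_score`, `select_related`, `run`), the sample data, and in particular the use of a Jaccard overlap of user sets as a stand-in for Score(A, B) are assumptions made for the example. The sketch returns the surviving queries instead of transmitting them.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def user_overlap_score(users_a, users_b):
    """Assumed stand-in for Score(A, B): Jaccard overlap of the two user sets."""
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0

def select_related(reference, vectors, users, n, threshold):
    """One pass for a single reference query: top-N by similarity, then score filter."""
    candidates = [q for q in vectors if q != reference]
    # Keep the N queries whose word vectors are most similar to the reference.
    candidates.sort(key=lambda q: cosine(vectors[reference], vectors[q]), reverse=True)
    top_n = candidates[:n]
    # Remove candidates whose user-overlap score is below the threshold.
    return [q for q in top_n
            if user_overlap_score(users[reference], users[q]) >= threshold]

def run(references, vectors, users, n, threshold):
    """Repeat the selection once per reference query (k = len(references))."""
    return {r: select_related(r, vectors, users, n, threshold) for r in references}

# Hypothetical word vectors and the users who entered each query.
vectors = {
    "tokyo hotel": [1.0, 0.1],
    "tokyo inn": [0.9, 0.2],
    "tokyo weather": [0.8, 0.3],
    "pasta recipe": [0.0, 1.0],
}
users = {
    "tokyo hotel": {"u1", "u2", "u3"},
    "tokyo inn": {"u2", "u3"},
    "tokyo weather": {"u9"},
    "pasta recipe": {"u1"},
}
result = run(["tokyo hotel"], vectors, users, n=2, threshold=0.5)
print(result)
```

In this toy data, "tokyo weather" is close to the reference in vector space but shares no users with it, so it is removed by the score filter even though it is among the top N.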

FIG. 6 is a diagram schematically showing how the process of selecting word vectors related to a reference word vector is repeated a predetermined number of times k. As in the illustrated example, the set of word vectors of the queries included in the search query information 132 is taken as the population. In each of the first to k-th iterations, the control unit 110 selects from the population N_i word vectors similar to a reference word vector V_Ri, removes from those N_i word vectors the D_i word vectors whose relevance score Score with respect to the query of the reference word vector is below the threshold, and selects the remaining (N_i - D_i) word vectors as word vectors related to the reference word vector V_Ri. Here, i is a number from 1 to k. For example, once word vectors related to the reference word vector V_Ri have been selected from the population in each iteration, a human or other evaluator may judge, under some criterion, whether the query of each selected word vector has the same or a similar meaning as the query of the reference word vector V_Ri, and the present method may be evaluated on that basis. In the illustrated example, for each iteration, the correct answer rate is derived as the number of queries, among the queries of the word vectors related to the reference word vector V_Ri, that have the same or a similar meaning as the query of V_Ri, divided by the total number of queries of the word vectors related to V_Ri.
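As a small illustration of the correct answer rate described above, the following sketch divides the number of selected queries judged to have the same or a similar meaning by the total number of selected queries, that is, by (N_i - D_i). The function name `correct_answer_rate` and the sample judgments are hypothetical.

```python
def correct_answer_rate(selected, judged_same_or_close):
    """Share of selected queries judged same or close in meaning to the reference query."""
    if not selected:
        return 0.0  # nothing was selected, so no rate can be attributed
    correct = sum(1 for q in selected if q in judged_same_or_close)
    return correct / len(selected)

# Hypothetical result of one iteration: (N_i - D_i) = 4 selected queries,
# of which a human evaluator judged 3 to match the reference query's meaning.
selected = ["tokyo inn", "tokyo ryokan", "tokyo weather", "tokyo lodging"]
judged = {"tokyo inn", "tokyo ryokan", "tokyo lodging"}
rate = correct_answer_rate(selected, judged)
print(rate)
```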

FIG. 7 is a diagram showing an example of the correct answer rate of the present method. As described above, "the present method" in the figure refers to a method that, when selecting word vectors related to a reference word vector from the population, first selects the top N word vectors similar to the reference word vector and then further narrows down those N word vectors based on the inter-query relevance score Score(A, B). The "comparison method", by contrast, first selects the top N word vectors similar to the reference word vector and treats those N word vectors, without further narrowing, as the word vectors related to the reference word vector.

For example, when the word vectors forming the population were generated using the first generator, the correct answer rate of the word vectors selected as related to the reference word vector was 84% with the present method and 76% with the comparison method, a difference of +8 percentage points. When the word vectors forming the population were generated using the second generator, the correct answer rate was 66% with the present method and 30% with the comparison method, a difference of +36 percentage points. When the word vectors forming the population were generated using the third generator, the correct answer rate was 81% with the present method and 67% with the comparison method, a difference of +14 percentage points. Thus, compared with the comparison method, the present method can select distributed representations whose meanings are closer to each other.

According to the first embodiment described above, a plurality of queries entered to search for information are acquired, a word vector is generated as the distributed representation of each acquired query, and word vectors related to each other are selected from the generated word vectors based on the degree of duplication of the users who entered the respective queries. This makes it possible to improve the accuracy of the distributed representations of queries.

<Second Embodiment>
The second embodiment is described below. In the first embodiment described above, after the top N word vectors similar to the reference word vector are selected, the N word vectors are narrowed down based on the degree of duplication of the users who entered both the query from which each of the N word vectors was generated and the query from which the reference word vector was generated (the inter-query relevance score Score(A, B)).

In contrast, the second embodiment differs from the first embodiment in that, after the top N word vectors similar to the reference word vector are selected, the N word vectors are narrowed down based on the similarity between the users who entered the query from which each of the N word vectors was generated and the users who entered the query from which the reference word vector was generated. The following description focuses on the differences from the first embodiment, and description of points in common with the first embodiment is omitted. In the description of the second embodiment, parts identical to those of the first embodiment are denoted by the same reference numerals.

The generation unit 114 of the second embodiment generates a user vector as the distributed representation of a user who entered a query, in the same way that it generates a word vector as the distributed representation of a query. Specifically, the generation unit 114 may refer to the action history information 134 and generate, from each user's action history, a multidimensional vector whose elements are values such as the number of times content was viewed, as the user vector.

Once the generation unit 114 has generated a user vector for each user, the selection unit 116 of the second embodiment selects the top N word vectors similar to the reference word vector and then selects, from the N word vectors, word vectors related to the reference word vector based on the similarity between the users who entered the query from which each of the N word vectors was generated and the users who entered the query from which the reference word vector was generated. For example, the selection unit 116 derives the cosine similarity between the user vector of a user A who entered the query from which one of the N word vectors was generated and the user vector of a user B who entered the query from which the reference word vector was generated, and selects the word vectors corresponding to combinations of user vectors whose derived cosine similarity is equal to or greater than a threshold as word vectors related to the reference word vector.
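The user-similarity filter of the second embodiment might look roughly like the sketch below. The text does not specify how to handle a query entered by several users, so the sketch assumes, purely for illustration, that each query is represented by the element-wise mean of the user vectors of the users who entered it. All names (`mean_vector`, `filter_by_user_similarity`) and the sample viewing counts are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def filter_by_user_similarity(reference, top_n, query_users, user_vectors, threshold):
    """Keep candidates whose users resemble the users of the reference query."""
    # Assumption: a query's users are summarized by the mean of their user vectors.
    ref_vec = mean_vector([user_vectors[u] for u in query_users[reference]])
    kept = []
    for q in top_n:
        cand_vec = mean_vector([user_vectors[u] for u in query_users[q]])
        if cosine(ref_vec, cand_vec) >= threshold:
            kept.append(q)
    return kept

# Hypothetical user vectors: viewing counts per content category.
user_vectors = {
    "u1": [5, 0, 1],
    "u2": [4, 1, 0],
    "u3": [0, 6, 2],
}
query_users = {
    "tokyo hotel": ["u1"],
    "tokyo inn": ["u2"],
    "pasta recipe": ["u3"],
}
kept = filter_by_user_similarity(
    "tokyo hotel", ["tokyo inn", "pasta recipe"], query_users, user_vectors, threshold=0.8
)
print(kept)
```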

According to the second embodiment described above, a plurality of queries entered to search for information are acquired, a word vector is generated as the distributed representation of each acquired query, and word vectors related to each other are selected from the generated word vectors based on the similarity of the users who entered the respective queries. As in the embodiment described above, this makes it possible to improve the accuracy of the distributed representations of queries.

<Third Embodiment>
The third embodiment is described below. In the first embodiment described above, after the top N word vectors similar to the reference word vector are selected, the N word vectors are narrowed down based on the degree of duplication of the users who entered both the query from which each of the N word vectors was generated and the query from which the reference word vector was generated (the inter-query relevance score Score(A, B)). In the second embodiment described above, after the top N word vectors similar to the reference word vector are selected, the N word vectors are narrowed down based on the similarity between the users who entered the query from which each of the N word vectors was generated and the users who entered the query from which the reference word vector was generated.

In contrast, the third embodiment differs from the first and second embodiments in that, after the top N word vectors similar to the reference word vector are selected, the N word vectors are narrowed down based on the times at which the queries from which the N word vectors were generated were entered by users. The following description focuses on the differences from the first and second embodiments, and description of points in common with them is omitted. In the description of the third embodiment, parts identical to those of the first and second embodiments are denoted by the same reference numerals.

After selecting the top N word vectors similar to the reference word vector, the selection unit 116 of the third embodiment selects, from the N word vectors, word vectors related to the reference word vector based on the times at which the queries from which the N word vectors were generated were entered by users. For example, when a user has entered a query A and a query B, the selection unit 116 may increase the inter-query relevance score Score(A, B) if the times at which queries A and B were entered can be regarded as the same timing, and may decrease Score(A, B) if those times can be regarded as different timings.

Specifically, if the times at which queries A and B were entered fall within the same session, the selection unit 116 may regard queries A and B as having been entered at the same timing and increase the inter-query relevance score Score(A, B). If the times at which queries A and B were entered fall within different sessions, the selection unit 116 may regard queries A and B as having been entered at different timings and decrease Score(A, B).

A session may be, for example, the period from when a user views certain content until a conversion occurs. A conversion is, for example, purchasing content such as a product, watching content such as a video, or viewing content such as an advertisement. A session may also be the period from when the user views certain content until the user views other content, the period from when the user views certain content until an application such as a web browser is closed, or the period from when the user views certain content until a predetermined time elapses (timeout).
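Of the session definitions listed above, the timeout-based one is the simplest to sketch. The following example is an assumption-laden illustration: it treats a session as a run of one user's events in which consecutive events are no more than `timeout_seconds` apart, which is one possible reading of the timeout definition. The function name `split_into_sessions`, the 1800-second timeout, and the event data are hypothetical.

```python
def split_into_sessions(events, timeout_seconds):
    """Group one user's (timestamp, query) events into sessions by inactivity gap."""
    sessions = []
    for ts, query in sorted(events):
        # Continue the current session if the gap since its last event is short enough.
        if sessions and ts - sessions[-1][-1][0] <= timeout_seconds:
            sessions[-1].append((ts, query))
        else:
            sessions.append([(ts, query)])
    return sessions

# Hypothetical query log for one user (timestamps in seconds).
events = [(0, "tokyo hotel"), (120, "tokyo inn"), (4000, "pasta recipe")]
sessions = split_into_sessions(events, timeout_seconds=1800)
print(sessions)
```

Queries that end up in the same inner list would be treated as entered at the same timing, and queries in different lists as entered at different timings.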

The selection unit 116 may also make the inter-query relevance score Score(A, B) larger the shorter the period from when query A is entered until query B is entered, and smaller the longer that period.
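The text does not say how the score should depend on the elapsed time, only that it should shrink as the gap grows. One possible choice, shown here strictly as an assumption, is an exponential decay with a configurable half-life. The function name `time_adjusted_score` and the 600-second half-life are hypothetical.

```python
def time_adjusted_score(base_score, t_a, t_b, half_life_seconds=600.0):
    """Scale Score(A, B) down as the gap between the two input times grows.

    Assumed form: the score halves every `half_life_seconds` of separation.
    """
    gap = abs(t_b - t_a)
    return base_score * 0.5 ** (gap / half_life_seconds)

# Same base score, entered one minute apart versus 100 minutes apart.
near = time_adjusted_score(0.8, t_a=0, t_b=60)
far = time_adjusted_score(0.8, t_a=0, t_b=6000)
print(near, far)
```

Any monotonically decreasing function of the gap would satisfy the description equally well; the half-life form is chosen here only because it has a single, easily interpreted parameter.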

Because the inter-query relevance score Score(A, B) is made larger the closer the input times of the queries are, word vectors with low relevance to the reference word vector can be removed when word vectors related to the reference word vector are selected from the N word vectors.

According to the third embodiment described above, a plurality of queries entered to search for information are acquired, a word vector is generated as the distributed representation of each acquired query, and word vectors related to each other are selected from the generated word vectors based on the input times of the respective queries. As in the embodiments described above, this makes it possible to improve the accuracy of the distributed representations of queries.

<Fourth Embodiment>
The fourth embodiment is described below. The fourth embodiment differs from the first to third embodiments in that each word vector is weighted based on the degree of duplication of the users who entered both the query from which each of the word vectors forming the population was generated and the query from which the reference word vector was generated (the inter-query relevance score Score(A, B)). The following description focuses on the differences from the first to third embodiments, and description of points in common with them is omitted. In the description of the fourth embodiment, parts identical to those of the first to third embodiments are denoted by the same reference numerals.

The selection unit 116 of the fourth embodiment derives the inter-query relevance score Score(A, B) based on the degree of duplication of the users who entered both the query from which each of the word vectors forming the population was generated and the query from which the reference word vector was generated. The selection unit 116 then weights each of the word vectors according to the inter-query relevance score Score(A, B).

For example, the larger the degree of duplication of the users, the larger the weighting coefficient by which the selection unit 116 multiplies the word vector, so that its similarity (for example, cosine similarity) to the reference word vector becomes larger. Conversely, the smaller the degree of duplication of the users, the smaller the weighting coefficient by which the selection unit 116 multiplies the word vector, so that its similarity (for example, cosine similarity) to the reference word vector becomes smaller.
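One practical point when sketching this weighting: multiplying a vector by a positive scalar does not change its cosine similarity to another vector. The sketch below therefore assumes that the weight is applied to the similarity value itself. This is an interpretation made for the example, not something the text states, and the linear mapping from overlap score to weight is likewise hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def weighted_similarity(reference_vec, candidate_vec, overlap_score):
    """Cosine similarity scaled by a weight that grows with the user overlap score."""
    # Hypothetical mapping of an overlap score in [0, 1] to a weight in [0.5, 1.0].
    weight = 0.5 + 0.5 * overlap_score
    return weight * cosine(reference_vec, candidate_vec)

ref = [1.0, 0.0]
cand = [0.8, 0.6]

# Scaling the candidate vector alone leaves plain cosine similarity unchanged.
scaled = [3.0 * x for x in cand]
unchanged = math.isclose(cosine(ref, cand), cosine(ref, scaled))

# Applying the weight to the similarity value does change the ranking signal.
high = weighted_similarity(ref, cand, overlap_score=0.9)
low = weighted_similarity(ref, cand, overlap_score=0.1)
print(unchanged, high, low)
```

If the similarity measure were an unnormalized dot product rather than cosine similarity, scaling the word vector directly would have the same effect as the weight used here.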

After weighting each of the word vectors forming the population, the selection unit 116 selects, from the weighted word vectors, the top N word vectors similar to the reference word vector. It then selects, from the N word vectors, word vectors related to the reference word vector based on, for example, the relevance between the query from which each of the selected N word vectors was generated and the query from which the reference word vector was generated, the similarity between the users who entered those queries, or the input times of those queries. Weighting the word vectors forming the population in advance in this way makes it possible, when selecting the top N word vectors similar to the reference word vector from the population, to pick out word vectors whose meanings are closer to that of the reference word vector. As a result, as in the embodiments described above, the accuracy of the distributed representations of queries can be improved.

<Hardware Configuration>
The information processing device 100 of the embodiments described above is realized by, for example, a hardware configuration such as that shown in FIG. 8. FIG. 8 is a diagram showing an example of the hardware configuration of the information processing device 100 of the embodiments.

In the information processing device 100, a NIC 100-1, a CPU 100-2, a RAM 100-3, a ROM 100-4, a secondary storage device 100-5 such as a flash memory or an HDD, and a drive device 100-6 are connected to one another by an internal bus or a dedicated communication line. A portable storage medium such as an optical disc is loaded into the drive device 100-6. The control unit 110 is realized when a program stored in the secondary storage device 100-5, or in a portable storage medium loaded into the drive device 100-6, is loaded into the RAM 100-3 by a DMA controller (not shown) or the like and executed by the CPU 100-2. The program referred to by the control unit 110 may be downloaded from another device via the network NW.

Although modes for carrying out the present invention have been described above using embodiments, the present invention is in no way limited to these embodiments, and various modifications and substitutions can be made without departing from the gist of the present invention.

1 ... Information processing system, 10 ... Terminal device, 20 ... Information providing device, 100 ... Information processing device, 102 ... Communication unit, 110 ... Control unit, 112 ... Acquisition unit, 114 ... Generation unit, 116 ... Selection unit, 118 ... Output control unit, 130 ... Storage unit

Claims

An information processing device comprising:
an acquisition unit that acquires a plurality of queries entered to search for information;
a generation unit that generates a distributed representation of each of the plurality of queries acquired by the acquisition unit; and
a selection unit that selects distributed representations related to each other from the plurality of distributed representations generated by the generation unit, based on a degree of duplication of the users who entered each of the plurality of queries,
wherein the selection unit
weights each of the plurality of distributed representations generated by the generation unit based on the degree of duplication of the users,
assigns a larger weight to a distributed representation as the degree of duplication of the users is larger, and a smaller weight to a distributed representation as the degree of duplication of the users is smaller, and
selects the distributed representations related to each other from the plurality of weighted distributed representations.

The information processing device according to claim 1, wherein the selection unit
selects, from the plurality of distributed representations, a predetermined number of top-ranked second distributed representations similar to a first distributed representation serving as a reference, and
selects, from the selected predetermined number of second distributed representations, a second distributed representation related to the first distributed representation, based on the degree of duplication of the users who entered the query from which the second distributed representation was generated.

The information processing device according to claim 2, wherein the selection unit selects, from the predetermined number of second distributed representations, a second distributed representation generated from a query for which the degree of duplication of the users is equal to or greater than a threshold, as the second distributed representation related to the first distributed representation.

The information processing device according to claim 1, wherein the selection unit
weights each of the plurality of distributed representations generated by the generation unit so that the distributed representation becomes more similar to a first distributed representation serving as a reference as the degree of duplication of the users is larger, and
selects, from the plurality of weighted distributed representations, a second distributed representation related to the first distributed representation.

The information processing device according to any one of claims 1 to 4, wherein the selection unit further selects distributed representations related to each other from the plurality of distributed representations generated by the generation unit, based on a similarity of the users who entered each of the plurality of queries.

The information processing device according to any one of claims 1 to 5, wherein the selection unit further selects distributed representations related to each other from the plurality of distributed representations generated by the generation unit, based on the times at which each of the plurality of queries was entered by the users.

An information processing device comprising:
an acquisition unit that acquires a plurality of queries entered to search for information;
a generation unit that generates a distributed representation of each of the plurality of queries acquired by the acquisition unit; and
a selection unit that selects distributed representations related to each other from the plurality of distributed representations generated by the generation unit, based on a similarity of the users who entered each of the plurality of queries,
wherein the selection unit
further selects distributed representations related to each other from the plurality of distributed representations generated by the generation unit, based on a degree of duplication of the users who entered each of the plurality of queries,
weights each of the plurality of distributed representations generated by the generation unit based on the degree of duplication of the users,
assigns a larger weight to a distributed representation as the degree of duplication of the users is larger, and a smaller weight to a distributed representation as the degree of duplication of the users is smaller, and
selects the distributed representations related to each other from the plurality of weighted distributed representations.

An information processing device comprising:
an acquisition unit that acquires a plurality of queries entered to search for information;
a generation unit that generates a distributed representation of each of the plurality of queries acquired by the acquisition unit; and
a selection unit that selects distributed representations related to each other from the plurality of distributed representations generated by the generation unit, based on the times at which each of the plurality of queries was entered by users,
wherein the selection unit
further selects distributed representations related to each other from the plurality of distributed representations generated by the generation unit, based on a degree of duplication of the users who entered each of the plurality of queries,
weights each of the plurality of distributed representations generated by the generation unit based on the degree of duplication of the users,
assigns a larger weight to a distributed representation as the degree of duplication of the users is larger, and a smaller weight to a distributed representation as the degree of duplication of the users is smaller, and
selects the distributed representations related to each other from the plurality of weighted distributed representations.

An information processing device comprising:
an acquisition unit that acquires a plurality of queries entered to search for information;
a generation unit that generates a distributed representation of each of the plurality of queries acquired by the acquisition unit; and
a selection unit that selects distributed representations related to each other from the plurality of distributed representations generated by the generation unit, based on a predetermined index value for the users who entered each of the plurality of queries or a predetermined index value for each of the plurality of queries,
wherein the selection unit
further selects distributed representations related to each other from the plurality of distributed representations generated by the generation unit, based on a degree of duplication of the users who entered each of the plurality of queries,
weights each of the plurality of distributed representations generated by the generation unit based on the degree of duplication of the users,
assigns a larger weight to a distributed representation as the degree of duplication of the users is larger, and a smaller weight to a distributed representation as the degree of duplication of the users is smaller, and
selects the distributed representations related to each other from the plurality of weighted distributed representations.

An information processing method in which a computer:
acquires a plurality of queries entered to search for information;
generates a distributed representation of each of the acquired plurality of queries;
selects distributed representations related to each other from the generated plurality of distributed representations, based on a degree of duplication of the users who entered each of the plurality of queries;
weights each of the generated plurality of distributed representations based on the degree of duplication of the users;
assigns a larger weight to a distributed representation as the degree of duplication of the users is larger, and a smaller weight to a distributed representation as the degree of duplication of the users is smaller; and
selects the distributed representations related to each other from the plurality of weighted distributed representations.

A program that causes a computer to execute:
a process of acquiring a plurality of queries entered to search for information;
a process of generating a distributed representation of each of the acquired plurality of queries;
a process of selecting distributed representations related to each other from the generated plurality of distributed representations, based on a degree of duplication of the users who entered each of the plurality of queries;
a process of weighting each of the generated plurality of distributed representations based on the degree of duplication of the users;
a process of assigning a larger weight to a distributed representation as the degree of duplication of the users is larger, and a smaller weight to a distributed representation as the degree of duplication of the users is smaller; and
a process of selecting the distributed representations related to each other from the plurality of weighted distributed representations.