JP2022013256A

JP2022013256A - Keyword extraction apparatus, keyword extraction method, and keyword extraction program

Info

Publication number: JP2022013256A
Application number: JP2020115682A
Authority: JP
Inventors: 勇太萩尾; Yuta Hagio; 豊金子; Yutaka Kaneko
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2020-07-03
Filing date: 2020-07-03
Publication date: 2022-01-18
Anticipated expiration: 2040-07-03
Also published as: JP7483532B2

Abstract

To provide a keyword extraction apparatus configured to properly extract a keyword related to a video, a keyword extraction method, and a keyword extraction program. SOLUTION: A keyword extraction apparatus 1 includes: a keyword candidate word extraction unit 11 which extracts keyword candidate words from an input text in a video; a mask text generation unit 12 which generates a mask text formed by masking the keyword candidate words in the input text; a mask estimation unit 13 which calculates estimation values by estimating each of the masked keyword candidate words, in the mask text; a high-saliency object determination unit 16 which outputs a class name of an object detected from the video; a relevance score calculation unit 17 which calculates relevance scores between the class name and the keyword candidate words on the basis of the estimation values of each of the masked words; and a keyword output unit 18 which outputs a keyword candidate word having the highest relevance score, as a keyword related to the video. SELECTED DRAWING: Figure 1

Description

The present invention relates to a device, a method, and a program for extracting keywords to be spoken by a robot.

Conventionally, techniques have been studied in which a robot that watches television or the like together with a person makes utterances in line with the content of the program. Such a robot extracts keywords from program information according to predetermined rules.

For example, Patent Document 1 proposes a method of extracting keywords from the subtitle text of a program. Non-Patent Document 1 proposes a method of detecting objects in a video, simultaneously estimating the saliency of each detected object, and extracting keywords according to that saliency.
In addition to these methods, attempts have been made to extract appropriate keywords by using cloud services for, for example, speech recognition, person recognition, object recognition, and character recognition in parallel.

Patent Document 1: Japanese Unexamined Patent Publication No. 2018-190077

Non-Patent Document 1: Yuta Hagio et al., "Design and Prototyping of Communication Robots for Simultaneous Viewing Experiments of Humans and Robots", Annual Convention of the Institute of Image Information and Television Engineers (ITE), 2019.

However, it is difficult to judge which of the keywords extracted by multiple methods suits the program content, and conventionally a keyword has been chosen at random from those obtained.
In addition, although using saliency can be expected to select an object suited to the program content, the types of detectable objects are limited to a few hundred classes. As a result, the same keywords are extracted over and over, and the same utterances are generated over and over.

An object of the present invention is to provide a keyword extraction device, a keyword extraction method, and a keyword extraction program capable of appropriately extracting keywords related to a video.

The keyword extraction device according to the present invention includes: a keyword candidate word extraction unit that extracts, from an input sentence accompanying a video, words registered in a keyword database as keyword candidate words; a mask sentence generation unit that generates a mask sentence in which the keyword candidate words in the input sentence are masked; a mask estimation unit that uses a learning model to calculate, for each masked position in the mask sentence, a per-word estimated value for the keyword candidate word at that position; an object determination unit that outputs the class name of an object detected from the video; a relevance score calculation unit that calculates, based on the per-word estimated values calculated by the mask estimation unit for each masked position, a relevance score between the class name and the keyword candidate word at each masked position; and a keyword output unit that outputs the keyword candidate word with the highest relevance score as a keyword related to the video.

The mask estimation unit may calculate the estimated values on a text formed by adding past input sentences to the mask sentence.

The mask sentence generation unit may generate mask sentences in each of which only one of the keyword candidate words contained in the input sentence is masked, producing as many mask sentences as there are keyword candidate words, and the mask estimation unit may calculate the per-word estimated values on a text that contains only one mask sentence.

The mask sentence generation unit may generate the mask sentence from the most recent input sentence at the timing when an object is detected from the video.

The keyword output unit need not output a keyword related to the video when the maximum relevance score is below a predetermined threshold.

The keyword extraction device may include a saliency estimation unit that uses a learning model to assign a saliency score to each pixel of the video, and the object determination unit may output the class name of the object that, among a plurality of objects detected from the video, lies in the region with the highest evaluation based on the saliency score.

The keyword extraction device may include a viewpoint position estimation unit that matches a camera image, to which the coordinates of the user's viewpoint position have been added, against the video and thereby assigns a gaze-point score with a predetermined distribution to each pixel of the video, and the object determination unit may output the class name of the object that, among a plurality of objects detected from the video, lies in the region with the highest evaluation based on the gaze-point score.

The relevance score calculation unit may calculate the cosine similarity between the distributed representation vector of the class name and the distributed representation vector of each of a predetermined number of words with the highest estimated values, and use the average as the relevance score.

The relevance score calculation unit may acquire, for each of a plurality of representative object names registered in advance for the class name, the estimated value of the identical word, and use the average as the relevance score.

In the keyword extraction method according to the present invention, a computer executes: a keyword candidate word extraction step of extracting, from an input sentence accompanying a video, words registered in a keyword database as keyword candidate words; a mask sentence generation step of generating a mask sentence in which the keyword candidate words in the input sentence are masked; a mask estimation step of using a learning model to calculate, for each masked position in the mask sentence, a per-word estimated value for the keyword candidate word at that position; an object determination step of outputting the class name of an object detected from the video; a relevance score calculation step of calculating, based on the per-word estimated values calculated by the mask estimation unit for each masked position, a relevance score between the class name and the keyword candidate word at each masked position; and a keyword output step of outputting the keyword candidate word with the highest relevance score as a keyword related to the video.

The keyword extraction program according to the present invention causes a computer to function as the keyword extraction device.

According to the present invention, keywords related to a video can be appropriately extracted.

FIG. 1 is a diagram showing the functional configuration of the keyword extraction device in the first embodiment.
FIG. 2 is a flowchart illustrating the flow of the keyword extraction method in the first embodiment.
FIG. 3 is a diagram illustrating the relationship between keyword candidate words and detected objects in the first embodiment.
FIG. 4 is a diagram illustrating the procedure for determining a keyword by relevance score in the first embodiment.
FIG. 5 is a flowchart illustrating the flow of the keyword extraction method in the second embodiment.
FIG. 6 is a diagram showing the functional configuration of the keyword extraction device in the third embodiment.

[First Embodiment]
Hereinafter, the first embodiment of the present invention will be described.
The first embodiment provides a keyword extraction device 1 that is built into, for example, a robot that watches broadcast programs with video, such as television, together with a human, and that is used for utterance generation.
The keyword extraction device 1 extracts the object with the highest saliency from among the objects detected in the video of a broadcast program, extracts keyword candidate words from the audio, subtitle text, or the like, and ranks the relevance scores between the highest-saliency object and the keyword candidate words, thereby outputting a keyword related to the video.

FIG. 1 is a diagram showing the functional configuration of the keyword extraction device 1 in the present embodiment.
The keyword extraction device 1 is an information processing device that has a control unit, a storage unit, and various interfaces. The functions of the present embodiment are realized when the control unit executes software (the keyword extraction program) stored in the storage unit.

The control unit of the keyword extraction device 1 includes a keyword candidate word extraction unit 11, a mask sentence generation unit 12, a mask estimation unit 13, a saliency estimation unit 14, an object detection unit 15, a high-saliency object determination unit 16 (object determination unit), a relevance score calculation unit 17, and a keyword output unit 18.

The storage unit of the keyword extraction device 1 holds, in addition to the keyword extraction program, a keyword database (DB) 21, an input sentence memory 22, a word estimation model 23, a saliency estimation model 24, an object detection model 25, and a distributed representation vector database (DB) 26.

The keyword candidate word extraction unit 11 divides an input sentence, such as subtitle text input along with the video or the speech recognition result of the television audio, into words. It then checks whether any of the resulting words is registered in the keyword DB 21 prepared in advance. If so, it extracts that word as a "keyword candidate word" and stores it in the input sentence memory 22 in association with the input sentence.
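The lookup described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `extract_candidate_positions`, the pre-tokenized word list, and the use of a Python set for the keyword DB 21 and a list for the input sentence memory 22 are all assumptions. A real system for Japanese text would first run a morphological analyzer to split the sentence into words.

```python
from typing import List, Sequence, Set, Tuple


def extract_candidate_positions(
    words: Sequence[str], keyword_db: Set[str]
) -> List[Tuple[int, str]]:
    """Return (position, word) for every word registered in the keyword DB."""
    return [(i, w) for i, w in enumerate(words) if w in keyword_db]


# Hypothetical, already-tokenized subtitle sentence (tokenization is assumed).
words = ["Tanaka", "'s", "lunch", "today", "is", "French toast", "."]
keyword_db = {"Tanaka", "French toast", "curry"}  # stands in for keyword DB 21

input_sentence_memory = []  # stands in for input sentence memory 22
candidates = extract_candidate_positions(words, keyword_db)
# Store the candidates together with the sentence they came from.
input_sentence_memory.append({"words": list(words), "candidates": candidates})
```

Keeping the position of each candidate, rather than only the word, makes it straightforward to mask exactly that position later.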

The mask sentence generation unit 12 retrieves from the input sentence memory 22 a text that includes the most recent input sentence and fits within the input upper limit of the downstream mask estimation unit 13 (for example, 512 words). It then generates a mask sentence by replacing a keyword candidate word in the most recent input sentence with a predetermined character string (for example, [MASK]).
This process is executed periodically (for example, about every 5 seconds) at a timing synchronized with the processing of the object detection unit 15.

Here, the mask sentence generation unit 12 generates mask sentences in each of which only one of the keyword candidate words in the most recent input sentence is masked. If the input sentence contains multiple keyword candidate words, it performs that many conversions and generates that many mask sentences.
The mask sentence generation unit 12 then provides the text consisting of multiple input sentences, the keyword candidate words, and the mask sentences to the mask estimation unit 13.
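The one-mask-per-sentence generation and the length-limited context described above might look like the sketch below. The helper names `make_mask_sentences` and `build_context`, the word-list representation, and the policy of prepending whole past sentences from newest to oldest until the limit would be exceeded are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

MASK = "[MASK]"


def make_mask_sentences(
    words: Sequence[str], candidate_positions: Sequence[int]
) -> List[Tuple[str, List[str]]]:
    """Create one mask sentence per candidate, masking only that candidate."""
    results = []
    for pos in candidate_positions:
        masked = list(words)  # copy so the original sentence is untouched
        masked[pos] = MASK
        results.append((words[pos], masked))
    return results


def build_context(
    past_sentences: Sequence[Sequence[str]],
    masked: Sequence[str],
    max_words: int = 512,
) -> List[str]:
    """Prepend past sentences, newest first, while staying within max_words."""
    context = list(masked)
    for sentence in reversed(past_sentences):
        if len(sentence) + len(context) > max_words:
            break
        context = list(sentence) + context
    return context


words = ["Tanaka", "'s", "lunch", "today", "is", "French toast", "."]
mask_sentences = make_mask_sentences(words, [0, 5])
```

Each entry of `mask_sentences` pairs the hidden candidate word with its mask sentence, so a later stage can attribute a score back to the right candidate.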

The mask estimation unit 13 uses the trained word estimation model 23 to calculate, on a text that includes a mask sentence, a per-word estimated value for the keyword candidate word at the masked position.
The word estimation model 23 may be, for example, a pre-trained BERT model, in which the pre-training task "Masked LM" estimates a masked word in a text from the surrounding text. The BERT model is proposed in Reference A below. The estimation result is a list of estimated values, one for each word in the model's vocabulary (for example, about 30,000 words), where each value lies between 0 and 1 and the values for all words sum to 1.

Reference A: J. Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", NAACL-HLT 2019.

When multiple mask sentences are input, the mask estimation unit 13 processes them one at a time.
That is, the mask estimation unit 13 calculates the per-word estimated values for the masked position on a text that contains only one mask sentence.
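To make the shape of the estimation result concrete, the sketch below turns raw per-word scores into values between 0 and 1 that sum to 1 and picks the highest-ranked words. The text does not say how the values are normalized; softmax is used here as an assumption because it is the usual way a masked language model yields such a distribution. The toy scores and the four-word vocabulary are invented for illustration and are not model output.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Normalize raw scores into values in [0, 1] that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - peak) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


def top_k(estimates: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k words with the highest estimated values."""
    return sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Invented raw scores for the masked position of "French toast".
toy_scores = {"bento": 2.0, "onigiri": 1.5, "sandwich": 1.0, "Sato": -1.0}
estimates = softmax(toy_scores)
best = top_k(estimates, 3)
```

With a real model, `toy_scores` would be replaced by the model's output for the masked position over its full vocabulary.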

The saliency estimation unit 14 performs saliency estimation on a captured image of the input video using the saliency estimation model 24 trained in advance.
For the saliency estimation model 24, for example, a computer implementation of the physiological model proposed in Reference B or the deep-learning method proposed in Reference C can be applied. As output, each pixel of the captured image is given a saliency score estimated in the range of 0 to 1.

Reference B: L. Itti et al., "A model of saliency-based visual attention for rapid scene analysis", PAMI 1998.
Reference C: Q. Hou et al., "Deeply Supervised Salient Object Detection with Short Connections", PAMI 2019.

The object detection unit 15 periodically (for example, about every 5 seconds) performs object detection on a captured image of the input video using the object detection model 25 trained in advance.
For the object detection model 25, for example, the learning methods proposed in References D, E, and F can be applied. The output is, for each of the detected objects, its coordinate information (a rectangular region) and a class name indicating its category (for example, human, dog, or cake).

Reference D: S. Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", NIPS 2015.
Reference E: J. Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection", CVPR 2016.
Reference F: W. Liu et al., "SSD: Single Shot MultiBox Detector", ECCV 2016.

The high-saliency object determination unit 16 uses the estimation result from the saliency estimation unit 14 and the detection result from the object detection unit 15 to determine, among the objects detected from the video, the most salient object, that is, the one in the region with the highest evaluation based on the saliency score.
Specifically, the high-saliency object determination unit 16 calculates, for each detected object, the average saliency score of the pixels within its rectangular region, determines the object with the highest average as the "high-saliency object", and outputs the class name of that object.
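The averaging rule above can be sketched with plain lists. The box convention `(x_min, y_min, x_max, y_max)` with exclusive upper bounds, the function names, and the tiny 4x2 saliency map are assumptions for illustration; the sketch also assumes every box covers at least one pixel.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max exclusive


def mean_saliency(saliency_map: Sequence[Sequence[float]], box: Box) -> float:
    """Average the per-pixel saliency scores inside a rectangular region."""
    x0, y0, x1, y1 = box
    values = [saliency_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(values) / len(values)


def pick_high_saliency_object(
    saliency_map: Sequence[Sequence[float]],
    detections: List[Tuple[str, Box]],
) -> str:
    """Return the class name of the detection with the highest mean saliency."""
    return max(detections, key=lambda d: mean_saliency(saliency_map, d[1]))[0]


# Toy saliency map (2 rows x 4 columns) with scores in the range 0 to 1.
saliency_map = [
    [0.1, 0.1, 0.8, 0.9],
    [0.1, 0.2, 0.9, 1.0],
]
detections = [("human", (0, 0, 2, 2)), ("cake", (2, 0, 4, 2))]
chosen = pick_high_saliency_object(saliency_map, detections)
```

In this toy map the right half is more salient, so the "cake" box wins over the "human" box.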

The relevance score calculation unit 17 calculates, based on the per-word estimated values calculated by the mask estimation unit 13 for each masked position, a relevance score between the class name output from the high-saliency object determination unit 16 and the keyword candidate word at each masked position. In other words, it calculates the relevance scores used to select, from among the keyword candidate words, the word most closely related to the high-saliency object.

Specifically, the relevance score calculation unit 17 refers to the distributed representation vector DB 26 trained in advance and calculates the cosine similarity between the distributed representation vector of the class name of the high-saliency object and the distributed representation vector of each of a predetermined number (for example, about 10) of words with the highest estimated values. It then takes the average of the calculated cosine similarities as the relevance score between the high-saliency object and the keyword candidate word.
Existing methods such as Word2Vec or FastText can be used for the distributed representation of words.
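The averaged cosine similarity can be written out as a short sketch. The two-dimensional vectors below are invented toy values, not Word2Vec or FastText output, and the dictionary standing in for the distributed representation vector DB 26 is an assumption. The sketch assumes no vector has zero length.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relevance_score(
    class_vec: Sequence[float], word_vecs: List[Sequence[float]]
) -> float:
    """Mean cosine similarity between the class name and the top-ranked words."""
    sims = [cosine(class_vec, v) for v in word_vecs]
    return sum(sims) / len(sims)


# Toy 2-D embeddings: first axis is roughly "food", second roughly "surname".
embeddings: Dict[str, Sequence[float]] = {
    "cake": (0.9, 0.1),
    "bento": (0.8, 0.3), "onigiri": (0.7, 0.2), "sandwich": (0.9, 0.2),
    "Sato": (0.1, 0.9), "Suzuki": (0.2, 0.8), "Watanabe": (0.0, 1.0),
}
# Highest-ranked estimated words for each masked candidate (illustrative).
top_words = {
    "French toast": ["bento", "onigiri", "sandwich"],
    "Tanaka": ["Sato", "Suzuki", "Watanabe"],
}
scores = {
    cand: relevance_score(embeddings["cake"], [embeddings[w] for w in ws])
    for cand, ws in top_words.items()
}
```

Here the candidate whose predicted replacements are food words scores higher against the class name "cake" than the candidate whose replacements are surnames.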

The keyword output unit 18 outputs the keyword candidate word with the highest relevance score calculated by the relevance score calculation unit 17 as a keyword related to the input video.
Here, the keyword output unit 18 may refrain from outputting a keyword related to the video when the maximum relevance score is below a predetermined threshold.
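A minimal sketch of this selection rule follows. The function name `select_keyword`, the use of `None` to represent "no keyword output", and the threshold value in the usage lines are assumptions; the text does not specify a threshold.

```python
from typing import Dict, Optional


def select_keyword(scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the top-scoring candidate, or None if it is below the threshold."""
    if not scores:
        return None  # no candidates were extracted from the input sentence
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


related = select_keyword({"Tanaka": 0.2, "French toast": 0.9}, threshold=0.5)
unrelated = select_keyword({"Tanaka": 0.2, "French toast": 0.3}, threshold=0.5)
```

Returning nothing when every score is low lets the caller skip an utterance rather than speak about a word unrelated to what is on screen.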

FIG. 2 is a flowchart illustrating the flow of the keyword extraction method in the present embodiment.
In this example, the subtitles and the video that make up a television program or the like are input in parallel in steps S1 and S5, respectively.

In step S1, the control unit of the keyword extraction device 1 acquires the subtitle text of the program being played as an input sentence.

In step S2, the keyword candidate word extraction unit 11 divides the subtitle text into words.
In step S3, the keyword candidate word extraction unit 11 extracts keyword candidate words from the divided words.
In step S4, the keyword candidate word extraction unit 11 stores the subtitle text and the keyword candidate words in the input sentence memory 22. The process then returns to step S1.

In step S5, the control unit of the keyword extraction device 1 acquires the video data of the program being played.
In step S6, the control unit captures an image from the acquired video data.

In step S7, the object detection unit 15 performs processing to detect objects in the captured image.
In step S8, the control unit determines whether an object has been detected. If the determination is YES, steps S9 and S12 are executed in parallel; if it is NO, the process returns to step S5.

In step S9, the mask sentence generation unit 12 retrieves the subtitle text from the input sentence memory 22 and creates a list of mask sentences in multiple patterns, one for each masked keyword candidate word.
In step S10, the mask estimation unit 13 estimates the word that is masked in each mask sentence.
In step S11, the relevance score calculation unit 17 selects a predetermined number of words with the highest estimated values and converts them into distributed representation vectors. The process then proceeds to step S15.

In step S12, the saliency estimation unit 14 performs saliency estimation on each pixel of the captured image and assigns a saliency score.
In step S13, the high-saliency object determination unit 16 determines the high-saliency object, that is, the most salient of the detected objects.
In step S14, the relevance score calculation unit 17 converts the class name of the determined high-saliency object into a distributed representation vector.

In step S15, the relevance score calculation unit 17 calculates the cosine similarity between the distributed representation vector of each estimated word and the distributed representation vector of the class name of the high-saliency object, and takes the average as the relevance score.
In step S16, the keyword output unit 18 determines the keyword candidate word at the masked position with the highest relevance score as the keyword related to the high-saliency object and outputs it.

FIG. 3 is a diagram illustrating the relationship between keyword candidate words and detected objects in the present embodiment.
In this example, two objects are detected in the captured image of the program, and the class names "human" and "cake" are obtained for them.

At this time, the input sentence "Mr. Tanaka's lunch today is French toast." has been acquired from the subtitles of the program. Two keyword candidate words, "Tanaka" and "French toast", are extracted from this subtitle text.

Keyword candidate words obtained from subtitle text are specific names, whereas the class names of detected objects are more abstract, so the two often do not match.
Therefore, when "cake" is the high-saliency object, the relevance score determines whether the corresponding specific keyword is "Tanaka" or "French toast".

FIG. 4 is a diagram illustrating the procedure for determining a keyword by relevance score in the present embodiment.
From the subtitle text "Mr. Tanaka's lunch today is French toast.", one mask sentence is generated with the keyword candidate word "Tanaka" masked and another with "French toast" masked, and the word at each masked position is estimated.

At the position of "Tanaka", words such as "Sato", "Suzuki", and "Watanabe" receive high estimated values, and at the position of "French toast", words such as "bento", "onigiri", and "sandwich" receive high estimated values.
The high-saliency object "cake" is more similar to "bento", "onigiri", "sandwich", and the like than to "Sato", "Suzuki", "Watanabe", and the like. The keyword candidate word at the corresponding masked position, "French toast", is therefore determined as the keyword related to the program video.

本実施形態によれば、キーワード抽出装置１は、入力文におけるキーワード候補語をマスクした際に推定される単語の推定値に基づいて、映像から検出される物体のクラス名とキーワード候補語との関連度スコアを算出する。
これにより、キーワード抽出装置１は、字幕文などの入力文から映像に関連した重要な単語をキーワードとして適切に抽出できる。 According to the present embodiment, the keyword extraction device 1 sets the class name of the object detected from the video and the keyword candidate word based on the estimated value of the word estimated when the keyword candidate word in the input sentence is masked. Calculate the relevance score.
As a result, the keyword extraction device 1 can appropriately extract important words related to the video as keywords from the input sentence such as the subtitle sentence.

キーワード抽出装置１は、直近のマスク文に過去の入力文を加えた文章でマスク箇所を推定する。
これにより、例えば、「今日は［ＭＡＳＫ］を注文しました。」というマスク文では、マスク箇所の推定が難しいのに対して、「美味しそうな中華料理屋があったので入ります。」という入力文を加えた文章を用いることにより、マスク箇所が食べ物であること、さらに具体的に中華料理の単語であることが精度良く推定される。
このように、キーワード抽出装置１は、推定結果の精度を向上でき、検出された物体との関連度を適切に評価できる。 The keyword extraction device 1 estimates the masked portion with a sentence obtained by adding a past input sentence to the latest mask sentence.
As a result, for example, in the mask sentence "I ordered [MASK] today.", It is difficult to estimate the mask location, but the input "I entered because there was a Chinese restaurant that looked delicious." By using the sentence with the sentence added, it can be accurately estimated that the masked part is food, and more specifically, it is a word of Chinese food.
In this way, the keyword extraction device 1 can improve the accuracy of the estimation result and can appropriately evaluate the degree of association with the detected object.

キーワード抽出装置１は、キーワード候補語の一つのみをマスクした文章において、マスク箇所の単語を推定することにより、マスク箇所に対する推定結果の精度を向上でき、検出された物体との関連度を適切に評価できる。 The keyword extraction device 1 can improve the accuracy of the estimation result for the masked part by estimating the word of the masked part in the sentence in which only one of the keyword candidate words is masked, and the degree of relevance to the detected object is appropriate. Can be evaluated.

キーワード抽出装置１は、物体を検出したタイミングと同期して、映像に伴う直近の入力文からマスク文を生成することにより、関連度の高いキーワードを適切に抽出できる。 The keyword extraction device 1 can appropriately extract highly relevant keywords by generating a mask sentence from the latest input sentence accompanying the video in synchronization with the timing when the object is detected.

キーワード抽出装置１は、関連度スコアの最大値が閾値に満たない場合にキーワードを出力しないことにより、入力文が映像と関連しない場合に、不適切なキーワードを出力することを抑制できる。 By not outputting the keyword when the maximum value of the relevance score is less than the threshold value, the keyword extraction device 1 can suppress the output of an inappropriate keyword when the input sentence is not related to the video.

The keyword extraction device 1 assigns a saliency score to each pixel of the video and determines the object in the region with the highest evaluation based on the saliency score as the high-saliency object.
As a result, the keyword extraction device 1 can appropriately output an important keyword related to the most salient object in the video.
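As a rough sketch of this selection, the code below averages the per-pixel scores inside each detected bounding box and picks the class with the highest average, which is the evaluation the description attributes to the high-saliency object determination unit 16. The function names, the half-open box convention `(x0, y0, x1, y1)`, and the toy score map are assumptions for illustration only.

```python
def mean_in_box(score_map, box):
    """Average the per-pixel scores inside a bounding box.

    score_map: list of rows, each a list of floats.
    box: (x0, y0, x1, y1), half-open pixel coordinates (assumed convention).
    """
    x0, y0, x1, y1 = box
    values = [score_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(values) / len(values)

def pick_salient_object(score_map, detections):
    """Return the class name of the detection with the highest mean score.

    detections: list of (class_name, box) pairs (hypothetical layout).
    """
    return max(detections, key=lambda d: mean_in_box(score_map, d[1]))[0]

# Toy 4x4 saliency map: the upper-right quadrant is the most salient.
saliency = [
    [0.1, 0.1, 0.9, 0.9],
    [0.1, 0.1, 0.9, 0.9],
    [0.2, 0.2, 0.3, 0.3],
    [0.2, 0.2, 0.3, 0.3],
]
detections = [("dog", (0, 0, 2, 2)), ("cat", (2, 0, 4, 2))]
print(pick_salient_object(saliency, detections))
```

The same routine works unchanged when the score map holds gaze point scores instead of saliency scores.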

The keyword extraction device 1 converts the class name of the object and the estimated words into distributed representation vectors and calculates the relevance score from their cosine similarity.
As a result, the keyword extraction device 1 can efficiently calculate an appropriate relevance score.
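A minimal sketch of the cosine-similarity score follows. Per the claims, the score is the average cosine similarity between the class-name vector and the vectors of a predetermined number of words with the highest estimated values. The function names, the `top_n` parameter, the handling of words missing from the vector table, and the two-dimensional toy vectors are assumptions; a real system would look the vectors up in the distributed representation vector DB 26.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def relevance_score(class_name, predictions, vectors, top_n=3):
    """Average cosine similarity between the class-name vector and the
    vectors of the top_n words predicted for one masked position.

    predictions: dict word -> estimated value from the mask estimation.
    vectors: dict word -> vector (stand-in for the vector DB).
    Words without a vector are skipped (an assumption of this sketch).
    """
    top_words = sorted(predictions, key=predictions.get, reverse=True)[:top_n]
    sims = [cosine(vectors[class_name], vectors[w]) for w in top_words if w in vectors]
    return sum(sims) / len(sims) if sims else 0.0

vectors = {"dog": [1.0, 0.0], "puppy": [1.0, 0.0], "car": [0.0, 1.0]}
predictions = {"puppy": 0.6, "car": 0.3}
score = relevance_score("dog", predictions, vectors, top_n=2)
print(score)
```

Averaging over several predicted words, rather than using only the top prediction, makes the score less sensitive to a single unlucky estimate.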

[Second Embodiment]
A second embodiment of the present invention is described below.
The second embodiment differs from the first embodiment in the function of the relevance score calculation unit 17, and a representative object database (DB) 27 is provided in place of the distributed representation vector DB 26.

For each class detected by the object detection unit 15, the names of objects belonging to that class are prepared in advance and registered in the representative object DB 27.
For example, a plurality of representative objects (for example, about 10) such as dog, cat, horse, and sheep are registered for the class "animal".

For each of the plurality of representative object names belonging to the class selected as the high-saliency object, the relevance score calculation unit 17 acquires the estimated value that the mask estimation unit 13 assigned to the identical word, and calculates the average of these values as the relevance score between the high-saliency object and the keyword candidate word.

The relevance score calculation unit 17 performs this calculation for each keyword candidate word and ranks the candidates by relevance score, and the output unit outputs the keyword candidate word with the highest relevance score.

FIG. 5 is a flowchart illustrating the flow of the keyword extraction method in the present embodiment.
Steps S1 to S10 are the same as in the first embodiment (FIG. 2): in response to the detection of an object from the video, words are estimated for each masked keyword candidate word in the subtitle sentence.
After step S10, the process proceeds to step S15a.

In step S12, the saliency estimation unit 14 performs saliency estimation on each pixel of the captured image and assigns a saliency score.
In step S13, the high-saliency object determination unit 16 determines, among the detected objects, the high-saliency object with the highest saliency.
In step S14a, the relevance score calculation unit 17 extracts the representative objects belonging to the class of the determined high-saliency object.

In step S15a, the relevance score calculation unit 17 extracts, from the estimated values of the masked word, the estimated values of the words identical to the respective representative objects, and takes their average as the relevance score.
In step S16, the keyword output unit 18 determines the keyword candidate word at the masked portion with the highest relevance score as the keyword related to the gaze object, and outputs it.

According to the present embodiment, the keyword extraction device 1 acquires, for each of the plurality of representative object names registered in advance for the class name of the detected object, the estimated value of the identical word, and calculates the average as the relevance score.
Thus, by characterizing each object detection class in advance with representative objects that are included in the vocabulary of the word estimation model, the keyword extraction device 1 can appropriately evaluate the degree of relevance using the estimated values of the words.
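The representative-object scoring of this embodiment can be sketched as below. The function name `representative_score`, the data layout, and the example words and values are hypothetical. Treating a representative name that is absent from the estimation output as having an estimated value of 0 is also an assumption of this sketch, since the description does not say how such a case is handled.

```python
def representative_score(predictions, representatives):
    """Average the estimated values of the words identical to each
    representative object name.

    predictions: dict word -> estimated value for one masked position.
    representatives: object names registered for the detected class.
    Missing words contribute 0.0 (an assumption of this sketch).
    """
    values = [predictions.get(name, 0.0) for name in representatives]
    return sum(values) / len(values) if values else 0.0

# Hypothetical entry of the representative object DB for the class "animal".
representatives = ["dog", "cat", "horse", "sheep"]

# Hypothetical estimated values per masked keyword candidate word.
per_candidate = {
    "puppy": {"dog": 0.4, "cat": 0.2, "toy": 0.1},
    "budget": {"money": 0.7, "plan": 0.1},
}
scores = {
    candidate: representative_score(predictions, representatives)
    for candidate, predictions in per_candidate.items()
}
best = max(scores, key=scores.get)
print(best, scores)
```

Because the score is read directly from the estimated values, this variant needs no separate word-vector table, only a class-to-names lookup whose entries exist in the vocabulary of the word estimation model.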

[Third Embodiment]
A third embodiment of the present invention is described below.
The keyword extraction device 1a of the third embodiment uses the user's gaze point in place of the saliency estimation technique of the first embodiment. Whereas the saliency score is an estimate of where users' gaze generally tends to gather, the keyword extraction device 1a estimates the user's gaze point with an eye tracker and thereby extracts keywords related to the object the user is actually looking at.

FIG. 6 is a diagram showing the functional configuration of the keyword extraction device 1a in the present embodiment.
In the third embodiment, the saliency estimation unit 14 of the first embodiment is replaced by a viewpoint detection unit 14a and a viewpoint position estimation unit 14b, and the high-saliency object determination unit 16 is replaced by a gaze object determination unit 16a (object determination unit).

The viewpoint detection unit 14a detects the user's viewpoint position from an eye tracker worn by the user (the program viewer). The viewpoint position can be detected by various conventional methods, such as photographing the eyeball with an infrared camera and estimating the viewpoint position from its movement.
The viewpoint detection unit 14a provides the viewpoint position estimation unit 14b with, as the detection result, data in which the coordinate information of the viewpoint position is attached to the video from the camera mounted on the eye tracker.

The viewpoint position estimation unit 14b collates the camera video, to which the coordinates of the user's viewpoint position acquired from the viewpoint detection unit 14a are attached, with the program video, and thereby assigns a gaze point score with a predetermined distribution to each pixel of the program video.

Here, the camera video is close to the user's field of view and includes the area outside the frame of the television or other device on which the program video is presented. It is therefore necessary to estimate where on the program video the user's viewpoint is located, or whether the user is not looking at the program video at all.
The viewpoint position estimation unit 14b first captures the program video and the eye tracker's camera video, and estimates the region of the program video within the camera video. For this region estimation, image features that cope with enlargement, reduction, and rotation of the image are used, such as the ORB features proposed in Reference G.

Reference G: E. Rublee et al., "ORB: An efficient alternative to SIFT or SURF", ICCV 2011.

When the program video is present in the camera video and its region is detected, the viewpoint position estimation unit 14b transforms the image of the detected region into a rectangle by a homography transformation. The viewpoint position estimation unit 14b likewise maps the coordinates of the viewpoint position onto the transformed rectangle and gives each pixel a gaze point score that follows a normal distribution centered on this point.
This yields a gaze point score in the range 0 to 1 for each pixel of the program video.
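The mapping and scoring steps can be illustrated with the sketch below. It assumes that a 3x3 homography matrix `H` from camera-video coordinates to program-video coordinates has already been estimated (for example from matched image features); the matrix values, the function names, the standard deviation `sigma`, and the choice to scale the Gaussian so that its peak equals 1 are all assumptions made so that the scores fall in the range 0 to 1.

```python
import math

def apply_homography(H, x, y):
    """Map a point (x, y) through a 3x3 homography given as nested lists."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xh / w, yh / w  # divide by the homogeneous coordinate

def gaze_score_map(width, height, gaze_xy, sigma):
    """Per-pixel gaze point scores: an isotropic Gaussian centered on the
    gaze point, scaled so that the peak value is 1 (assumed normalization)."""
    gx, gy = gaze_xy
    return [
        [math.exp(-((x - gx) ** 2 + (y - gy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

# Hypothetical homography: scale by 0.5, then shift by (-10, -5).
H = [[0.5, 0.0, -10.0],
     [0.0, 0.5, -5.0],
     [0.0, 0.0, 1.0]]

gaze_in_program = apply_homography(H, 30.0, 20.0)  # gaze point in camera coordinates
scores = gaze_score_map(8, 8, gaze_in_program, sigma=2.0)
print(gaze_in_program, scores[5][5])
```

A larger `sigma` spreads the score over a wider area, which may help when the eye tracker's estimate is noisy; the description leaves the distribution parameters open.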

The gaze object determination unit 16a uses the estimation result from the viewpoint position estimation unit 14b and the detection result from the object detection unit 15 to determine, among the plurality of objects detected from the video, the object that lies in the region with the highest evaluation based on the gaze point score and is therefore presumed to be gazed at by the user.
Specifically, like the high-saliency object determination unit 16 of the first embodiment, the gaze object determination unit 16a calculates the average gaze point score of the pixels within the rectangular region of each detected object, determines the object with the highest average as the "gaze object", and outputs the class name of this object.

The relevance score calculation unit 17 calculates the relevance score between the class name output from the gaze object determination unit 16a and the keyword candidate word at each masked portion, based on the per-word estimated values of the masked portion calculated by the mask estimation unit 13. That is, the relevance score calculation unit 17 calculates relevance scores for selecting, from among the keyword candidate words, the word most closely related to the gaze object.

As the specific method of calculating the relevance score, either the cosine similarity method using the distributed representation vector DB 26 of the first embodiment or the representative object method using the representative object DB 27 of the second embodiment can be applied.

The flow of the keyword extraction method in the present embodiment is that of the first embodiment (FIG. 2) or the second embodiment (FIG. 5), with the saliency estimation in step S12 replaced by viewpoint position estimation and the determination of the high-saliency object in step S13 replaced by determination of the gaze object.

According to the present embodiment, the keyword extraction device 1a assigns gaze point scores by estimating the user's viewpoint position, in place of the saliency scores within the video.
As a result, the keyword extraction device 1a can appropriately output an important keyword related to the object the user is actually gazing at.

Although embodiments of the present invention have been described above, the present invention is not limited to these embodiments. The effects described in the embodiments are merely a list of the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiments.

In the embodiments described above, the keyword extraction device 1 is incorporated in the robot. However, this is not a limitation: the device may be placed outside the robot and communicatively connected to it by wire, wirelessly, or via a network.
Likewise, the various databases, learning models, and the like are provided in the keyword extraction device 1 in the description above, but they may instead be placed on an external server such as a cloud server.

The embodiments mainly describe the configuration and operation of the keyword extraction device 1, but the present invention is not limited to this and may be configured as a method or a program that includes the respective components and extracts keywords.

Furthermore, the functions of the keyword extraction device 1 may be realized by recording a program for realizing those functions on a computer-readable recording medium, and causing a computer system to read and execute the program recorded on the recording medium.

The "computer system" here includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system.

The "computer-readable recording medium" may further include a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as the volatile memory inside the computer system serving as the server or client in that case. The program may realize only some of the functions described above, or may realize them in combination with a program already recorded in the computer system.

1, 1a Keyword extraction device
11 Keyword candidate word extraction unit
12 Mask sentence generation unit
13 Mask estimation unit
14 Saliency estimation unit
14a Viewpoint detection unit
14b Viewpoint position estimation unit
15 Object detection unit
16 High-saliency object determination unit
16a Gaze object determination unit
17 Relevance score calculation unit
18 Keyword output unit
21 Keyword database
22 Input sentence memory
23 Word estimation model
24 Saliency estimation model
25 Object detection model
26 Distributed representation vector database
27 Representative object database

Claims

A keyword extraction device comprising:
a keyword candidate word extraction unit that extracts, as keyword candidate words, words registered in a keyword database from an input sentence accompanying a video;
a mask sentence generation unit that generates a mask sentence in which the keyword candidate words in the input sentence are masked;
a mask estimation unit that calculates, by a learning model, an estimated value for each word estimated for the masked keyword candidate word in the mask sentence;
an object determination unit that outputs the class name of an object detected from the video;
a relevance score calculation unit that calculates a relevance score between the class name and the keyword candidate word at each masked portion, based on the per-word estimated values of the masked portion calculated by the mask estimation unit; and
a keyword output unit that outputs the keyword candidate word with the highest relevance score as a keyword related to the video.

The keyword extraction device according to claim 1, wherein the mask estimation unit calculates the estimated value in a sentence obtained by adding a past input sentence to the mask sentence.

The keyword extraction device according to claim 1 or 2, wherein the mask sentence generation unit generates as many mask sentences as there are keyword candidate words by masking only one of the keyword candidate words included in the input sentence at a time, and
the mask estimation unit calculates the estimated value for each word in text containing only one of the mask sentences.

The keyword extraction device according to any one of claims 1 to 3, wherein the mask sentence generation unit generates the mask sentence from the latest input sentence at the timing when an object is detected from the video.

The keyword extraction device according to any one of claims 1 to 4, wherein the keyword output unit does not output a keyword related to the video when the maximum value of the relevance score does not reach a predetermined threshold value.

The keyword extraction device according to any one of claims 1 to 5, further comprising a saliency estimation unit that assigns, by a learning model, a saliency score to each pixel of the video,
wherein the object determination unit outputs the class name of the object that, among a plurality of objects detected from the video, lies in the region with the highest evaluation based on the saliency score.

The keyword extraction device according to any one of claims 1 to 5, further comprising a viewpoint position estimation unit that assigns a gaze point score with a predetermined distribution to each pixel of the video by collating a camera video, to which the coordinates of the user's viewpoint position are attached, with the video,
wherein the object determination unit outputs the class name of the object that, among a plurality of objects detected from the video, lies in the region with the highest evaluation based on the gaze point score.

The keyword extraction device according to any one of claims 1 to 7, wherein the relevance score calculation unit calculates the cosine similarity between the distributed representation vector of the class name and the distributed representation vector of each of a predetermined number of words having the highest estimated values, and calculates the average of the cosine similarities as the relevance score.

The keyword extraction device according to any one of claims 1 to 7, wherein the relevance score calculation unit acquires, for each of a plurality of representative object names registered in advance for the class name, the estimated value of the identical word, and calculates the average as the relevance score.

A keyword extraction method in which a computer executes:
a keyword candidate word extraction step of extracting, as keyword candidate words, words registered in a keyword database from an input sentence accompanying a video;
a mask sentence generation step of generating a mask sentence in which the keyword candidate words in the input sentence are masked;
a mask estimation step of calculating, by a learning model, an estimated value for each word estimated for the masked keyword candidate word in the mask sentence;
an object determination step of outputting the class name of an object detected from the video;
a relevance score calculation step of calculating a relevance score between the class name and the keyword candidate word at each masked portion, based on the per-word estimated values of the masked portion calculated in the mask estimation step; and
a keyword output step of outputting the keyword candidate word with the highest relevance score as a keyword related to the video.

A keyword extraction program for operating a computer as the keyword extraction device according to any one of claims 1 to 9.