JP2020035149A

JP2020035149A - Moving image data processing device, and program

Info

Publication number: JP2020035149A
Application number: JP2018160578A
Authority: JP
Inventors: Toshihiko Yamazaki (山崎 俊彦); Shunsuke Nakamura (中村 遵介)
Original assignee: University of Tokyo NUC
Current assignee: University of Tokyo NUC
Priority date: 2018-08-29
Filing date: 2018-08-29
Publication date: 2020-03-05
Also published as: WO2020045527A1

Abstract

PROBLEM TO BE SOLVED: To provide a moving image data processing device, and a program, capable of obtaining an evaluation of moving image data at reduced cost. SOLUTION: A moving image data processing device receives N pieces of still image information (N is an integer of two or more) sampled from the moving image information contained in the moving image data to be processed, and extracts a feature quantity from each piece. Using a machine learning device that has learned image relation weight information for the feature quantities corresponding to the N pieces of still image information, the device multiplies each feature quantity by its learned image relation weight information, accumulates the weighted feature quantities, and outputs the accumulated result as the feature quantity of the moving image information contained in the moving image data. SELECTED DRAWING: Figure 1

Description

The present invention relates to a moving image data processing device and a program.

Conventionally, evaluating the effect of moving image data such as an advertisement video (commercial film: CF) has required actually showing the video to viewers and collecting their impressions, for example by questionnaire.

For advertisement videos, the audience rating of the program being broadcast when the advertisement is shown, or the number of times the advertisement is aired, is sometimes used as an evaluation index that expresses the effect in a simple way.

Non-Patent Document 1: Z. Hussain, et al., "Automatic understanding of image and video advertisements", In CVPR, 1100-1110, 2017.

However, in the conventional methods described above, for example the method using a questionnaire, obtaining information from the surveyed viewers is costly, and enlarging the set of surveyed viewers to improve the accuracy of the questionnaire incurs an enormous cost.

In addition, such simple evaluation indices do not necessarily correlate strongly with actual evaluation results.

Against this background, there is a demand for a technique that obtains an evaluation of moving image data at reduced cost, before the moving image data is presented to viewers. Non-Patent Document 1 discloses topic analysis of advertisement images.

The present invention has been made in view of the above circumstances, and one of its objects is to provide a moving image data processing device, and a program, that can obtain an evaluation of moving image data at reduced cost.

One aspect of the present invention for solving the problems of the conventional examples above is a moving image data processing device including: receiving means for receiving each of N pieces of still image information (N is an integer of 2 or more) obtained by sampling the moving image information contained in the moving image data to be processed; feature extracting means for taking the received N pieces of still image information as input and extracting a feature amount of each piece; accumulating means for using a machine learning device that has machine-learned image relation weight information for the feature amounts corresponding to the N sampled pieces of still image information, multiplying each feature amount by the image relation weight information so obtained, and accumulating the results; and output means for outputting the result of the accumulation as the feature amount of the moving image information contained in the moving image data, wherein the feature amount of the moving image information is subjected to predetermined processing relating to the evaluation of the moving image data to be processed.

According to the present invention, an evaluation of moving image data can be obtained at reduced cost.

FIG. 1 is a block diagram illustrating a configuration example of a moving image data processing device according to an embodiment of the present invention.
FIG. 2 is a functional block diagram illustrating an example of the moving image data processing device according to the embodiment of the present invention.
FIG. 3 is a functional block diagram illustrating another example of the moving image data processing device according to the embodiment of the present invention.

An embodiment of the present invention will be described with reference to the drawings. As illustrated in FIG. 1, a moving image data processing device 1 according to the embodiment includes a control unit 11, a storage unit 12, an operation unit 13, an output unit 14, and an interface unit 15.

The control unit 11 is a program-controlled device including at least one CPU, GPU (Graphics Processing Unit), or the like, and operates according to a program stored in the storage unit 12. In the present embodiment, the control unit 11 receives the moving image data to be processed and acquires N pieces of still image information (N is an integer of 2 or more) by sampling the moving image information contained in the received moving image data.

Specifically, the control unit 11 samples the moving image information (which consists of the still image information forming a series of frames) by extracting still image information at predetermined timings. The predetermined timing may be, for example, one frame of still image information per second.
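The one-frame-per-second sampling described above can be sketched as follows. This is only an illustrative sketch: the function name sample_frame_indices and its parameters are hypothetical, and it assumes a constant frame rate and that the frame shown at playback time t is the one with index floor(t × fps).

```python
import math


def sample_frame_indices(duration_s, fps, interval_s=1.0):
    """Return the indices of the frames shown at 0, interval_s, 2*interval_s, ... seconds.

    Assumes a constant frame rate (an assumption of this sketch, not of the source).
    """
    if duration_s <= 0 or fps <= 0 or interval_s <= 0:
        raise ValueError("duration_s, fps and interval_s must be positive")
    total_frames = int(math.floor(duration_s * fps))
    indices = []
    k = 0
    while True:
        # Frame displayed at playback time k * interval_s.
        index = int(math.floor(k * interval_s * fps))
        if index >= total_frames:
            break
        indices.append(index)
        k += 1
    return indices


# A 15-second video at 30 frames per second yields N = 15 sampled frames.
frame_indices = sample_frame_indices(15.0, 30.0)
```

Under these assumptions N simply equals the number of whole sampling intervals that fit in the video, so N varies with the length of the input.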

The control unit 11 extracts a feature amount from each of the N sampled pieces of still image information. This feature amount may be, for example, the output itself of a neural network machine-learned in advance when the corresponding still image information is input to it.

The control unit 11 further uses a machine learning device that has machine-learned image relation weight information for the feature amounts of the still image information obtained here, multiplies each feature amount by the image relation weight information so obtained, and accumulates the results. The control unit 11 then outputs the result of the accumulation as the feature amount of the moving image information contained in the moving image data.

In one aspect of the present embodiment, the feature amount of the moving image information obtained here is subjected to predetermined processing relating to the evaluation of the input moving image data. The operation of the control unit 11 is described in detail later.

The storage unit 12 is a memory device or the like and holds the program executed by the control unit 11. This program may be provided stored on a computer-readable, non-transitory recording medium and copied to the storage unit 12. The storage unit 12 also operates as a work memory of the control unit 11.

The operation unit 13 is a keyboard, a mouse, or the like; it accepts the user's instruction operations and outputs them to the control unit 11. The output unit 14 is a display or the like and outputs information according to instructions input from the control unit 11.

The interface unit 15 is a serial interface such as USB (Universal Serial Bus), a network interface, or the like; it receives moving image data from an external device or the like and outputs it to the control unit 11.

Next, the operation of the control unit 11 will be described. As illustrated in FIG. 2, the control unit 11 of the present embodiment functionally includes a still image information receiving unit 21, a plurality of feature amount extractors 22-1, 22-2, ..., 22-N, an image relation weight multiplying unit 23, an accumulating unit 24, an audio feature amount extraction unit 25, a metadata feature amount extraction unit 26, a second weight multiplying unit 27, and an output control unit 29.

The still image information receiving unit 21 receives and holds the N pieces of still image information obtained by sampling the moving image data to be processed at predetermined timings (for example, every second), and outputs the i-th piece of still image information to the corresponding feature amount extractor 22-i. In the present embodiment, the moving image data includes moving image information and audio information.

Each feature amount extractor 22-i (i = 1, 2, ..., N) extracts the feature amount of the i-th of the N pieces of still image information received by the still image information receiving unit 21. In one example of the present embodiment, the feature amount extractor 22-i is a neural network with at least one hidden layer that has been machine-learned in advance, using a plurality of predetermined image data, to output their feature amounts.

Specifically, the feature amount extractor 22-i may be a ResNet (Kaiming He, et al., Deep Residual Learning for Image Recognition, arXiv:1512.03385) machine-learned using ImageNet (http://www.image-net.org/), a predetermined set of image data, or a fully connected neural network with one hidden layer obtained by distilling such an ImageNet-trained ResNet (that is, machine-learned so that, given the same input data as the ResNet, its output matches the ResNet's output, which serves as the teacher). In this example, all the feature amount extractors 22-i use the same neural network (with the same machine learning result).

In this example, the input of the feature amount extractor 22-i is an image of a predetermined size (for example, 224 × 224 pixels), and the output is, for example, 256-dimensional vector information. This output corresponds to the feature amount of the still image information of the present invention.

The image relation weight multiplying unit 23 obtains image relation weight information αi (i = 1, 2, ..., N) using a machine learning device that has machine-learned how to calculate the image relation weight information αi for the feature amounts fi (i = 1, 2, ..., N) of the N pieces of still image information output by the N feature amount extractors 22-i (i = 1, 2, ..., N).

The image relation weight multiplying unit 23 also multiplies each of the feature amounts fi of the N pieces of still image information output by the N feature amount extractors 22-i (i = 1, 2, ..., N) by the corresponding image relation weight information αi obtained with the machine learning device, and outputs the products αi·fi. How the calculation of the image relation weight information is machine-learned is described later.

The accumulating unit 24 accumulates the weighted feature amounts αi·fi output by the image relation weight multiplying unit 23 and outputs the accumulation result Σ αi·fi as the feature amount Fframe of the still image information contained in the moving image information. In one example of the present embodiment, the feature amount Fframe is represented as 256-dimensional vector information.
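The weighting and accumulation performed by units 23 and 24 can be illustrated with a small sketch. The text does not specify how the machine learning device computes αi, so the scoring scheme below (a dot product with a learned vector w followed by a softmax) is purely an assumption for illustration; the names image_relation_weights and accumulate are hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def image_relation_weights(features, w):
    """Assumed weight model: score each frame feature f_i by w . f_i, then softmax.

    The source only says the weights come from a machine-learned calculation;
    this particular form is an assumption of the sketch.
    """
    scores = [sum(wj * fj for wj, fj in zip(w, f)) for f in features]
    return softmax(scores)


def accumulate(features, alphas):
    """Compute F_frame = sum_i alpha_i * f_i, dimension by dimension."""
    dim = len(features[0])
    return [sum(a * f[d] for a, f in zip(alphas, features)) for d in range(dim)]


# Two toy 2-dimensional frame features (the example in the text uses 256 dimensions).
features = [[1.0, 0.0], [0.0, 1.0]]
alphas = image_relation_weights(features, w=[0.0, 0.0])
f_frame = accumulate(features, alphas)
```

Because the result is a weighted sum, Fframe has the same dimension as each fi regardless of N, which is what lets videos of different lengths share the later stages.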

The audio feature amount extraction unit 25 receives the audio information contained in the moving image data to be processed and extracts a feature amount of that audio information. In one example of the present embodiment, the feature amount of the audio information is the output obtained when the audio information is input to a neural network machine-learned in advance. Specifically, the audio feature amount extraction unit 25 includes a neural network obtained by machine-learning SoundNet (Yusuf Aytar, et al., SoundNet: Learning Sound Representations from Unlabeled Video, arXiv:1610.09001) on a predetermined data set such as UrbanSound8K (https://serv.cusp.nyu.edu/projects/urbansounddataset/urbansound8k.html). The audio information need not be used, in which case the audio feature amount extraction unit 25 is not required.

In this example, the audio feature amount extraction unit 25 outputs, as the feature amount Fsound of the audio information, the output obtained when the audio information contained in the moving image data to be processed is input to this neural network.

The metadata feature amount extraction unit 26 extracts a feature amount of additional information (metadata) separately input by the user for the moving image data to be processed. In one example of the present embodiment, the additional information includes, for example, the survey date; the age group and gender (of the main intended viewers of the moving image data); the title; a text transcription of the narration audio; information indicating the category of the advertised product (when the moving image data is an advertisement); information indicating whether the video is part of a series; the provider of the moving image data (for an advertisement, the advertiser); the name of the product or service (being advertised, in the case of an advertisement); and other information. These are only examples. The metadata need not be used, in which case the metadata feature amount extraction unit 26 is not required.

The metadata feature amount extraction unit 26 can also be realized using a neural network machine-learned in advance by a predetermined method (a metadata neural network). Specifically, the metadata feature amount extraction unit 26 inputs vector information representing the metadata to the metadata neural network and outputs the result as the metadata feature amount Fmeta.
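The text does not say how the metadata is turned into "vector information". One common option, shown here only as an assumed sketch, is one-hot encoding of the categorical fields. The vocabularies CATEGORIES and AGE_GROUPS and the function encode_metadata are hypothetical and cover only three of the listed metadata items.

```python
# Hypothetical vocabularies; the source does not enumerate the possible values.
CATEGORIES = ("food", "beverage", "automobile", "other")
AGE_GROUPS = ("teens", "20s", "30s", "40s", "50+")


def one_hot(value, vocabulary):
    """Return a vector with 1.0 at the position of value and 0.0 elsewhere."""
    if value not in vocabulary:
        raise ValueError("unknown value: %r" % (value,))
    return [1.0 if v == value else 0.0 for v in vocabulary]


def encode_metadata(category, age_group, is_series):
    """Concatenate one-hot encodings and a binary flag into a single input vector."""
    series_flag = [1.0 if is_series else 0.0]
    return one_hot(category, CATEGORIES) + one_hot(age_group, AGE_GROUPS) + series_flag


# Vector that would be fed to the metadata neural network in this sketch.
meta_vector = encode_metadata("beverage", "20s", True)
```

Free-text items such as the title or narration transcript would need a different encoding, which this sketch does not attempt.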

In one example of the present embodiment, the feature amount Fsound of the audio information output by the audio feature amount extraction unit 25 is 256-dimensional vector information, and the metadata feature amount Fmeta is also 256-dimensional vector information (that is, both have the same dimension as the feature amount Fframe of the still image information).

The second weight multiplying unit 27 multiplies the feature amount Fframe of the still image information output by the accumulating unit 24, the feature amount Fsound of the audio information output by the audio feature amount extraction unit 25, and the metadata feature amount Fmeta output by the metadata feature amount extraction unit 26 by second weight information β1, β2, and β3, respectively, and outputs the sum F = β1·Fframe + β2·Fsound + β3·Fmeta as the estimated feature amount of the moving image data.
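The fusion F = β1·Fframe + β2·Fsound + β3·Fmeta is an element-wise weighted sum of three vectors of equal dimension. A minimal sketch follows; the function name fuse and the toy values are hypothetical, and the β values are simply given here rather than computed by a learned method.

```python
def fuse(f_frame, f_sound, f_meta, betas):
    """Return F = beta1 * F_frame + beta2 * F_sound + beta3 * F_meta."""
    if not (len(f_frame) == len(f_sound) == len(f_meta)):
        raise ValueError("all three feature vectors must have the same dimension")
    beta1, beta2, beta3 = betas
    return [
        beta1 * x + beta2 * y + beta3 * z
        for x, y, z in zip(f_frame, f_sound, f_meta)
    ]


# Toy 2-dimensional vectors stand in for the 256-dimensional ones in the text.
fused = fuse([1.0, 2.0], [0.0, 1.0], [2.0, 0.0], betas=(0.5, 0.25, 0.25))
```

The requirement that all three inputs share one dimension is why the text fixes Fsound and Fmeta at the same 256 dimensions as Fframe.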

In one example of the present embodiment, the second weight multiplying unit 27 machine-learns in advance how to calculate the second weight information β1, β2, β3 for the feature amount Fframe of the still image information, the feature amount Fsound of the audio information, and the metadata feature amount Fmeta, and the second weight information β1, β2, β3 may be calculated each time according to the input information (the moving image data including audio information, and the metadata). How this calculation is machine-learned is described later.

The output control unit 29 obtains predetermined result information (scores) S based on the feature amount F output by the second weight multiplying unit 27 and outputs it to the output unit 14. The output control unit 29 can be realized, for example, using a neural network whose input is vector information of the same dimension as the feature amount F and whose output is vector information with as many dimensions as there are items of result information (scores) S. This neural network may be a fully connected neural network. How this neural network is machine-learned is described later.

In one example of the present embodiment, the scores (result information) output by the output control unit 29 include four values: the proportion of subjects who remember the moving image data to be processed (recognition); the proportion who view it favorably (favorability); when the moving image data is an advertisement, the proportion in whom it arouses a desire to purchase the advertised product or service (purchase arousal); and the proportion who take an interest in it (interest). These values need not be expressed as proportions (values from 0 to 1), but a larger value means a higher proportion. In this case, the vector output by the neural network used by the output control unit 29 is four-dimensional.
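As a rough illustration of the output control unit 29, the sketch below maps a fused feature vector F to the four scores with a single fully connected layer (S = W·F + b). The single-layer form, the name score_head, and the toy weights are assumptions; the text only says the network may be fully connected.

```python
SCORE_NAMES = ("recognition", "favorability", "purchase_arousal", "interest")


def score_head(feature, weights, biases):
    """Single fully connected layer: one output per score name.

    weights holds one row per score; each row has the same length as feature.
    """
    if len(weights) != len(SCORE_NAMES) or len(biases) != len(SCORE_NAMES):
        raise ValueError("expected one weight row and one bias per score")
    scores = {}
    for name, row, bias in zip(SCORE_NAMES, weights, biases):
        if len(row) != len(feature):
            raise ValueError("weight row length must match feature dimension")
        scores[name] = sum(w * x for w, x in zip(row, feature)) + bias
    return scores


# Toy 2-dimensional feature and hand-picked weights, for illustration only.
scores = score_head(
    [1.0, 2.0],
    weights=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]],
    biases=[0.0, 0.0, 0.0, 0.5],
)
```

No output squashing is applied, in line with the statement that the values need not be proportions between 0 and 1.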

Next, how the neural networks in each unit are machine-learned will be described. In the present embodiment, as in the example already described, the plurality of feature amount extractors 22 are machine-learned in advance using a predetermined data set of still image data. Likewise, the audio feature amount extraction unit 25 is machine-learned in advance using a predetermined data set of audio data. These machine learning results may be left unchanged in the processing below, or may be updated.

That is, each feature amount extractor 22 may be updated by backpropagation based on the difference between the result information that the moving image data processing device 1 outputs when given moving image data whose result information is known (or can be set) in advance, and the result information known (or set) in advance for that moving image data (hereinafter called teacher result information, for distinction). The neural network weights of the feature amount extractors 22 may be shared.

On the other hand, the weights applied by the image relation weight multiplying unit 23 and the second weight multiplying unit 27, and the neural networks of the metadata feature amount extraction unit 26 and the output control unit 29, are updated by backpropagation based on the difference between the result information that the moving image data processing device 1 outputs when given moving image data (and metadata) whose result information is known (or can be set) in advance, and the teacher result information for that moving image data. Because the accumulating unit 24 only accumulates, and the image relation weight multiplying unit 23 and the second weight multiplying unit 27 only multiply by weights and sum, backpropagation can be carried out through them.

That is, in one example of the present embodiment, a set of moving image data whose result information (teacher result information) is known in advance is prepared. Here, for example, the recognition, favorability, purchase arousal, and interest values that serve as teacher result information are set for each piece of moving image data through a known test conducted in advance (for example, a comparative test such as a so-called A/B test).

The user inputs each piece of moving image data in the set to the moving image data processing device 1 as teacher data. Using the difference between the resulting output (the result information) and the teacher result information set for the input moving image data, the calculation of the weights applied by the image relation weight multiplying unit 23 and the second weight multiplying unit 27, and the neural networks of the metadata feature amount extraction unit 26 and the output control unit 29, are machine-learned by known backpropagation. As a result, the weight calculations of the image relation weight multiplying unit 23 and the second weight multiplying unit 27, and the neural network weights of each unit, are in a machine-learned state.
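To make the training idea concrete, the sketch below performs one gradient step on a heavily simplified model. It is not the training procedure of the embodiment: it assumes a single scalar score, a fixed linear head vector, and β1, β2, β3 treated as directly trainable parameters (the text instead learns a method for calculating them). The names predict and sgd_step are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def predict(betas, modal_features, head):
    """Scalar score: head . (sum_k beta_k * F_k)."""
    return sum(b * dot(head, f) for b, f in zip(betas, modal_features))


def sgd_step(betas, modal_features, head, teacher_score, lr=0.05):
    """One gradient-descent step on the squared error (score - teacher_score)^2."""
    error = predict(betas, modal_features, head) - teacher_score
    # d(error^2)/d(beta_k) = 2 * error * (head . F_k), since the score is linear in beta_k.
    return [
        b - lr * 2.0 * error * dot(head, f)
        for b, f in zip(betas, modal_features)
    ]


# Toy stand-ins for F_frame, F_sound and F_meta, and a teacher score of 2.0.
modal_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head = [1.0, 1.0]
betas = [1.0 / 3.0] * 3
loss_before = (predict(betas, modal_features, head) - 2.0) ** 2
betas = sgd_step(betas, modal_features, head, teacher_score=2.0)
loss_after = (predict(betas, modal_features, head) - 2.0) ** 2
```

The gradient passes through the weighted sum unchanged in form, which reflects the point made in the text that multiplication and accumulation do not obstruct backpropagation.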

The user then inputs the moving image data to be evaluated to the moving image data processing device 1 as the processing target, and from the result information output obtains the recognition, favorability, purchase arousal, and interest values of that moving image data.

The moving image data processing device 1 according to the embodiment of the present invention has the above configuration and operates as follows. In the following example, the feature amount extractors 22-1, 22-2, ..., 22-N are all neural networks with the same weights. That is, each feature amount extractor 22 produces the same output when given the same input.

The audio feature amount extraction unit 25 is likewise assumed to have been machine-learned in advance using a widely known data set such as UrbanSound8K.

In this state, each piece of moving image data in a set whose result information (teacher result information) is known in advance is input to the moving image data processing device 1 as teacher data. Using the difference between the resulting output (the result information) and the teacher result information set for the input moving image data, the neural networks that express how to calculate the weights αi (i = 1, 2, ..., N) and β1, β2, β3 applied by the image relation weight multiplying unit 23 and the second weight multiplying unit 27, and the neural networks of the feature amount extractors 22, the metadata feature amount extraction unit 26, and the output control unit 29, are machine-learned by known backpropagation.

Next, the user inputs the moving image data actually to be evaluated to the moving image data processing device 1 as the processing target. From the moving image information contained in this moving image data, the moving image data processing device 1 extracts the N pieces of still image information that would be displayed at playback times of 0 seconds, 1 second, 2 seconds, and so on (sampling).

The control unit 11 of the moving image data processing device 1 receives the N pieces of still image information and inputs each to the corresponding one of the feature amount extractors 22-i (i = 1, 2, ..., N). For example, the still image information displayed at a playback time of i seconds is input to the feature amount extractor 22-i.

The feature amount extractor 22-i then outputs fi (i = 1, 2, ..., N), the output of its neural network (here, a ResNet trained on ImageNet, or a neural network obtained by distilling it) for the input still image information. The vector value that would be passed to the classifier may be used as this output.

The moving image data processing device 1 also extracts the audio information contained in the moving image data to be processed and inputs it to the audio feature amount extraction unit 25.

The audio feature amount extraction unit 25 then outputs, as the feature amount Fsound of the audio information, the output obtained when the input audio information is fed to the neural network machine-learned in advance. Here too, the vector value that would be passed to the classifier may be used.

The user also inputs metadata about the moving image data to be processed. The metadata may be, for example, the survey date; the age group and gender of the main intended viewers of the moving image data; the title; a text transcription of the narration audio; information indicating the category of the product advertised by the moving image data; information indicating whether or not the moving image data is part of a series; information identifying the advertiser that provides the moving image data; and the name of the advertised product or service.

Upon receiving the metadata, the control unit 11 of the moving image data processing device 1 functions as the metadata feature amount extraction unit 26 and outputs, as the metadata feature amount Fmeta, the output obtained when the metadata is input to a neural network that has been machine-learned in advance by a predetermined method (the neural network for metadata).

The control unit 11 further functions as the image relation weight multiplication unit 23. Using a calculation method obtained by machine learning, it obtains the image relation weight information αi (i = 1, 2, ..., N) corresponding to each of the feature amounts fi (i = 1, 2, ..., N) of the N pieces of still image information output by the N feature amount extractors 22-i (i = 1, 2, ..., N), and outputs the value αi·fi obtained by multiplying each feature amount fi by the corresponding image relation weight information αi. As the accumulation unit 24, the control unit 11 accumulates the weighted feature amounts αi·fi and outputs the accumulation result Σαi·fi as the feature amount Fframe of the still image information included in the moving image information.
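The weighted accumulation Fframe = Σαi·fi can be sketched as follows. The description does not state how the learned calculation method produces the weights αi; the `softmax` helper below, which normalizes hypothetical per-frame scores into weights, is an assumption for illustration and not the method of the embodiment. The function name `accumulate_frame_features` is likewise hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Assumed way to turn per-frame scores into weights alpha_i that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def accumulate_frame_features(
    features: Sequence[Sequence[float]], alphas: Sequence[float]
) -> List[float]:
    """Compute F_frame = sum_i alpha_i * f_i over N per-frame feature vectors."""
    if not features or len(features) != len(alphas):
        raise ValueError("need one weight per feature vector")
    dim = len(features[0])
    f_frame = [0.0] * dim
    for f_i, a_i in zip(features, alphas):
        if len(f_i) != dim:
            raise ValueError("all feature vectors must have the same length")
        for d in range(dim):
            f_frame[d] += a_i * f_i[d]
    return f_frame
```

With three two-dimensional feature vectors [1, 0], [0, 1], [1, 1] and weights 0.5, 0.25, 0.25, the accumulated feature amount is [0.75, 0.5].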

The control unit 11 also functions as the second weight multiplication unit 27. Using a calculation method obtained by machine learning, it obtains second weight information β1, β2, and β3 for the feature amount Fframe of the still image information, the feature amount Fsound of the audio information, and the metadata feature amount Fmeta, respectively, and computes the values β1·Fframe, β2·Fsound, and β3·Fmeta by multiplying each feature amount by the corresponding second weight information. The control unit 11 then computes the accumulation of these values, β1·Fframe + β2·Fsound + β3·Fmeta, and outputs the result as the feature amount F of the moving image information.
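The second-stage accumulation F = β1·Fframe + β2·Fsound + β3·Fmeta can be sketched in the same way. The function name `fuse_modalities` is hypothetical, and the weights are passed in as given values because the description does not say how the learned calculation method derives them. Since the three vectors are added element by element, the sketch assumes that they share one dimensionality.

```python
from typing import List, Sequence


def fuse_modalities(
    f_frame: Sequence[float],
    f_sound: Sequence[float],
    f_meta: Sequence[float],
    betas: Sequence[float],
) -> List[float]:
    """Compute F = beta1 * F_frame + beta2 * F_sound + beta3 * F_meta element-wise."""
    if len(betas) != 3:
        raise ValueError("expected three second weights (beta1, beta2, beta3)")
    if not (len(f_frame) == len(f_sound) == len(f_meta)):
        # Assumption: the three feature amounts have equal length so they can be summed.
        raise ValueError("feature amounts must have the same length")
    b1, b2, b3 = betas
    return [b1 * a + b2 * s + b3 * m for a, s, m in zip(f_frame, f_sound, f_meta)]
```

For example, `fuse_modalities([1.0, 2.0], [3.0, 4.0], [5.0, 6.0], (0.5, 0.25, 0.25))` returns `[2.5, 3.5]`.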

The control unit 11 further functions as the output control unit 29: it inputs the feature amount F of the moving image information to a machine-learned neural network and outputs the output of that neural network as result information (score) S.

The score S output here is a vector value whose elements are values representing the degree of recognition, favorability, degree of purchase arousal, and degree of interest, corresponding to the teacher result information used when the neural network and the like of the output control unit 29 were machine-learned.

The user obtains these outputs, namely the degree of recognition, favorability, degree of purchase arousal, and degree of interest, as evaluation information for the moving image data that was processed.

[Evaluation of each frame included in the moving image information]
Furthermore, the moving image data processing device 1 of the present embodiment may evaluate at least some of the N pieces of still image information obtained by sampling the moving image data to be processed. Here, what is evaluated is the influence of each piece on the result information output by the moving image data processing device 1, such as the degree of recognition, favorability, degree of purchase arousal, and degree of interest.

In this example, as illustrated in FIG. 3, the still image information receiving unit 21 of the control unit 11 further includes, in addition to the functional configuration of FIG. 2, a selection instruction unit 31, a feature amount comparison unit 32, and a comparison output unit 33.

In the control unit 11 of this example, the still image information receiving unit 21 accepts and holds the N pieces of still image information obtained by sampling the moving image data to be processed at predetermined intervals (for example, every second).

The still image information receiving unit 21 then outputs the i-th piece of still image information to the corresponding feature amount extractor 22-i. The feature amount comparison unit 32 stores the score S output by the output control unit 29 at this time as the score S of the moving image information.

Next, the selection instruction unit 31 outputs an instruction to select, from among the N pieces of still image information held by the still image information receiving unit 21, at least one piece of still image information to be evaluated (hereinafter, M pieces of still image information, where M is an integer of 1 or more satisfying M < N). The still image information to be evaluated may be selected according to an instruction from the user, or M mutually different pieces may be selected in a predetermined order (for example, one at a time starting from the first) and the following processing repeated.

In accordance with the instruction output by the selection instruction unit 31, the still image information receiving unit 21 replaces the designated M pieces of still image information (M is an integer of 1 or more satisfying M < N) with predetermined test still image information. The test still image information may be still image information whose entire area is filled with a predetermined color, such as black. At this time, the weights α, β1, β2, and β3 are kept fixed at the values used when the stored score S was computed.

The still image information receiving unit 21 then outputs the i-th piece of still image information to the corresponding feature amount extractor 22-i. At this time, each feature amount extractor 22 corresponding to still image information that was not selected outputs the feature amount of the still image information sampled from the moving image data to be processed. Each feature amount extractor 22 corresponding to still image information that was selected outputs the feature amount of the test still image information in place of that of the still image information sampled from the moving image data to be processed. The feature amount comparison unit 32 takes the score of the moving image information output by the output control unit 29 at this time as a provisional score S′, computes its difference from the stored score, ΔS = S′ − S, and outputs the difference to the comparison output unit 33.

The comparison output unit 33 outputs the difference ΔS obtained here to the output unit 14 as the evaluation value of the selected still image information, together with information identifying the still image information selected at that time (information indicating which still image information in the sequence was selected), and presents it to the user.

According to this example of the present embodiment, the difference between the provisional score S′ obtained when the selected still image information is not included and the score S obtained when it is included, that is, the amount ΔS by which the evaluation rises (or falls) owing to the inclusion of the selected still image information, is presented as the evaluation value.

In the example in which the pieces are selected one at a time starting from the first and the processing is repeated, the difference between the provisional score S′ obtained when a given piece is not included and the score S obtained when it is included is presented for each of the N pieces of still image information sampled from the moving image information. That is, for each piece of still image information, the amount ΔS by which the evaluation rises (or falls) owing to its inclusion is presented as the evaluation value.
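The one-at-a-time evaluation described above can be sketched as a simple loop. The function name `frame_contributions` and the callable `score_fn` are hypothetical: `score_fn` stands in for the whole pipeline from the feature amount extractors through the output control unit 29, with the weights α, β1, β2, and β3 held fixed, and the frame representation is left abstract.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def frame_contributions(
    frames: Sequence[Frame],
    test_frame: Frame,
    score_fn: Callable[[Sequence[Frame]], Sequence[float]],
) -> List[List[float]]:
    """For each frame i, return delta_S = S' - S after replacing frame i with the test frame."""
    base = list(score_fn(frames))  # score S for the unmodified frames
    deltas = []
    for i in range(len(frames)):
        masked = list(frames)
        masked[i] = test_frame  # e.g. an all-black image
        provisional = score_fn(masked)  # provisional score S'
        deltas.append([sp - s for sp, s in zip(provisional, base)])
    return deltas
```

With ΔS defined as S′ − S, a negative element means that the score dropped when the frame was replaced, so the original frame was raising that element of the score; a positive element means the opposite. As a toy check, using numbers as frames and `score_fn = lambda fr: [sum(fr)]`, the frames `[1.0, 3.0, 2.0]` with test frame `0.0` give `[[-1.0], [-3.0], [-2.0]]`.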

[Configuration of the control unit]
In the present embodiment, the units such as the plurality of feature amount extractors 22-1, 22-2, ..., 22-N, the image relation weight multiplication unit 23, the accumulation unit 24, the audio feature amount extraction unit 25, the metadata feature amount extraction unit 26, the second weight multiplication unit 27, and the output control unit 29 may be realized by a single control unit 11 sequentially performing the corresponding processing, or the control unit 11 may include a plurality of GPUs or the like, with the GPUs operating in parallel to perform the processing.

[Effects of the embodiment]
According to the present embodiment, when moving image data for an advertisement is created, for example, it is possible to know how much advertising effect it will have without actually broadcasting it and without requiring work such as questionnaires. The evaluation of moving image data can therefore be obtained by a method that keeps costs down.

In addition, since it is possible to estimate which scenes of the moving image information have an effect in the favorable direction and which have an effect in the unfavorable direction, the moving image information can also be reviewed easily.

REFERENCE SIGNS LIST: 1 moving image data processing device, 11 control unit, 12 storage unit, 13 operation unit, 14 output unit, 15 interface unit, 21 still image information receiving unit, 22 feature amount extractor, 23 image relation weight multiplication unit, 24 accumulation unit, 25 audio feature amount extraction unit, 26 metadata feature amount extraction unit, 27 second weight multiplication unit, 29 output control unit, 31 selection instruction unit, 32 feature amount comparison unit, 33 comparison output unit.

Claims

A moving image data processing device comprising:
receiving means for receiving each of N pieces (N is an integer of 2 or more) of still image information obtained by sampling moving image information included in moving image data to be processed;
feature amount extraction means for taking the received N pieces of still image information as input and extracting a feature amount of each piece of still image information;
accumulating means for using a machine learning device in which image relation weight information relating to the extracted feature amount corresponding to each of the N pieces of still image information obtained by the sampling has been machine-learned, multiplying each feature amount by the image relation weight information obtained through the machine learning device, and accumulating the results; and
output means for outputting the result of the accumulation as a feature amount of the moving image information included in the moving image data,
wherein the feature amount of the moving image information is used for predetermined processing relating to evaluation of the moving image data to be processed.

The moving image data processing device according to claim 1, wherein
the moving image data to be processed further includes audio information,
the device further comprising:
audio feature amount extraction means for extracting a feature amount of the audio information included in the moving image data to be processed; and
second accumulating means for using a second machine learning device in which second weight information corresponding to each of the feature amount of the moving image information and the feature amount of the audio information has been machine-learned, multiplying each feature amount by the second weight information obtained through the second machine learning device, and accumulating the results,
wherein the result of the accumulation is used for the predetermined processing relating to evaluation of the moving image data to be processed.

The moving image data processing device according to claim 2, further comprising:
means for receiving an input of metadata relating to the moving image data to be processed; and
metadata feature amount extraction means for extracting a feature amount of the metadata relating to the moving image data to be processed,
wherein the second accumulating means uses a second machine learning device in which second weight information corresponding to each of the feature amount of the moving image information, the feature amount of the audio information, and the feature amount of the metadata has been machine-learned, multiplies each feature amount by the second weight information obtained through the second machine learning device, and accumulates the results, and
the result of the accumulation is used for the predetermined processing relating to evaluation of the moving image data to be processed.

The moving image data processing device according to any one of claims 1 to 3, further comprising:
means for replacing M pieces of still image information (M is an integer of 1 or more satisfying M < N) selected from the N pieces of still image information obtained by the sampling with a predetermined test still image, inputting the result to the accumulating means, and obtaining a provisional feature amount by multiplying each feature amount by the image relation weight information obtained by the machine learning and accumulating the results; and
means for calculating and outputting an evaluation value of the selected still image information on the basis of a difference between the feature amount of the moving image information and the provisional feature amount.

A program for causing a computer to function as:
receiving means for receiving each of N pieces (N is an integer of 2 or more) of still image information obtained by sampling moving image information included in moving image data to be processed;
feature amount extraction means for taking the received N pieces of still image information as input and extracting a feature amount of each piece of still image information;
accumulating means for using a machine learning device in which image relation weight information relating to the extracted feature amount corresponding to each of the N pieces of still image information obtained by the sampling has been machine-learned, multiplying each feature amount by the image relation weight information obtained through the machine learning device, and accumulating the results; and
output means for outputting the result of the accumulation as a feature amount of the moving image information included in the moving image data,
wherein the feature amount of the moving image information is used for predetermined processing relating to evaluation of the moving image data to be processed.