JP2019139479A

JP2019139479A - Program, device, and method for estimating context using a plurality of recognition engines

Info

Publication number: JP2019139479A
Application number: JP2018021847A
Authority: JP
Inventors: Kazuyuki Tasaka; Hiromasa Yanagihara
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2018-02-09
Filing date: 2018-02-09
Publication date: 2019-08-22
Anticipated expiration: 2038-02-09
Also published as: JP6875058B2

Abstract

To provide a program, a device, and a method capable of improving context recognition accuracy by automatically selecting one or more optimal recognition engines in accordance with input data. SOLUTION: A recognition device 1 includes: a selection engine 11 that selects recognition engines for the input data to be estimated, using a learning model trained on training data in which input data is associated with identifiers of recognition engines, and outputs the input data to the selected recognition engines; and a recognition score determination unit 13 that determines, for the input data, the recognition engines whose recognition score calculated by the recognition engines 121, 122 is equal to or greater than a recognition threshold, and feeds the identifiers of those recognition engines back to the selection engine 11. The selection engine 11 retrains the learning model on training data in which the input data is associated with the fed-back identifiers of the recognition engines. SELECTED DRAWING: Figure 2

Description

The present invention relates to a technique for estimating a context using a plurality of recognition engines.

In recent years, the use of deep learning has dramatically improved recognition accuracy in object recognition and human action recognition.
For example, there is a technique that takes a specific data set as input and compares candidate machine learning algorithms (see, for example, Patent Document 1). This technique aggregates the performance results of each machine learning model, so that the evaluations of the machine learning models can be compared automatically.

As a recognition engine for video data, there is, for example, an object recognition technique that detects objects appearing in an RGB image (see, for example, Non-Patent Document 1).
There is also a moving object recognition technique that detects the movement of an object from motion features (optical flow) (see, for example, Non-Patent Document 2). For example, Two-stream ConvNets use a spatial CNN (Spatial stream ConvNet) and a temporal CNN (Temporal stream ConvNet) to extract both the appearance features of objects and backgrounds in an image and the motion features in the sequences of horizontal and vertical components of the optical flow. By integrating both kinds of features, actions are recognized with high accuracy.
Further, there is a technique for recognizing a person's action from three-dimensional video (see, for example, Non-Patent Document 3).
Further, there is a technique for recognizing a person's action by extracting skeleton information on human joints and the parts that connect them (see, for example, Non-Patent Document 4).

As another application field, in the autonomous operation of robots there is a technique for executing reinforcement learning of a hierarchical learning model (see, for example, Non-Patent Document 5). There is also a technique for selecting, according to the input data, the expert networks, which are the machine learning networks used by a gate network (see, for example, Non-Patent Document 6).

FIG. 1 is a system configuration diagram including a recognition device.

In the system of FIG. 1, the recognition device 1 functions as a server connected to the Internet. The recognition device 1 has a recognition engine whose learning model has been built in advance from training data. When the recognition engine recognizes a person's action, the training data associates, in advance, video data showing a person's action with the target of that action (context).

Each terminal 2 is equipped with a camera and transmits video data capturing a person's action to the recognition device 1. The terminal 2 is a smartphone or mobile terminal carried by each user, and connects to an access network such as a mobile phone network or a wireless LAN.
Of course, the terminal 2 is not limited to a smartphone or the like and may be, for example, a Web camera installed in a home. Alternatively, video data captured by a Web camera may be recorded on an SD card, and the recorded video data may be input to the recognition device 1.

Specifically, for example, a user is asked to film his or her own actions with the camera of his or her own smartphone. The smartphone transmits the video data to the recognition device 1. The recognition device 1 estimates the person's action from the video data, and the estimation result is used in various applications.
Note that the functions of the recognition device 1 may be incorporated in the terminal 2.

Patent Document 1: JP 2017-004509 A

Non-Patent Document 1: Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", [online], [retrieved January 24, 2018], Internet <URL: https://arxiv.org/pdf/1506.01497.pdf>
Non-Patent Document 2: Karen Simonyan and Andrew Zisserman, "Two-Stream Convolutional Networks for Action Recognition in Videos", in NIPS 2014, [online], [retrieved January 24, 2018], Internet <URL: https://arxiv.org/abs/1406.2199.pdf>
Non-Patent Document 3: Alejandro Hernandez Ruiz, Lorenzo Porzi, Samuel Rota Bulò, Francesc Moreno-Noguer, "3D CNNs on Distance Matrices for Human Action Recognition", [online], [retrieved January 24, 2018], Internet <URL: https://www.researchgate.net/publication/320543521_3D_CNNs_on_Distance_Matrices_for_Human_Action_Recognition>
Non-Patent Document 4: Zhe Cao, Tomas Simon, Shih-En Wei, Yaser Sheikh, "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields", [online], [retrieved January 28, 2018], Internet <URL: https://arxiv.org/pdf/1611.08050.pdf>
Non-Patent Document 5: Yasutake Takahashi, Minoru Asada, "Acquisition of Action by Hierarchical Construction of Multiple Learning Mechanisms" (複数の学習機構の階層的構築による行動獲得), [online], [retrieved January 24, 2018], Internet <URL: http://www.er.ams.eng.osaka-u.ac.jp/Paper/1999/Takahashi99c.pdf>
Non-Patent Document 6: Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton and Jeff Dean, "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer", [online], [retrieved January 24, 2018], Internet <URL: https://openreview.net/pdf?id=B1ckMDqlg>
Non-Patent Document 7: OpenPose, [online], [retrieved January 24, 2018], Internet <URL: https://github.com/CMU-Perceptual-Computing-Lab/openpose>
Non-Patent Document 8: "I tried OpenPose, which can detect bones from videos and photos", [online], [retrieved January 24, 2018], Internet <URL: http://hackist.jp/?p=8285>
Non-Patent Document 9: "OpenPose keeps being upgraded, and 3d pose estimation can now be tried as well", [online], [retrieved January 24, 2018], Internet <URL: http://izm-11.hatenablog.com/entry/2017/08/01/140945>

With the techniques of Non-Patent Documents 1 to 4, the learning model expected to give the highest recognition accuracy must be determined in advance. Depending on the input data, a learning model that turns out not to be optimal may therefore be selected.
With the technique of Patent Document 1, the input data must be fed to every machine learning model in order to compare the models. The more machine learning models there are, the more server resources are required.
The technique of Non-Patent Document 5 executes reinforcement learning of a hierarchical learning model; it does not recognize a context using a plurality of recognition engines.
With the technique of Non-Patent Document 6, if the training of the expert that selects the network is insufficient, the score desired by the user may not be reached.

All of the conventional techniques described above presuppose the use of a recognition engine whose learning model has been optimally built, so the recognition engine with the optimal learning model must be determined in advance according to the input data.
In contrast, the inventors of the present invention considered whether the recognition accuracy of the context (recognition result) could be improved by automatically selecting one or more optimal recognition engines according to the input data.

Accordingly, an object of the present invention is to provide a program, a device, and a method that can improve context recognition accuracy by automatically selecting one or more optimal recognition engines according to the input data.

According to the present invention, a recognition program causes a computer to function so as to estimate a context from input data using a plurality of recognition engines, the program causing the computer to function as:
a selection engine that selects a recognition engine for the input data to be estimated, using a learning model trained on training data in which input data is associated with identifiers of recognition engines, and outputs the input data to the selected recognition engine; and
recognition score determination means that determines a recognition engine whose recognition score, calculated by that recognition engine for the input data, is equal to or greater than a recognition threshold, and feeds the identifier of that recognition engine back to the selection engine,
wherein the selection engine retrains the learning model on training data in which the input data is associated with the fed-back identifier of the recognition engine.

According to another embodiment of the recognition program of the present invention, it is also preferable that
the recognition engine is based on classification that calculates a recognition score for each class, and
the recognition engine calculates, as the recognition score, a statistic of the scores of the classes, namely the maximum, the minimum, the average, or the sum.
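The reduction of per-class scores to a single recognition score can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `recognition_score` and the string labels `"max"`, `"min"`, `"mean"`, and `"sum"` are hypothetical names chosen here for the four statistics listed in the text.

```python
from statistics import mean

# Hypothetical labels for the four statistics named in the text.
STATISTICS = {"max": max, "min": min, "mean": mean, "sum": sum}


def recognition_score(class_scores, statistic="max"):
    """Reduce the per-class scores of one recognition engine to one score."""
    values = list(class_scores.values())
    return STATISTICS[statistic](values)


# Example per-class scores for one engine (illustrative values).
scores = {"c11": 0.5, "c12": 0.2, "c13": 0.1}
print(recognition_score(scores, "max"))  # 0.5
print(recognition_score(scores, "min"))  # 0.1
```

Which statistic suits a given deployment is a design choice: the maximum rewards an engine that is confident about one context, while the average or the sum rewards an engine whose scores are high across several contexts.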

According to another embodiment of the recognition program of the present invention, it is also preferable that
the recognition score determination means feeds back to the selection engine only the identifiers of recognition engines whose processing time for the input data was equal to or less than a predetermined threshold time.

According to another embodiment of the recognition program of the present invention, it is also preferable that
the selection engine is based on classification that calculates a selection score for each class, and
the selection engine outputs the input data to each recognition engine whose selection score for the input data to be estimated is equal to or greater than a first selection threshold.

According to another embodiment of the recognition program of the present invention, it is also preferable that
the selection engine additionally outputs the input data to each recognition engine whose selection score is less than the first selection threshold and equal to or greater than a second selection threshold.
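The two-threshold selection can be sketched as below. This is only an illustration under stated assumptions: the function name `select_engines`, the dictionary layout, and the third engine `"123"` are hypothetical and do not appear in the patent.

```python
def select_engines(selection_scores, first_threshold, second_threshold):
    """Split engines into those at or above the first selection threshold
    and those additionally selected between the second and first thresholds."""
    primary = [engine for engine, score in selection_scores.items()
               if score >= first_threshold]
    additional = [engine for engine, score in selection_scores.items()
                  if second_threshold <= score < first_threshold]
    return primary, additional


# "123" is a hypothetical third engine added to show a rejected candidate.
primary, additional = select_engines(
    {"121": 0.7, "122": 0.6, "123": 0.2},
    first_threshold=0.7,
    second_threshold=0.6,
)
print(primary, additional)  # ['121'] ['122']
```

Keeping the two groups separate lets a caller treat the additionally selected engines differently, for example by running them at lower priority; the patent text itself only states that they also receive the input data.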

According to another embodiment of the recognition program of the present invention, it is also preferable that,
when the difference between the selection score of one recognition engine that is equal to or greater than the first selection threshold and the recognition score of another recognition engine that is less than the first selection threshold is equal to or less than a predetermined difference, the selection engine additionally outputs the input data to the other recognition engine.
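The difference-based rule can be sketched as follows. Note the assumption: the text compares the selection score of the engine above the threshold with the "recognition score" of the other engine, whereas this sketch compares selection scores on both sides, because only selection scores are available before any engine has run. The function name `near_miss_engines`, the parameter `max_gap`, and the engine `"123"` are hypothetical.

```python
def near_miss_engines(selection_scores, first_threshold, max_gap):
    """Return engines below the first selection threshold whose score is
    within max_gap of at least one engine at or above the threshold."""
    above = [score for score in selection_scores.values()
             if score >= first_threshold]
    return [engine for engine, score in selection_scores.items()
            if score < first_threshold
            and any(top - score <= max_gap for top in above)]


extra = near_miss_engines(
    {"121": 0.8, "122": 0.7, "123": 0.3},
    first_threshold=0.75,
    max_gap=0.15,
)
print(extra)  # ['122']
```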

According to another embodiment of the recognition program of the present invention, it is also preferable that
the input data is video data, and
the plurality of recognition engines differ from one another, each being one of:
an object recognition engine based on RGB images;
a moving object recognition engine based on optical flow; and/or
a human joint recognition engine based on skeleton information.

According to the present invention, a recognition device that estimates a context from input data using a plurality of recognition engines includes:
a selection engine that selects a recognition engine for the input data to be estimated, using a learning model trained on training data in which input data is associated with identifiers of recognition engines, and outputs the input data to the selected recognition engine; and
recognition score determination means that determines a recognition engine whose recognition score, calculated by that recognition engine for the input data, is equal to or greater than a recognition threshold, and feeds the identifier of that recognition engine back to the selection engine,
wherein the selection engine retrains the learning model on training data in which the input data is associated with the fed-back identifier of the recognition engine.

According to the present invention, in a recognition method of a device that estimates a context from input data using a plurality of recognition engines,
the device executes:
a first step of selecting a recognition engine for the input data to be estimated, using a learning model trained on training data in which input data is associated with identifiers of recognition engines, and outputting the input data to the selected recognition engine;
a second step of determining a recognition engine whose recognition score, calculated by that recognition engine for the input data, is equal to or greater than a recognition threshold; and
a third step of retraining the learning model on training data in which the input data is associated with the identifier of the recognition engine determined to be true in the second step.
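The three steps of the method can be sketched as one function. This is a structural illustration only: `run_once` and its callable parameters `select`, `engines`, and `retrain` are hypothetical, and using the maximum context score as the recognition score is one of the options the text allows, chosen here as an assumption.

```python
def run_once(x, select, engines, retrain, recognition_threshold):
    """One pass of the selection / recognition / feedback loop."""
    # Step 1: select engines for this input and pass the input to them.
    selected = select(x)
    results = {engine_id: engines[engine_id](x) for engine_id in selected}

    # Step 2: keep engines whose best context score reaches the threshold.
    passed = [engine_id for engine_id, context_scores in results.items()
              if max(context_scores.values()) >= recognition_threshold]

    # Step 3: retrain the selection model on (input, engine identifier) pairs.
    for engine_id in passed:
        retrain(x, engine_id)
    return passed


# Stub engines and a stub retraining hook, for illustration only.
feedback_log = []
stub_engines = {
    "121": lambda x: {"c11": 0.5},
    "122": lambda x: {"c21": 0.7},
}
passed = run_once(
    "clip-001",
    select=lambda x: ["121", "122"],
    engines=stub_engines,
    retrain=lambda x, engine_id: feedback_log.append((x, engine_id)),
    recognition_threshold=0.6,
)
print(passed)  # ['122']
```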

According to the program, device, and method of the present invention, context recognition accuracy can be improved by automatically selecting one or more optimal recognition engines according to the input data.

FIG. 1 is a system configuration diagram including a recognition device.
FIG. 2 is a functional configuration diagram of the recognition device according to the present invention.
FIG. 3 is a specific first processing flow in the present invention.
FIG. 4 is a specific second processing flow in the present invention.
FIG. 5 is a specific third processing flow in the present invention.
FIG. 6 is a specific fourth processing flow in the present invention.
FIG. 7 is a specific fifth processing flow for video data.
FIG. 8 is a flowchart based on FIG. 7.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 2 is a functional configuration diagram of the recognition device according to the present invention.

The recognition device 1 estimates a context (for example, an object, a moving object, or a human action) from input data using a plurality of recognition engines.
As shown in FIG. 2, the recognition device 1 includes a selection engine 11, a plurality of recognition engines 12 (a first recognition engine 121 and a second recognition engine 122), and a recognition score determination unit 13. These functional components can be realized by executing a program that causes a computer installed in the device to function. The processing flow of these functional components can also be understood as a recognition method of the device for input data.

[Selection engine 11]
The selection engine 11 is a classification-based machine learning engine that assigns a class (the identifier of a recognition engine 12) to the input data to be estimated. Its learning model is built in advance from training data in which input data is associated with identifiers of recognition engines.
<Training data>
Input data <-> Identifier of recognition engine
Specifically, the selection engine 11 calculates a score (recognition accuracy) for each recognition engine (class). In general, the single recognition engine with the highest score is selected as the estimation result. According to the present invention, however, the number of selected recognition engines is not limited to one and may be more than one. Embodiments of the selection method of the selection engine 11 are described later with reference to FIGS. 3 to 6.
The selection engine 11 then uses the learning model to select the recognition engines 12 for the input data to be estimated, and outputs the input data to the selected recognition engines 12.

Note that the selection engine 11 of the present invention does not need a complete learning model to be built in advance; it is retrained through feedback from the recognition score determination unit 13 described later. "Retraining" means further training the learning model using the input data and the fed-back identifier of the recognition engine as training data.
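Retraining from feedback can be illustrated with a toy online classifier. The patent does not specify the learning algorithm; the perceptron-style update, the class name `ToySelectionModel`, and the use of a plain feature vector for the input data are all assumptions made for this sketch.

```python
class ToySelectionModel:
    """Toy linear scorer with one weight vector per engine identifier.

    The perceptron-style update is an assumption made for illustration;
    the source does not name a learning algorithm.
    """

    def __init__(self, engine_ids, n_features, learning_rate=0.1):
        self.weights = {engine_id: [0.0] * n_features for engine_id in engine_ids}
        self.learning_rate = learning_rate

    def selection_scores(self, features):
        """Return one selection score per engine identifier."""
        return {engine_id: sum(w * f for w, f in zip(ws, features))
                for engine_id, ws in self.weights.items()}

    def retrain(self, features, engine_id):
        """Further train on one (input data, fed-back identifier) pair."""
        scores = self.selection_scores(features)
        predicted = max(scores, key=scores.get)
        if predicted == engine_id:
            return  # already selecting the fed-back engine
        for i, f in enumerate(features):
            self.weights[engine_id][i] += self.learning_rate * f
            self.weights[predicted][i] -= self.learning_rate * f


model = ToySelectionModel(["121", "122"], n_features=2)
features = [1.0, 0.0]            # hypothetical feature vector of one input
model.retrain(features, "122")   # feedback: engine 122 passed the threshold
scores = model.selection_scores(features)
print(max(scores, key=scores.get))  # 122
```

After the single update, the model prefers engine 122 for this input, which mirrors the statement that the model need not be complete in advance and improves as feedback arrives.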

[Recognition engine 12]
A recognition engine 12 selected by the selection engine 11 receives the input data from the selection engine 11. The recognition engine 12 is also classification-based and calculates a recognition score (recognition accuracy) for each class (each context that can be estimated). In general, the single context with the highest recognition score is output as the estimation result.

According to the present invention, the device has a plurality of recognition engines 12 of different types. For example, different types of recognition engines are combined, such as an engine that mainly recognizes objects, an engine that mainly recognizes coarse actions, and an engine that mainly recognizes fine actions. The learning model of each recognition engine is built in advance from training data that differs according to the type of the engine.

In FIG. 2, there are two recognition engines (the first recognition engine 121 and the second recognition engine 122). The recognition score calculated by a recognition engine 12 may be a "statistic" of the recognition scores of its contexts: the maximum, the minimum, the average, or the sum.
Each recognition engine 12 then outputs the recognition score calculated for each context to the recognition score determination unit 13.

[Recognition score determination unit 13]
The recognition score determination unit 13 determines whether the recognition score calculated for each context by each recognition engine 12 for the input data is equal to or greater than the "recognition threshold".
If the determination is true (recognition score ≧ recognition threshold), the identifier of that recognition engine 12 is fed back to the selection engine 11. In response, the selection engine 11 retrains the learning model using training data in which the input data is associated with the identifier of that recognition engine.
Among the recognition scores calculated by the recognition engines 12, each context whose score is equal to or greater than the recognition threshold is output to the application as an estimation result.
The recognition threshold can be set arbitrarily by the operator.

As a result, by retraining the learning model based on the feedback from the recognition score determination unit 13, the selection engine 11 comes to select, as far as possible, the optimal recognition engine 12 for subsequent input data to be estimated.

FIG. 3 is a specific first processing flow in the present invention.

In FIG. 3, suppose that the selection engine 11 has calculated the following selection scores for the recognition engines with respect to the input data to be estimated.
[Recognition engine ID] [Selection score]
S1 -> 0.7
S2 -> 0.6
<Selection engine 11> * First selection threshold = 0.6
Here, the selection engine 11 outputs the input data to both recognition engines 121 and 122, whose selection scores are equal to or greater than the first selection threshold (for example, 0.6).
The first selection threshold can be set arbitrarily by the operator.

Next, the first recognition engine 121 and the second recognition engine 122 output the recognition score of each context for the input data to the recognition score determination unit 13. Here, the "maximum" of the recognition scores of the contexts is used as the statistic.
<First recognition engine 121> (Context): (Recognition score)
c11: 0.5
c12: 0.2
c13: 0.1
* Maximum (statistic) = 0.5
<Second recognition engine 122> (Context): (Recognition score)
c21: 0.7
c22: 0.3
c23: 0.3
* Maximum (statistic) = 0.7
<Recognition score determination unit 13> * Recognition threshold = 0.6

The recognition score determination unit 13 determines whether each recognition score is equal to or greater than the recognition threshold (0.6). Here, only the second recognition engine 122 has a recognition score of 0.6 or more, so the identifier of the second recognition engine 122 (ID: 122) is fed back to the selection engine 11.
The selection engine 11 then retrains the learning model on training data in which the input data is associated with the fed-back identifier of the second recognition engine 122.
Although FIG. 3 has been described with the maximum as the statistic, the minimum, the average, or the sum may be used instead.
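The numbers of FIG. 3 can be traced in a few lines. This is a sketch for checking the arithmetic, not the patent's code; it assumes that the selection scores S1 and S2 belong to recognition engines 121 and 122 respectively, and the variable names are hypothetical.

```python
# Values from FIG. 3; mapping S1 -> 121 and S2 -> 122 is assumed.
selection_scores = {"121": 0.7, "122": 0.6}
context_scores = {
    "121": {"c11": 0.5, "c12": 0.2, "c13": 0.1},
    "122": {"c21": 0.7, "c22": 0.3, "c23": 0.3},
}
FIRST_SELECTION_THRESHOLD = 0.6
RECOGNITION_THRESHOLD = 0.6

# Selection engine 11: engines at or above the first selection threshold.
selected = [engine for engine, score in selection_scores.items()
            if score >= FIRST_SELECTION_THRESHOLD]

# Recognition score determination unit 13: maximum used as the statistic.
feedback = [engine for engine in selected
            if max(context_scores[engine].values()) >= RECOGNITION_THRESHOLD]

# Contexts reported to the application as estimation results.
contexts = [context for engine in selected
            for context, score in context_scores[engine].items()
            if score >= RECOGNITION_THRESHOLD]

print(selected, feedback, contexts)  # ['121', '122'] ['122'] ['c21']
```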

FIG. 4 is a specific second processing flow in the present invention.

In FIG. 4, in contrast to FIG. 3, suppose that the selection engine 11 has calculated the following selection scores for the recognition engines with respect to the input data to be estimated.
[Recognition engine ID] [Selection score]
S1 -> 0.7
S2 -> 0.6
<Selection engine 11> * First selection threshold = 0.7
Here, the selection engine 11 outputs the input data only to the first recognition engine 121, whose selection score is equal to or greater than the first selection threshold (for example, 0.7). In this case, the input data is not output to the second recognition engine 122.

Next, the first recognition engine 121 outputs the recognition score of each context for the input data to the recognition score determination unit 13. Here too, the "maximum" of the recognition scores of the contexts is used as the statistic.
<First recognition engine 121> (Context): (Recognition score)
c11: 0.5
c12: 0.2
c13: 0.1
* Maximum (statistic) = 0.5

The recognition score determination unit 13 then feeds back to the selection engine 11 the identifier (ID: 121) of the first recognition engine 121, whose recognition score is equal to or greater than the recognition threshold (for example, 0.5).
The selection engine 11 then further retrains the learning model using training data in which the input data is associated with the identifier of the first recognition engine 121.

FIG. 5 is a specific third processing flow in the present invention.

In FIG. 5, as in FIG. 3, the recognition score calculated by each recognition engine 12 is the maximum of the scores of its contexts, used as the statistic.
<First recognition engine 121> (Context): (Score)
c11: 0.6
c12: 0.2
c13: 0.1
* Maximum (statistic) = 0.6
* Processing time = 100 ms
<Second recognition engine 122> (Context): (Score)
c21: 0.7
c22: 0.3
c23: 0.3
* Maximum (statistic) = 0.7
* Processing time = 500 ms
<Recognition score determination unit 13> * Recognition threshold = 0.6
* Predetermined threshold time = 200 ms

The recognition score determination unit 13 determines whether each recognition score is equal to or greater than the recognition threshold (0.6). Here, the recognition scores of both recognition engines 121 and 122 are 0.6 or more.
The recognition score determination unit 13 also determines whether the processing time required for the input data is equal to or less than the predetermined threshold time (200 ms). Here, the processing time of the second recognition engine 122 is 500 ms, so the determination is false for that engine. In this case, the recognition score determination unit 13 feeds back only the identifier (ID: 121) of the first recognition engine 121 to the selection engine 11.
As a result, the selection engine 11 retrains the learning model using training data in which the input data is associated with the identifier of the first recognition engine 121.
Retraining the learning model of the selection engine 11 based not only on the recognition score but also on the "processing time" of each recognition engine is thus also preferable from the viewpoint of processing resources.
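The combined check on recognition score and processing time can be sketched as below, using the values of the FIG. 5 example. The `EngineResult` class and the function name `feedback_ids_with_time_limit` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class EngineResult:
    engine_id: int
    context_scores: dict   # context label -> score
    processing_ms: float   # time the engine needed for this input


def feedback_ids_with_time_limit(results, recognition_threshold, max_ms):
    """Keep engines that pass both the score check and the time check."""
    return [r.engine_id
            for r in results
            if max(r.context_scores.values()) >= recognition_threshold
            and r.processing_ms <= max_ms]


results = [
    EngineResult(121, {"c11": 0.6, "c12": 0.2, "c13": 0.1}, 100),
    EngineResult(122, {"c21": 0.7, "c22": 0.3, "c23": 0.3}, 500),
]
fed_back = feedback_ids_with_time_limit(results, recognition_threshold=0.6, max_ms=200)
print(fed_back)  # engine 122 passes the score check but exceeds 200 ms
```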

FIG. 6 shows a specific fourth processing flow in the present invention.

FIG. 6 takes into account omissions in the training of the learning model of the selection engine 11. That is, even when the selection score of a recognition engine 12 falls below the first selection threshold, the selection engine 11 may have failed to select a recognition engine 12 that should have been selected because its learning model was incompletely trained. In that case, the recognition score of that recognition engine 12 is examined to determine whether to use it for retraining the learning model of the selection engine 11.
Two embodiments are described with reference to FIG. 6.

<First Embodiment>
As described above, the selection engine 11 selects the first recognition engine 121, whose selection score is equal to or greater than the first selection threshold (for example, 0.6).
In addition, the selection engine 11 also selects the second recognition engine 122, whose selection score is less than the first selection threshold (for example, 0.6) and equal to or greater than a second selection threshold (for example, 0.5).
Then, the selection engine 11 outputs the input data to both of the selected recognition engines 121 and 122.
Next, the recognition engines 121 and 122 output a recognition score for each context of the input data to the recognition score determination unit 13.
Then, the recognition score determination unit 13 feeds back to the selection engine 11 the identifiers of the recognition engines 121 and 122 whose recognition scores are equal to or greater than the recognition threshold (for example, 0.6).
As a result, the selection engine 11 retrains the learning model using training data in which the input data is associated with the identifiers of the recognition engines 121 and 122.
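A minimal sketch of the two-threshold selection in the first embodiment follows. The function name and the example selection scores (0.7, 0.55, 0.3) are assumptions chosen for illustration; only the thresholds 0.6 and 0.5 come from the text.

```python
def select_with_second_threshold(selection_scores, first_threshold, second_threshold):
    """Split engines into those at or above the first threshold and those
    between the second threshold (inclusive) and the first (exclusive)."""
    primary = [e for e, s in selection_scores.items() if s >= first_threshold]
    secondary = [e for e, s in selection_scores.items()
                 if second_threshold <= s < first_threshold]
    return primary, secondary


# Hypothetical selection scores for three engines.
scores = {121: 0.7, 122: 0.55, 123: 0.3}
primary, secondary = select_with_second_threshold(scores, 0.6, 0.5)
print(primary + secondary)  # engines that receive the input data
```

Engines in the second group are given the input data as well, so that their recognition scores can still be evaluated for feedback.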

<Second Embodiment>
As described above, the selection engine 11 selects the first recognition engine 121, whose selection score is equal to or greater than the first selection threshold (for example, 0.6).
In addition, the selection engine 11 determines whether the difference between the selection score (for example, 0.7) of the selected first recognition engine 121 and the selection score (for example, 0.5) of the second recognition engine 122, which fell below the first selection threshold, is equal to or less than a predetermined difference (for example, 0.2). If the determination is true, the selection engine 11 also selects the second recognition engine 122, whose selection score is below the first selection threshold (for example, below 0.6).
Then, the selection engine 11 outputs the input data to both of the selected recognition engines 121 and 122.
Next, the recognition engines 121 and 122 output a recognition score for each context of the input data to the recognition score determination unit 13.
Then, the recognition score determination unit 13 feeds back to the selection engine 11 the identifiers of the recognition engines 121 and 122 whose recognition scores are equal to or greater than the recognition threshold (for example, 0.6).
As a result, the selection engine 11 retrains the learning model using training data in which the input data is associated with the identifiers of the recognition engines 121 and 122.
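The difference-based selection of the second embodiment might look like the sketch below. Two points are assumptions made for this example rather than statements of the specification: when several engines pass the first threshold, the highest of their scores is used as the reference, and a small tolerance `eps` is added so that floating-point rounding (0.7 - 0.5 is not exactly 0.2 in binary floating point) does not change the outcome.

```python
def select_with_gap(selection_scores, first_threshold, max_gap, eps=1e-9):
    """Select engines at or above the first threshold, plus engines below it
    whose score is within max_gap of the best selected score."""
    primary = [e for e, s in selection_scores.items() if s >= first_threshold]
    if not primary:
        return [], []
    best = max(selection_scores[e] for e in primary)
    extra = [e for e, s in selection_scores.items()
             if s < first_threshold and best - s <= max_gap + eps]
    return primary, extra


# Values from the second embodiment: scores 0.7 and 0.5, threshold 0.6, gap 0.2.
primary, extra = select_with_gap({121: 0.7, 122: 0.5}, first_threshold=0.6, max_gap=0.2)
print(primary, extra)
```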

In FIGS. 3 to 6 described above, the recognition scores calculated by all of the recognition engines 12 are judged against a single recognition threshold. As another embodiment, a different recognition threshold may be used for each recognition engine 12.

When the recognition scores calculated by all of the recognition engines 12 fall short of the recognition threshold, recognition may be performed by a separate or specific recognition engine, or the learning model of the selection engine 11 may be retrained using training data in which the input data is associated with "no recognition engine".
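Both variations, per-engine recognition thresholds and a "no recognition engine" label, can be combined in a short sketch. The label value `NO_ENGINE`, the function name `training_labels`, and the example scores are hypothetical and serve only to illustrate the idea.

```python
NO_ENGINE = "none"  # hypothetical label meaning "no recognition engine"


def training_labels(engine_scores, thresholds, default_threshold):
    """Return the engine IDs to associate with the input data as training labels.

    engine_scores: engine ID -> recognition score (already reduced to one value).
    thresholds: engine ID -> engine-specific recognition threshold (optional).
    """
    passed = [e for e, s in engine_scores.items()
              if s >= thresholds.get(e, default_threshold)]
    # If no engine reaches its threshold, label the input with NO_ENGINE.
    return passed if passed else [NO_ENGINE]


print(training_labels({121: 0.5, 122: 0.4}, {121: 0.6}, default_threshold=0.5))
print(training_labels({121: 0.7, 122: 0.4}, {122: 0.3}, default_threshold=0.6))
```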

FIG. 7 shows a specific fifth processing flow for video data.
FIG. 8 is a flowchart based on FIG. 7.

Assume that the recognition device 1 receives, as input data, video data in which a person's actions appear, and estimates an action recognition result (context).
According to FIGS. 7 and 8, the device has three mutually different recognition engines.
(1) Object recognition engine based on RGB images
(2) Moving object recognition engine based on optical flow
(3) Human joint recognition engine based on skeleton information
Each of these recognition engines has a learning model generated in advance from training data in which a large amount of video data showing people is associated with action results. Even when the same video data is recognized, the contexts obtained as action results may differ among object recognition, moving object recognition, and joint recognition.

(1) The object recognition engine based on RGB recognition estimates objects appearing in the captured video, specifically by using a neural network such as a CNN (Convolutional Neural Network).
For example, when objects such as a "cup", "smartphone", "television", or "building" appear in the video data, the engine recognizes them with high accuracy.

(2) The moving object recognition engine based on optical flow extracts locations where the same feature point moves between frames and represents the motion of objects in the captured video as "vectors".
For example, when a person's movement such as "grasp", "shake", "punch", or "kick" appears in the video data, the engine recognizes the moving object with high accuracy.

(3) The human joint recognition engine based on skeleton information extracts feature points of human joints, specifically by using a skeleton model such as OpenPose (registered trademark) (see, for example, Non-Patent Documents 7 to 9). OpenPose is software that can detect keypoints of the bodies, hands, and faces of multiple people in an image in real time, and it is published on GitHub. For the whole body of a person appearing in the captured video, for example, 15 keypoints can be detected.
For example, when a person's movement that depends on the angles and positions of the joints, such as "drink", "eat", "run", or "fold", appears in the video data, the engine recognizes the movement of the person's joints with high accuracy.

For human action recognition, moving object recognition and joint recognition generally achieve higher recognition accuracy than object recognition. For recognition of a person's body motion, joint recognition achieves higher recognition accuracy than moving object recognition.

According to FIGS. 7 and 8, processing proceeds as follows.
(S10) The recognition device 1 receives "video data" as input.
(S11) In FIG. 7, the selection engine 11 is assumed to have selected all of the recognition engines 12. In this case, the selection engine 11 outputs the video data to each recognition engine 12.
(S12) Each recognition engine 12 outputs the following contexts and recognition scores.
<RGB recognition engine 121> (Context): (Score)
Cup: 0.7
Smartphone: 0.4
TV: 0.1
* Highest value (statistical value) = 0.7
<Optical flow recognition engine 122> (Context): (Score)
Grasp: 0.4
Shake: 0.2
Punch: 0.1
* Highest value (statistical value) = 0.4
<Skeleton recognition engine 123> (Context): (Score)
Drink: 0.6
Eat: 0.2
Run: 0.0
* Highest value (statistical value) = 0.6
(S13) The recognition score determination unit 13 determines whether each recognition score is equal to or greater than the recognition threshold (0.6). Here, the recognition engines 121 and 123 have recognition scores of 0.6 or more.
In addition, among the recognition scores calculated by the recognition engines 121 and 123, the contexts "cup" and "drink", whose scores are equal to or greater than the recognition threshold, are output to the application as the estimation result.
(S14) The identifiers (ID: 121, 123) of the recognition engines 121 and 123 are fed back to the selection engine 11.
As a result, the selection engine 11 retrains the learning model using training data in which the input data is associated with the fed-back identifiers of the recognition engines 121 and 123.
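Steps S12 to S14 can be traced with the scores of the FIG. 7 example in the following sketch. The function name `judge_outputs` and the data layout are assumptions for illustration; the recognition engines themselves are replaced by fixed score tables.

```python
def judge_outputs(engine_outputs, recognition_threshold):
    """Return (estimated contexts, engine IDs to feed back).

    An engine is fed back when its highest context score reaches the
    threshold; every context at or above the threshold is reported.
    """
    estimates, feedback = [], []
    for engine_id, context_scores in engine_outputs.items():
        if max(context_scores.values()) >= recognition_threshold:
            feedback.append(engine_id)
            estimates.extend(c for c, s in context_scores.items()
                             if s >= recognition_threshold)
    return estimates, feedback


# (S12) Scores output by the three recognition engines in FIG. 7.
engine_outputs = {
    121: {"cup": 0.7, "smartphone": 0.4, "TV": 0.1},   # RGB
    122: {"grasp": 0.4, "shake": 0.2, "punch": 0.1},   # optical flow
    123: {"drink": 0.6, "eat": 0.2, "run": 0.0},       # skeleton
}

# (S13) Judge against the recognition threshold; (S14) collect feedback IDs.
estimates, feedback = judge_outputs(engine_outputs, recognition_threshold=0.6)
print(estimates, feedback)
```

Here the estimation result is "cup" and "drink", and the identifiers 121 and 123 are the ones fed back to the selection engine.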

As another embodiment, the recognition engines of the present invention are not limited to those based on video data; they may perform character recognition, or may be dedicated to recognizing a specific kind of object (for example, types of flowers).

As described in detail above, the program, device, and method of the present invention can improve context recognition accuracy by automatically selecting one or more optimal recognition engines according to the input data.
According to the present invention, the recognition engine for the input data is switched by a selection engine having a learning model, so it does not need to be determined in advance.
In particular, according to the present invention, the learning model of the selection engine can be retrained not only in the learning stage but also in the operation stage.

Those skilled in the art can readily make various changes, modifications, and omissions to the various embodiments of the present invention described above within the scope of the technical idea and viewpoint of the present invention. The foregoing description is merely an example and is not intended to be limiting. The present invention is limited only as defined in the claims and their equivalents.

DESCRIPTION OF SYMBOLS
1 Recognition device
11 Selection engine
12 Recognition engine
121 First recognition engine
122 Second recognition engine
13 Recognition score determination unit
2 Terminal

Claims

A recognition program that causes a computer to function so as to estimate a context from input data using a plurality of recognition engines, the program causing the computer to function as:
a selection engine that selects a recognition engine for input data to be estimated, using a learning model trained on training data in which input data is associated with identifiers of recognition engines, and outputs the input data to the selected recognition engine; and
a recognition score determination means that determines a recognition engine for which the recognition score calculated by the recognition engine for the input data is equal to or greater than a recognition threshold, and feeds back the identifier of that recognition engine to the selection engine,
wherein the selection engine retrains the learning model using training data in which the input data is associated with the fed-back identifier of the recognition engine.

The recognition program according to claim 1, wherein the recognition engine is based on class classification that calculates a score for each class, and
the recognition engine causes the computer to function so as to calculate, as the recognition score, a statistical value that is any one of the maximum value, the minimum value, the average value, and the sum of the plurality of scores of the plurality of classes.

The recognition program according to claim 1 or 2, wherein the recognition score determination means causes the computer to function so as to feed back to the selection engine only the identifier of a recognition engine whose processing time for the input data is equal to or less than a predetermined threshold time.

The selection engine is based on class classification that calculates a selection score for each class, and
the selection engine causes the computer to function so as to output the input data to a recognition engine whose selection score for the input data to be estimated is equal to or greater than a first selection threshold. The recognition program according to any one of claims.

The selection engine further causes the computer to function so as to output the input data to a recognition engine whose selection score is less than the first selection threshold and equal to or greater than a second selection threshold. The recognition program described in 1.

The recognition program according to claim 4, wherein the selection engine further causes the computer to function so as to output the input data to the other recognition engine when the difference between the selection score of one recognition engine that is equal to or greater than the first selection threshold and the recognition score of the other recognition engine that is less than the first selection threshold is equal to or less than a predetermined difference.

The recognition program according to any one of claims 1 to 6, wherein the input data is video data, and
the plurality of recognition engines differ from one another and the computer is caused to function so that each is one of:
an object recognition engine based on RGB images;
a moving object recognition engine based on optical flow; and/or
a human joint recognition engine based on skeleton information.

A recognition device that estimates a context from input data using a plurality of recognition engines, comprising:
a selection engine that selects a recognition engine for input data to be estimated, using a learning model trained on training data in which input data is associated with identifiers of recognition engines, and outputs the input data to the selected recognition engine; and
a recognition score determination means that determines a recognition engine for which the recognition score calculated by the recognition engine for the input data is equal to or greater than a recognition threshold, and feeds back the identifier of that recognition engine to the selection engine,
wherein the selection engine retrains the learning model using training data in which the input data is associated with the fed-back identifier of the recognition engine.

A recognition method for a device that estimates a context from input data using a plurality of recognition engines,
wherein the device executes:
a first step of selecting a recognition engine for input data to be estimated, using a learning model trained on training data in which input data is associated with identifiers of recognition engines, and outputting the input data to the selected recognition engine;
a second step of determining a recognition engine for which the recognition score calculated by the recognition engine for the input data is equal to or greater than a recognition threshold; and
a third step of retraining the learning model using training data in which the input data is associated with the identifier of the recognition engine determined to be true in the second step.