JP6875058B2

JP6875058B2 - Programs, devices and methods for estimating context using multiple recognition engines

Info

Publication number: JP6875058B2
Application number: JP2018021847A
Authority: JP
Inventors: 和之田坂; 広昌柳原
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2018-02-09
Filing date: 2018-02-09
Publication date: 2021-05-19
Anticipated expiration: 2038-02-09
Also published as: JP2019139479A

Description

The present invention relates to a technique for estimating a context using a plurality of recognition engines.

In recent years, deep learning has dramatically improved recognition accuracy in object recognition and human behavior recognition.
For example, there is a technique that takes a specific data set as input and compares candidate machine learning algorithms (see, for example, Patent Document 1). By aggregating the performance results of each machine learning model, this technique automatically compares the evaluations of the models.

One recognition engine for video data is an object recognition technique that detects objects appearing in an RGB image (see, for example, Non-Patent Document 1).
There is also a moving object recognition technique that detects the motion of an object from motion features (optical flow) (see, for example, Non-Patent Document 2). For example, Two-stream ConvNets use a spatial CNN (Spatial stream ConvNet) and a temporal CNN (Temporal stream ConvNet) to extract both the appearance features of objects and backgrounds in an image and the motion features in the sequences of horizontal and vertical optical flow components. Integrating both kinds of features allows behavior to be recognized with high accuracy.
There is also a technique for recognizing a person's behavior from three-dimensional video (see, for example, Non-Patent Document 3).
Furthermore, there is a technique for recognizing a person's behavior by extracting skeleton information on the person's joints and the parts that connect them (see, for example, Non-Patent Document 4).

In another field of application, the autonomous operation of robots, there is a technique that performs reinforcement learning of a hierarchical learning model (see, for example, Non-Patent Document 5). There is also a technique that selects, according to the input data, the expert networks, that is, the machine learning networks used by a gate network (see, for example, Non-Patent Document 6).

FIG. 1 is a configuration diagram of a system that includes a recognition device.

In the system of FIG. 1, the recognition device 1 functions as a server connected to the Internet. The recognition device 1 has a recognition engine whose learning model has been built in advance from teacher data. When the recognition engine recognizes human behavior, the teacher data associates, in advance, video data showing a person's behavior with the target of that behavior (context).

Each terminal 2 is equipped with a camera and transmits video data capturing a person's behavior to the recognition device 1. The terminal 2 is a smartphone or mobile terminal carried by a user and connects to an access network such as a mobile phone network or a wireless LAN.
Of course, the terminal 2 is not limited to a smartphone and may be, for example, a web camera installed in a home. Alternatively, video data captured by the web camera may be recorded on an SD card and the recorded video data then input to the recognition device 1.

Specifically, for example, a user records his or her own behavior with the camera of his or her smartphone. The smartphone transmits the video data to the recognition device 1. The recognition device 1 estimates the person's behavior from the video data, and the estimation result is used by various applications.
Note that the functions of the recognition device 1 may instead be built into the terminal 2.

Patent Document 1: JP-A-2017-004509

Non-Patent Document 1: Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", [online], [retrieved January 24, 2018], Internet <URL: https://arxiv.org/pdf/1506.01497.pdf>
Non-Patent Document 2: Karen Simonyan and Andrew Zisserman, "Two-Stream Convolutional Networks for Action Recognition in Videos", in NIPS 2014, [online], [retrieved January 24, 2018], Internet <URL: https://arxiv.org/abs/1406.2199.pdf>
Non-Patent Document 3: Alejandro Hernandez Ruiz, Lorenzo Porzi, Samuel Rota Bulò, Francesc Moreno-Noguer, "3D CNNs on Distance Matrices for Human Action Recognition", [online], [retrieved January 24, 2018], Internet <URL: https://www.researchgate.net/publication/320543521_3D_CNNs_on_Distance_Matrices_for_Human_Action_Recognition>
Non-Patent Document 4: Zhe Cao, Tomas Simon, Shih-En Wei, Yaser Sheikh, "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields", [online], [retrieved January 28, 2018], Internet <URL: https://arxiv.org/pdf/1611.08050.pdf>
Non-Patent Document 5: Yasutake Takahashi, Minoru Asada, "Behavior acquisition by hierarchical construction of multiple learning mechanisms" (in Japanese), [online], [retrieved January 24, 2018], Internet <URL: http://www.er.ams.eng.osaka-u.ac.jp/Paper/1999/Takahashi99c.pdf>
Non-Patent Document 6: Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer", [online], [retrieved January 24, 2018], Internet <URL: https://openreview.net/pdf?id=B1ckMDqlg>
Non-Patent Document 7: OpenPose, [online], [retrieved January 24, 2018], Internet <URL: https://github.com/CMU-Perceptual-Computing-Lab/openpose>
Non-Patent Document 8: "I tried OpenPose, which can detect bones from videos and photos" (in Japanese), [online], [retrieved January 24, 2018], Internet <URL: http://hackist.jp/?p=8285>
Non-Patent Document 9: "OpenPose keeps being upgraded, and 3d pose estimation can now be tried as well" (in Japanese), [online], [retrieved January 24, 2018], Internet <URL: http://izm-11.hatenablog.com/entry/2017/08/01/140945>

With the techniques of Non-Patent Documents 1 to 4, the learning model expected to give the highest recognition accuracy must be decided in advance. As a result, depending on the input data, a non-optimal learning model may end up being selected.
With the technique of Patent Document 1, the input data must be fed to every machine learning model in order to compare them. The more machine learning models there are, the more server resources are required.
The technique of Non-Patent Document 5 performs reinforcement learning of a hierarchical learning model; it does not recognize a context using a plurality of recognition engines.
With the technique of Non-Patent Document 6, if the expert that selects the network has not been trained sufficiently, the score may fall short of the score the user desires.

All of the conventional techniques described above presuppose the use of a recognition engine whose learning model has been optimally built, so the recognition engine with the optimal learning model must be decided in advance according to the input data.
In contrast, the inventors of the present invention considered whether the recognition accuracy of the context (recognition result) could be improved by automatically selecting one or more optimal recognition engines according to the input data.

Accordingly, an object of the present invention is to provide a program, a device, and a method that can improve context recognition accuracy by automatically selecting one or more optimal recognition engines according to the input data.

According to the present invention, a recognition program that causes a computer to estimate a context from input data using a plurality of recognition engines causes the computer to function as:
a selection engine that, using a learning model trained on teacher data associating input data with recognition engine identifiers, selects a recognition engine for the input data to be estimated and outputs the input data to the selected recognition engine; and
recognition score determination means that determines the recognition engine whose recognition score, calculated by the recognition engine for the input data, is equal to or greater than a recognition threshold, and feeds back the identifier of that recognition engine to the selection engine,
wherein the selection engine relearns the learning model using teacher data that associates the input data with the fed-back recognition engine identifier.

According to another embodiment of the recognition program of the present invention, it is also preferable that
the recognition engine is based on class classification that calculates a recognition score for each class, and
the recognition engine calculates, as the recognition score, one of the following statistics of the scores of the plurality of classes: the highest value, the lowest value, the average value, or the sum.
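The choice of statistic can be pictured with a minimal sketch. The function name `recognition_score`, the `STATISTICS` table, and the sample scores are hypothetical and serve only as an illustration; the source does not specify an implementation.

```python
from statistics import mean

# Hypothetical mapping from a statistic name to the function that computes it.
STATISTICS = {
    "max": max,    # highest value
    "min": min,    # lowest value
    "mean": mean,  # average value
    "sum": sum,    # sum of all class scores
}


def recognition_score(class_scores, statistic="max"):
    """Collapse the per-class scores of one recognition engine into one score."""
    if not class_scores:
        raise ValueError("class_scores must not be empty")
    return STATISTICS[statistic](class_scores.values())


# Illustrative per-class (per-context) scores for one engine.
sample = {"c11": 0.5, "c12": 0.2, "c13": 0.1}
print(recognition_score(sample, "max"))
print(recognition_score(sample, "min"))
```

Which statistic suits a given engine is a design choice: the highest value rewards one confident class, whereas the average or the sum reflects how the engine scores across all of its classes.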

According to another embodiment of the recognition program of the present invention, it is also preferable that
the recognition score determination means feeds back to the selection engine only the identifiers of recognition engines whose processing time for the input data was equal to or less than a predetermined threshold time.

According to another embodiment of the recognition program of the present invention, it is also preferable that
the selection engine is based on class classification that calculates a selection score for each class, and
the selection engine outputs the input data to each recognition engine whose selection score for the input data to be estimated is equal to or greater than a first selection threshold.

According to another embodiment of the recognition program of the present invention, it is also preferable that
the selection engine additionally outputs the input data to each recognition engine whose selection score is less than the first selection threshold and equal to or greater than a second selection threshold.

According to another embodiment of the recognition program of the present invention, it is also preferable that,
when the difference between the selection score of one recognition engine that is at or above the first selection threshold and the selection score of another recognition engine that is below the first selection threshold is equal to or less than a predetermined difference, the selection engine additionally outputs the input data to the other recognition engine.
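The three selection rules described in these embodiments (first selection threshold, second selection threshold, and predetermined difference) can be sketched together as follows. The function `select_engines`, its parameter names, and the sample scores are assumptions made for illustration, not the patented implementation.

```python
def select_engines(selection_scores, first_threshold,
                   second_threshold=None, max_gap=None):
    """Return the IDs of the recognition engines that should receive the input.

    selection_scores: {engine_id: selection score}
    second_threshold and max_gap are optional; None disables that rule.
    """
    # Rule 1: engines at or above the first selection threshold.
    primary = [e for e, s in selection_scores.items() if s >= first_threshold]
    extra = []
    for engine_id, score in selection_scores.items():
        if score >= first_threshold:
            continue
        # Rule 2: below the first threshold but at or above the second one.
        in_band = second_threshold is not None and score >= second_threshold
        # Rule 3: within max_gap of an engine that satisfied rule 1.
        close = max_gap is not None and any(
            selection_scores[p] - score <= max_gap for p in primary)
        if in_band or close:
            extra.append(engine_id)
    return primary + extra


scores = {"121": 0.7, "122": 0.6}  # illustrative selection scores
print(select_engines(scores, first_threshold=0.7))
print(select_engines(scores, first_threshold=0.7, second_threshold=0.6))
print(select_engines(scores, first_threshold=0.7, max_gap=0.1))
```

With these sample scores, the first call selects only engine 121, while either relaxed rule also admits engine 122. The relaxed rules give an engine that narrowly missed the first threshold a chance to produce a recognition score.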

According to another embodiment of the recognition program of the present invention, it is also preferable that
the input data is video data, and
the plurality of recognition engines differ from one another, each being one of:
an object recognition engine based on RGB images,
a moving object recognition engine based on optical flow, and/or
a human joint recognition engine based on skeleton information.

According to the present invention, a recognition device that estimates a context from input data using a plurality of recognition engines has:
a selection engine that, using a learning model trained on teacher data associating input data with recognition engine identifiers, selects a recognition engine for the input data to be estimated and outputs the input data to the selected recognition engine; and
recognition score determination means that determines the recognition engine whose recognition score, calculated by the recognition engine for the input data, is equal to or greater than a recognition threshold, and feeds back the identifier of that recognition engine to the selection engine,
wherein the selection engine relearns the learning model using teacher data that associates the input data with the fed-back recognition engine identifier.

According to the present invention, in a recognition method for a device that estimates a context from input data using a plurality of recognition engines,
the device executes:
a first step of selecting, using a learning model trained on teacher data associating input data with recognition engine identifiers, a recognition engine for the input data to be estimated, and outputting the input data to the selected recognition engine;
a second step of determining the recognition engine whose recognition score, calculated by the recognition engine for the input data, is equal to or greater than a recognition threshold; and
a third step of relearning the learning model using teacher data that associates the input data with the identifier of the recognition engine determined to be true in the second step.

According to the program, device, and method of the present invention, context recognition accuracy can be improved by automatically selecting one or more optimal recognition engines according to the input data.

FIG. 1 is a configuration diagram of a system that includes a recognition device.
FIG. 2 is a functional configuration diagram of the recognition device according to the present invention.
FIG. 3 shows a specific first processing flow in the present invention.
FIG. 4 shows a specific second processing flow in the present invention.
FIG. 5 shows a specific third processing flow in the present invention.
FIG. 6 shows a specific fourth processing flow in the present invention.
FIG. 7 shows a specific fifth processing flow for video data.
FIG. 8 is a flowchart based on FIG. 7.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 2 is a functional configuration diagram of the recognition device according to the present invention.

The recognition device 1 uses a plurality of recognition engines to estimate a context (for example, an object, a moving object, or a person's behavior) from input data.
As shown in FIG. 2, the recognition device 1 has a selection engine 11, a plurality of recognition engines 12 (a first recognition engine 121 and a second recognition engine 122), and a recognition score determination unit 13. These functional components are realized by executing a program that causes a computer mounted on the device to function. The processing flow of these components can also be understood as a recognition method that the device applies to input data.

[Selection engine 11]
The selection engine 11 is a machine learning engine based on class classification that assigns a class (the identifier of a recognition engine 12) to the input data to be estimated. Its learning model is built in advance from teacher data that associates input data with recognition engine identifiers.
<Teacher data>
Input data <-> Recognition engine identifier
Specifically, the selection engine 11 calculates a score (recognition accuracy) for each recognition engine (class). In general, the single recognition engine with the highest score is selected as the estimation result. In the present invention, however, the number of selected recognition engines is not limited to one and may be more than one. Embodiments of the selection method of the selection engine 11 are described later with reference to FIGS. 3 to 6.
Using the learning model, the selection engine 11 then selects the recognition engine 12 for the input data to be estimated and outputs the input data to the selected recognition engine 12.

Note that the selection engine 11 of the present invention does not need a complete learning model built in advance; it relearns from the feedback of the recognition score determination unit 13 described later. "Relearning" means further training the learning model using, as teacher data, the input data and the fed-back recognition engine identifier.
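One way to picture a selection engine that can keep learning from feedback is the toy model below. Everything in it is an assumption made for illustration: the class name `SelectionEngine`, the choice of one logistic unit per recognition engine, the learning rate, and in particular the treatment of engines that were not fed back as negative examples. The source only states that the input data and the fed-back identifier are used as teacher data.

```python
import math


class SelectionEngine:
    """Toy multi-label classifier: one logistic unit per recognition engine."""

    def __init__(self, engine_ids, n_features, learning_rate=0.5):
        self.lr = learning_rate
        # Per engine: n_features weights followed by one bias, all zero at first.
        self.weights = {e: [0.0] * (n_features + 1) for e in engine_ids}

    def _score(self, engine_id, x):
        w = self.weights[engine_id]
        # zip stops after len(x) items, so the trailing bias is added separately.
        z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def selection_scores(self, x):
        """Return {engine_id: selection score in (0, 1)} for feature vector x."""
        return {e: self._score(e, x) for e in self.weights}

    def relearn(self, x, fed_back_ids):
        """One gradient step with (x, fed-back engine IDs) as teacher data.

        Assumption: engines that were not fed back are treated as negatives.
        """
        for e, w in self.weights.items():
            target = 1.0 if e in fed_back_ids else 0.0
            error = self._score(e, x) - target
            for i, xi in enumerate(x):
                w[i] -= self.lr * error * xi
            w[-1] -= self.lr * error


engine = SelectionEngine(["121", "122"], n_features=2)
x = [1.0, 0.0]                      # hypothetical feature vector of one input
print(engine.selection_scores(x))   # untrained: both scores are 0.5
for _ in range(20):
    engine.relearn(x, {"122"})      # feedback: engine 122 passed the threshold
print(engine.selection_scores(x))   # the score of 122 rises, that of 121 falls
```

Because each engine has its own independent score, more than one engine can exceed a selection threshold for the same input, which matches the statement that the number of selected engines is not limited to one.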

[Recognition engine 12]
A recognition engine 12 selected by the selection engine 11 receives the input data from the selection engine 11. The recognition engine 12 is also based on class classification and calculates a recognition score (recognition accuracy) for each class (each context it can estimate). In general, the single context with the highest recognition score is output as the estimation result.

The present invention has a plurality of recognition engines 12 of different types. For example, different types of recognition engine are combined, such as an engine that mainly recognizes objects, an engine that mainly recognizes coarse actions, and an engine that mainly recognizes fine actions. The learning model of each recognition engine is built in advance from teacher data that differs according to the engine type.

In FIG. 2, there are two recognition engines (the first recognition engine 121 and the second recognition engine 122). The recognition score calculated by a recognition engine 12 may be a "statistic" of the recognition scores of its contexts: the highest value, the lowest value, the average value, or the sum.
Each recognition engine 12 then outputs the recognition score calculated for each context to the recognition score determination unit 13.

[Recognition score determination unit 13]
The recognition score determination unit 13 determines whether the recognition score calculated for each context by each recognition engine 12 for the input data is equal to or greater than a "recognition threshold".
If the result is true (recognition score ≥ recognition threshold), the identifier of that recognition engine 12 is fed back to the selection engine 11. The selection engine 11 then relearns the learning model using teacher data that associates the input data with that recognition engine identifier.
In addition, among the recognition scores calculated by the recognition engines 12, each context whose score is equal to or greater than the recognition threshold is output to the application as an estimation result.
The recognition threshold can be set freely by the operator.

As a result, by relearning the learning model from the feedback of the recognition score determination unit 13, the selection engine 11 comes to select, for subsequent input data, a recognition engine 12 that is as close to optimal as possible.

FIG. 3 shows a specific first processing flow in the present invention.

In FIG. 3, suppose the selection engine 11 has calculated the following selection scores for the recognition engines, for the input data to be estimated.
[Recognition engine ID] [Selection score]
S1 -> 0.7
S2 -> 0.6
<Selection engine 11> * First selection threshold = 0.6
Here, the selection engine 11 outputs the input data to both recognition engines 121 and 122, whose selection scores are equal to or greater than the first selection threshold (for example, 0.6).
The first selection threshold can be set freely by the operator.

Next, the first recognition engine 121 and the second recognition engine 122 output the recognition score of each context for the input data to the recognition score determination unit 13. Here, the "highest value" among the recognition scores of the contexts is used as the statistic.
<First recognition engine 121> (Context): (Recognition score)
c11: 0.5
c12: 0.2
c13: 0.1
* Highest value (statistic) = 0.5
<Second recognition engine 122> (Context): (Recognition score)
c21: 0.7
c22: 0.3
c23: 0.3
* Highest value (statistic) = 0.7
<Recognition score determination unit 13> * Recognition threshold = 0.6

The recognition score determination unit 13 determines whether each recognition score is equal to or greater than the recognition threshold (0.6). Here, only the second recognition engine 122 has a recognition score of 0.6 or more, so the identifier of the second recognition engine 122 (ID: 122) is fed back to the selection engine 11.
The selection engine 11 then relearns the learning model using teacher data that associates the input data with the fed-back identifier of the second recognition engine 122.
Although FIG. 3 has been described with the highest value as the statistic, the lowest value, the average value, or the sum may be used instead.
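The FIG. 3 flow can be traced end to end with a small sketch. The function `run_once`, the stub engines, the placeholder input `"video-frame"`, and the `relearn` callback are hypothetical; only the scores and thresholds are taken from the FIG. 3 description.

```python
def run_once(input_data, selection_scores, engines, relearn,
             selection_threshold, recognition_threshold):
    """Select engines, run them, judge their scores, and feed back engine IDs."""
    selected = [e for e, s in selection_scores.items()
                if s >= selection_threshold]
    contexts, fed_back = [], []
    for engine_id in selected:
        class_scores = engines[engine_id](input_data)  # {context: score}
        # Statistic used here: the highest per-context recognition score.
        if max(class_scores.values()) >= recognition_threshold:
            fed_back.append(engine_id)
        # Contexts at or above the threshold go to the application.
        contexts += [c for c, s in class_scores.items()
                     if s >= recognition_threshold]
    for engine_id in fed_back:
        relearn(input_data, engine_id)  # new teacher data: (input, engine ID)
    return contexts, fed_back


# Stub engines that return the FIG. 3 scores regardless of the input.
engines = {
    "121": lambda _x: {"c11": 0.5, "c12": 0.2, "c13": 0.1},
    "122": lambda _x: {"c21": 0.7, "c22": 0.3, "c23": 0.3},
}
teacher_data = []
contexts, fed_back = run_once(
    "video-frame", {"121": 0.7, "122": 0.6}, engines,
    relearn=lambda x, e: teacher_data.append((x, e)),
    selection_threshold=0.6, recognition_threshold=0.6)
print(contexts, fed_back, teacher_data)
# prints ['c21'] ['122'] [('video-frame', '122')]
```

Both engines are run because both selection scores reach 0.6, but only engine 122 produces a score at or above the recognition threshold, so only its identifier becomes new teacher data.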

FIG. 4 shows a specific second processing flow in the present invention.

FIG. 4 differs from FIG. 3 in the first selection threshold. Suppose the selection engine 11 has calculated the following selection scores for the recognition engines, for the input data to be estimated.
[Recognition engine ID] [Selection score]
S1 -> 0.7
S2 -> 0.6
<Selection engine 11> * First selection threshold = 0.7
Here, the selection engine 11 outputs the input data only to the first recognition engine 121, whose selection score is equal to or greater than the first selection threshold (for example, 0.7). In this case, the input data is not output to the second recognition engine 122.

Next, the first recognition engine 121 outputs the recognition score of each context for the input data to the recognition score determination unit 13. Here too, the "highest value" among the recognition scores of the contexts is used as the statistic.
<First recognition engine 121> (Context): (Recognition score)
c11: 0.5
c12: 0.2
c13: 0.1
* Highest value (statistic) = 0.5

The recognition score determination unit 13 then feeds back to the selection engine 11 the identifier (ID: 121) of the first recognition engine 121, whose recognition score is equal to or greater than the recognition threshold (for example, 0.5).
The selection engine 11 then further relearns the learning model using teacher data that associates the input data with the identifier of the first recognition engine 121.

FIG. 5 shows a specific third processing flow in the present invention.

In FIG. 5, as in FIG. 3, the recognition score of each recognition engine 12 is calculated using the highest value among the scores of its contexts as the statistic.
<First recognition engine 121> (Context): (Recognition score)
c11: 0.6
c12: 0.2
c13: 0.1
* Highest value (statistic) = 0.6
* Processing time = 100 ms
<Second recognition engine 122> (Context): (Recognition score)
c21: 0.7
c22: 0.3
c23: 0.3
* Highest value (statistic) = 0.7
* Processing time = 500 ms
<Recognition score determination unit 13> * Recognition threshold = 0.6
* Predetermined threshold time = 200 ms

The recognition score determination unit 13 determines whether or not each recognition score is equal to or higher than the recognition threshold (0.6). Here, the recognition scores of both recognition engines 121 and 122 are 0.6 or higher.
The recognition score determination unit 13 also determines whether or not the processing time required for the input data is equal to or less than the predetermined threshold time (200 ms). Here, the processing time of the second recognition engine 122 is 500 ms, so the result for that engine is false. In this case, the recognition score determination unit 13 feeds back only the identifier (ID: 121) of the first recognition engine 121 to the selection engine 11.
As a result, the selection engine 11 relearns the learning model using teacher data in which the input data is associated with the identifier of the first recognition engine 121.
Relearning the learning model of the selection engine 11 based not only on the recognition score but also on the "processing time" of the recognition engine is thus also preferable from the viewpoint of processing resources.
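
The combined score and processing-time check can be sketched as below, using the figures from FIG. 5. The `EngineResult` class and the `feedback_ids` function are hypothetical names introduced for illustration; how the processing time is measured is not covered.

```python
from dataclasses import dataclass

@dataclass
class EngineResult:
    engine_id: int
    scores: dict        # context -> recognition score
    elapsed_ms: float   # processing time spent on this input

def feedback_ids(results, score_threshold=0.6, time_limit_ms=200):
    """Keep engines that meet the score threshold AND the time limit."""
    return [r.engine_id for r in results
            if max(r.scores.values()) >= score_threshold
            and r.elapsed_ms <= time_limit_ms]

results = [
    EngineResult(121, {"c11": 0.6, "c12": 0.2, "c13": 0.1}, 100),
    EngineResult(122, {"c21": 0.7, "c22": 0.3, "c23": 0.3}, 500),
]
print(feedback_ids(results))  # [121]
```

Engine 122 scores higher than engine 121 but is excluded because its 500 ms processing time exceeds the 200 ms limit.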

FIG. 6 shows a specific fourth processing flow of the present invention.

FIG. 6 takes into account learning omissions in the learning model of the selection engine 11. That is, even when the selection score of a recognition engine 12 falls below the first selection threshold, it is possible that a recognition engine 12 that should have been selected was not selected because the learning model of the selection engine 11 was incompletely trained. In that case, the recognition score of that recognition engine 12 is evaluated anew to determine whether or not to use it for relearning the learning model of the selection engine 11.
FIG. 6 describes two embodiments.

<First embodiment>
As described above, the selection engine 11 selects the first recognition engine 121, whose selection score is equal to or higher than the first selection threshold (for example, 0.6).
In addition, the selection engine 11 also selects the second recognition engine 122, whose selection score is less than the first selection threshold (for example, 0.6) and equal to or higher than the second selection threshold (for example, 0.5).
Then, the selection engine 11 outputs the input data to both of the selected recognition engines 121 and 122.
Next, the recognition engines 121 and 122 output the recognition score for each context of the input data to the recognition score determination unit 13.
Then, the recognition score determination unit 13 feeds back to the selection engine 11 the identifiers of the recognition engines 121 and 122 whose recognition scores are equal to or higher than the recognition threshold (for example, 0.6).
As a result, the selection engine 11 relearns the learning model using teacher data in which the input data is associated with the identifiers of the recognition engines 121 and 122.
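
The two-threshold selection of the first embodiment can be sketched as follows. The function name `select_engines` and the sample selection scores (including the third engine 123 with score 0.2) are assumptions for illustration only.

```python
def select_engines(selection_scores, first_threshold=0.6, second_threshold=0.5):
    """Select engines at or above the first threshold, plus those that fall
    between the second and first thresholds (possible learning omissions)."""
    primary = [e for e, s in selection_scores.items() if s >= first_threshold]
    secondary = [e for e, s in selection_scores.items()
                 if second_threshold <= s < first_threshold]
    return primary + secondary

# Illustrative selection scores per engine ID.
print(select_engines({121: 0.7, 122: 0.5, 123: 0.2}))  # [121, 122]
```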

<Second embodiment>
As described above, the selection engine 11 selects the first recognition engine 121, whose selection score is equal to or higher than the first selection threshold (for example, 0.6).
In addition, the selection engine 11 determines whether or not the difference between the selection score of the selected first recognition engine 121 (for example, 0.7) and the selection score of the second recognition engine 122, which is less than the first selection threshold (for example, 0.5), is equal to or less than a predetermined difference (for example, 0.2). If the result is true, the selection engine 11 also selects the second recognition engine 122, even though its selection score is less than the first selection threshold (for example, 0.6).
Then, the selection engine 11 outputs the input data to both of the selected recognition engines 121 and 122.
Next, the recognition engines 121 and 122 output the recognition score for each context of the input data to the recognition score determination unit 13.
Then, the recognition score determination unit 13 feeds back to the selection engine 11 the identifiers of the recognition engines 121 and 122 whose recognition scores are equal to or higher than the recognition threshold (for example, 0.6).
As a result, the selection engine 11 relearns the learning model using teacher data in which the input data is associated with the identifiers of the recognition engines 121 and 122.
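
The difference-based selection of the second embodiment can be sketched as below. The name `select_with_margin` is hypothetical, and two details are assumptions not stated in the text: when several engines pass the first threshold, the highest of their scores is used as the reference, and a small tolerance is added to the comparison to absorb floating-point rounding.

```python
def select_with_margin(selection_scores, first_threshold=0.6, max_gap=0.2):
    """Select engines at or above the first threshold, plus engines below it
    whose score is within max_gap of the best selected score."""
    primary = [e for e, s in selection_scores.items() if s >= first_threshold]
    if not primary:
        return []
    best = max(selection_scores[e] for e in primary)  # assumed reference score
    near = [e for e, s in selection_scores.items()
            if s < first_threshold and best - s <= max_gap + 1e-9]
    return primary + near

# 0.7 - 0.5 is within the gap of 0.2; 0.7 - 0.3 is not.
print(select_with_margin({121: 0.7, 122: 0.5, 123: 0.3}))  # [121, 122]
```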

According to FIGS. 3 to 6 described above, the recognition scores calculated by all the recognition engines 12 are judged against a single recognition threshold. As another embodiment, however, a different recognition threshold may be used for each recognition engine 12.

If the recognition scores calculated by all the recognition engines 12 fall short of the recognition threshold, the input data may be recognized by a separate or specific recognition engine, or the learning model of the selection engine 11 may be relearned using teacher data in which the input data is associated with "no recognition engine".
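
Per-engine thresholds and the "no recognition engine" label can be combined in one sketch. All names here (`NO_ENGINE`, `feedback_labels`, the default threshold) and the sample scores are hypothetical; representing "no recognition engine" as `None` is an assumption made for illustration.

```python
NO_ENGINE = None  # hypothetical label meaning "no recognition engine"

def feedback_labels(scores_by_engine, thresholds, default_threshold=0.6):
    """Judge each engine against its own threshold; if none passes,
    return the NO_ENGINE label so that it can still be used as teacher data."""
    passed = [e for e, scores in scores_by_engine.items()
              if max(scores.values()) >= thresholds.get(e, default_threshold)]
    return passed if passed else [NO_ENGINE]

scores = {121: {"c11": 0.55}, 122: {"c21": 0.55}}
print(feedback_labels(scores, {121: 0.5, 122: 0.7}))  # [121]
print(feedback_labels(scores, {121: 0.9, 122: 0.7}))  # [None]
```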

FIG. 7 shows a specific fifth processing flow, for video data.
FIG. 8 is a flowchart based on FIG. 7.

Assume that the recognition device 1 receives, as input data, video data in which a person's behavior appears, and estimates a behavior recognition result (context).
According to FIGS. 7 and 8, the device has three recognition engines that differ from one another.
(1) Object recognition engine based on RGB images
(2) Moving object recognition engine based on optical flow
(3) Joint recognition engine for persons based on skeleton information
Each of these recognition engines has a learning model generated in advance from teacher data in which a large amount of video data showing persons is associated with behavior results. In object recognition, moving object recognition, and joint recognition, the contexts obtained as behavior results may differ even when the same video data is recognized.

(1) The object recognition engine based on RGB recognition estimates the objects appearing in the captured video, specifically by using a neural network such as a CNN (Convolutional Neural Network).
When objects such as a "cup", "smartphone", "television", or "building" appear in the video data, it recognizes the objects with high accuracy.

(2) The moving object recognition engine based on optical flow extracts the locations where the same feature points move between frames and represents the movement of objects in the captured video as "vectors".
When a person's movement such as "grasping", "shaking", "punching", or "kicking" appears in the video data, it recognizes the moving object with high accuracy.

(3) The joint recognition engine for persons based on skeleton information extracts the feature points of human joints, specifically by using a skeleton model such as OpenPose (registered trademark) (see, for example, Non-Patent Documents 7 to 9). OpenPose is software that can detect the body, hand, and face key points of multiple people in an image in real time, and it is published on GitHub. For the whole body of a person appearing in the captured video, for example, 15 key points can be detected.
When a person's movement that depends on the angles and positions of the joints, such as "drinking", "eating", "running", or "folding", appears in the video data, it recognizes the movement of the person's joints with high accuracy.

For human behavior recognition, moving object recognition and joint recognition generally achieve higher recognition accuracy than object recognition. Furthermore, for recognizing the motion of a person's body, joint recognition achieves higher recognition accuracy than moving object recognition.

According to FIGS. 7 and 8, processing proceeds as follows.
(S10) The recognition device 1 receives "video data" as input.
(S11) In FIG. 7, the selection engine 11 is assumed to have selected all the recognition engines 12. In this case, the selection engine 11 outputs the video data to each recognition engine 12.
(S12) Each recognition engine 12 outputs the following contexts and recognition scores.
<RGB recognition engine 121> (Context): (Score)
Cup: 0.7
Smartphone: 0.4
TV: 0.1
* Maximum value (statistical value) = 0.7
<Optical flow recognition engine 122> (Context): (Score)
Gripping: 0.4
Shake: 0.2
Punch: 0.1
* Maximum value (statistical value) = 0.4
<Skeleton recognition engine 123> (Context): (Score)
Drink: 0.6
Eat: 0.2
Run: 0.0
* Maximum value (statistical value) = 0.6
(S13) The recognition score determination unit 13 determines whether or not each recognition score is equal to or higher than the recognition threshold (0.6). Here, the recognition engines 121 and 123 have recognition scores of 0.6 or higher.
In addition, among the recognition scores calculated by the recognition engines 121 and 123, the contexts "cup" and "drink", whose scores are equal to or higher than the recognition threshold, are output to the application as the estimation results.
(S14) The identifiers (ID: 121, 123) of the recognition engines 121 and 123 are fed back to the selection engine 11.
As a result, the selection engine 11 relearns the learning model using teacher data in which the input data is associated with the fed-back identifiers of the recognition engines 121 and 123.
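
Steps S12 to S14 can be traced with the scores listed for FIG. 7. This is a sketch only: the `judge` function and the data layout are hypothetical, the engines themselves are replaced by their listed outputs, and the re-learning of the selection engine is not modeled.

```python
RECOGNITION_THRESHOLD = 0.6

# (S12) Outputs of the three engines, copied from the example for FIG. 7.
outputs = {
    121: {"cup": 0.7, "smartphone": 0.4, "TV": 0.1},     # RGB
    122: {"gripping": 0.4, "shake": 0.2, "punch": 0.1},  # optical flow
    123: {"drink": 0.6, "eat": 0.2, "run": 0.0},         # skeleton
}

def judge(outputs, threshold=RECOGNITION_THRESHOLD):
    """(S13) Collect the engines and contexts whose scores reach the threshold."""
    engine_ids, contexts = [], []
    for engine_id, scores in outputs.items():
        passing = [c for c, s in scores.items() if s >= threshold]
        if passing:
            engine_ids.append(engine_id)  # (S14) identifier to feed back
            contexts.extend(passing)      # estimation results for the application
    return engine_ids, contexts

engine_ids, contexts = judge(outputs)
print(engine_ids)  # [121, 123]
print(contexts)    # ['cup', 'drink']
```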

As another embodiment, the recognition engines of the present invention are not limited to those based on video data; they may perform character recognition, or they may be dedicated to recognizing a specific kind of object (for example, types of flowers).

As described in detail above, according to the program, device, and method of the present invention, the recognition accuracy of the context can be improved by automatically selecting one or more optimum recognition engines according to the input data.
According to the present invention, the recognition engine for the input data does not need to be determined in advance, because it is switched by a selection engine that has a learning model.
In particular, according to the present invention, the learning model of the selection engine can be relearned not only in the learning stage but also in the operation stage.

Those skilled in the art can readily make various changes, modifications, and omissions to the embodiments of the present invention described above within the scope of the technical idea and viewpoint of the present invention. The above description is merely an example and is not intended to be limiting in any way. The present invention is limited only by the scope of the claims and their equivalents.

1 Recognition device
11 Selection engine
12 Recognition engine
121 First recognition engine
122 Second recognition engine
13 Recognition score determination unit
2 Terminal

Claims

A recognition program that causes a computer to function so as to estimate a context from input data using a plurality of recognition engines, the program causing the computer to function as:
a selection engine that, using a learning model trained on teacher data in which input data is associated with an identifier of a recognition engine, selects a recognition engine for the input data to be estimated and outputs the input data to the selected recognition engine; and
recognition score determination means that determines a recognition engine whose recognition score, calculated by that recognition engine for the input data, is equal to or higher than a recognition threshold, and feeds back the identifier of that recognition engine to the selection engine,
wherein the selection engine relearns the learning model using teacher data in which the input data is associated with the fed-back identifier of the recognition engine.

The recognition program according to claim 1, wherein the recognition engine is based on classification that calculates a score for each class, and
the recognition engine causes the computer to function so as to calculate, as the recognition score, a statistical value that is any of the maximum value, the minimum value, the average value, or the sum of the plurality of scores of the plurality of classes.

The recognition program according to claim 1 or 2, wherein the recognition score determination means causes the computer to function so as to feed back to the selection engine only the identifier of a recognition engine whose processing time for the input data is equal to or less than a predetermined threshold time.

The recognition program according to any one of claims 1 to 3, wherein the selection engine is based on classification that calculates a selection score for each class, and
the selection engine causes the computer to function so as to output the input data to a recognition engine whose selection score for the input data to be estimated is equal to or higher than a first selection threshold.

The recognition program according to claim 4, wherein the selection engine further causes the computer to function so as to output the input data to a recognition engine whose selection score is less than the first selection threshold and equal to or higher than a second selection threshold.

The recognition program according to claim 4, wherein, when the difference between the selection score of one recognition engine that is equal to or higher than the first selection threshold and the selection score of another recognition engine that is less than the first selection threshold is equal to or less than a predetermined difference, the computer is further caused to function so as to output the input data to the other recognition engine.

The recognition program according to any one of claims 1 to 6, wherein the input data is video data, and
the plurality of recognition engines differ from one another, and the computer is caused to function so that each is one of:
an object recognition engine based on RGB images,
a moving object recognition engine based on optical flow, and/or
a joint recognition engine for persons based on skeleton information.

A recognition device that estimates a context from input data using a plurality of recognition engines, the device comprising:
a selection engine that, using a learning model trained on teacher data in which input data is associated with an identifier of a recognition engine, selects a recognition engine for the input data to be estimated and outputs the input data to the selected recognition engine; and
recognition score determination means that determines a recognition engine whose recognition score, calculated by that recognition engine for the input data, is equal to or higher than a recognition threshold, and feeds back the identifier of that recognition engine to the selection engine,
wherein the selection engine relearns the learning model using teacher data in which the input data is associated with the fed-back identifier of the recognition engine.

A recognition method of a device that estimates a context from input data using a plurality of recognition engines, wherein the device executes:
a first step of selecting, using a learning model trained on teacher data in which input data is associated with an identifier of a recognition engine, a recognition engine for the input data to be estimated, and outputting the input data to the selected recognition engine;
a second step of determining a recognition engine whose recognition score, calculated by that recognition engine for the input data, is equal to or higher than a recognition threshold; and
a third step of relearning the learning model using teacher data in which the input data is associated with the identifier of the recognition engine determined to be true in the second step.