JP2022546153A

JP2022546153A - Action recognition method, device, computer equipment and storage medium

Info

Publication number: JP2022546153A
Application number: JP2021565729A
Authority: JP
Inventors: フェイワン; チェンチエン
Original assignee: シャンハイセンスタイムリンガンインテリジェントテクノロジーカンパニーリミテッド
Priority date: 2020-07-31
Filing date: 2021-04-16
Publication date: 2022-11-04
Also published as: CN111881854A; KR20220122735A; WO2022021948A1; TWI776566B; TW202207075A

Abstract

Embodiments of the present invention disclose a motion recognition method, an apparatus, computer equipment, and a storage medium. The motion recognition method includes: acquiring a first image; recognizing a target image region that contains a target object in the first image; performing motion detection processing on the target image region using a motion detection network having a plurality of motion detection branches, to obtain a plurality of types of first motion detection results corresponding to the target object, wherein different motion detection branches detect different categories of motion; and determining a second motion detection result for the target object based on the first motion detection results respectively corresponding to the plurality of motion detection branches. [Selected drawing] Fig. 1

Description

(Cross-reference to related applications)
This disclosure is filed on the basis of, and claims priority to, Chinese patent application No. 202010755553.3, filed on July 31, 2020, the entire contents of which are incorporated into this disclosure by reference.

The present disclosure relates to the technical field of computer vision, and in particular to a motion recognition method, an apparatus, computer equipment, and a storage medium.

At present, the Internet education industry is developing rapidly and provides students and teachers with a convenient and comfortable teaching environment. Making classroom interaction intelligent is an important direction for Internet education today, and it relies mainly on recognizing students' motions and facial expressions. In conventional Internet education, interaction between students and teachers is accomplished mainly by means such as ringing an electronic bell, which makes it difficult to judge a student's state and limits the user experience.

Embodiments of the present invention provide at least a motion recognition method, an apparatus, computer equipment, and a storage medium.

In a first aspect, embodiments of the present invention provide a motion recognition method. The motion recognition method includes: acquiring a first image; recognizing a target image region that contains a target object in the first image; performing motion detection processing on the target image region using a motion detection network having a plurality of motion detection branches, to obtain a plurality of types of first motion detection results corresponding to the target object, wherein different motion detection branches detect different categories of motion; and determining a second motion detection result for the target object based on the first motion detection results respectively corresponding to the plurality of motion detection branches.

In this way, a student's motions are recognized using a motion detection network having a plurality of motion detection branches, where different branches can detect different categories of motion. A single detection pass therefore yields a detection result for each of the various motions the student performs, so that the student's motions can be recognized comprehensively and accurately.

In one optional embodiment, recognizing the target image region that contains the target object in the first image includes: performing feature extraction processing on the first image to obtain a first feature map of the first image, wherein the first feature map includes feature submaps respectively corresponding to a plurality of feature channels, and different feature submaps contain different features; determining first coordinate information of a center point of the target object in the first feature map based on the features contained in a first feature submap of the plurality of feature submaps, and determining first size information of the target object in the first feature map based on the first coordinate information of the center point in the first feature map and the features contained in a second feature submap of the plurality of feature submaps; and determining the target image region based on the first coordinate information and the first size information.

In this way, the target image region containing the target object can be accurately determined from the first image.

In one optional embodiment, determining the first coordinate information of the center point of the target object in the first feature map based on the features contained in the first feature submap of the plurality of feature submaps includes: performing max pooling on the first feature submap according to a preset pooling size and pooling stride, to obtain a plurality of pooling values and a position index corresponding to each of the pooling values, wherein the position index identifies the position of the pooling value in the first feature submap; determining, from the plurality of pooling values, target pooling values belonging to the center point based on each pooling value and a first threshold; and determining the first coordinate information of the center point in the first feature map based on the position index corresponding to the target pooling value.

In this way, performing max pooling on the first feature submap allows the target pooling values belonging to the center point of the target object to be determined more accurately from the plurality of pooling values, so that the position of the target object in the first image can be determined more accurately.
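As a non-limiting illustration, the max pooling and thresholding steps described above might be sketched as follows. The function names, the absence of padding, the "greater than or equal to" comparison against the first threshold, and the example heat map values are all assumptions made for this sketch; the disclosure does not specify them.

```python
def max_pool_with_indices(fmap, size, stride):
    """Max-pool a 2D map (no padding, an assumption of this sketch).

    Returns one (pooling value, (row, col)) pair per pooling window, where
    (row, col) is the position index of the value in the input map.
    """
    rows, cols = len(fmap), len(fmap[0])
    results = []
    for r0 in range(0, rows - size + 1, stride):
        for c0 in range(0, cols - size + 1, stride):
            best_val, best_pos = fmap[r0][c0], (r0, c0)
            for r in range(r0, r0 + size):
                for c in range(c0, c0 + size):
                    # Strict ">" keeps the first maximum in a window on ties.
                    if fmap[r][c] > best_val:
                        best_val, best_pos = fmap[r][c], (r, c)
            results.append((best_val, best_pos))
    return results


def find_center_points(fmap, size, stride, first_threshold):
    """Keep the position indices whose pooling value reaches the threshold."""
    centers = []
    for value, pos in max_pool_with_indices(fmap, size, stride):
        # Overlapping windows can report the same peak twice; keep it once.
        if value >= first_threshold and pos not in centers:
            centers.append(pos)
    return centers


# Hypothetical first feature submap (a center-point heat map).
heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.3, 0.1],
    [0.0, 0.0, 0.1, 0.8],
]
centers = find_center_points(heatmap, size=2, stride=2, first_threshold=0.5)
print(centers)  # [(1, 1), (3, 3)]
```

Each returned (row, col) pair plays the role of the first coordinate information of one center point in the first feature map.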

In one optional embodiment, determining the target image region based on the first coordinate information and the first size information includes: determining second coordinate information of the center point in the first image and second size information of the target object in the first image, based on the first coordinate information, the first size information, and a position mapping relationship between first feature points in the first feature map and the pixel points of the first image; and determining the target image region based on the second coordinate information and the second size information.
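As a non-limiting illustration, the position mapping might be sketched as a uniform scaling between the feature map and the image. This scaling model, the function name `map_to_image`, and the example sizes are assumptions of the sketch; the disclosure only states that a position mapping relationship exists between feature points and pixel points.

```python
def map_to_image(center_fm, size_fm, fm_shape, img_shape):
    """Map a center point and a size from feature-map units to image pixels.

    center_fm: (x, y) in the first feature map (first coordinate information)
    size_fm:   (w, h) in the first feature map (first size information)
    fm_shape:  (width, height) of the first feature map
    img_shape: (width, height) of the first image
    Assumes the mapping is a pure per-axis scale factor.
    """
    sx = img_shape[0] / fm_shape[0]
    sy = img_shape[1] / fm_shape[1]
    center_img = (center_fm[0] * sx, center_fm[1] * sy)
    size_img = (size_fm[0] * sx, size_fm[1] * sy)
    return center_img, size_img


# Hypothetical sizes: a 160x90 feature map computed from a 640x360 image.
center_img, size_img = map_to_image(
    center_fm=(40, 20), size_fm=(10, 15), fm_shape=(160, 90), img_shape=(640, 360)
)
print(center_img, size_img)  # (160.0, 80.0) (40.0, 60.0)
```

Here `center_img` and `size_img` correspond to the second coordinate information and the second size information, respectively.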

In one optional embodiment, determining the target image region based on the second coordinate information and the second size information includes: determining, from the first image, a first region range containing the target object based on the second coordinate information and the second size information; determining a second region range containing the target object based on the first region range, wherein the second region range is larger than the first region range and encompasses the first region range; and determining the target image region from the first image based on the second region range.

In this way, expanding the first region range to obtain the second region range allows the target object to be contained more completely, so that a more accurate detection result can be obtained when the motion performed by the target object is detected based on the target image region.
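As a non-limiting illustration, the expansion from the first region range to the second region range might be sketched as follows. The expansion ratio, the clamping to the image bounds, and the function names are assumptions of this sketch; the disclosure only requires that the second region range be larger than, and encompass, the first.

```python
def box_from_center(center, size):
    """Build (x1, y1, x2, y2) from a center point and a (width, height) size."""
    cx, cy = center
    w, h = size
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def expand_box(box, ratio, img_w, img_h):
    """Grow a box by `ratio` of its width and height, split evenly per side.

    The result is clamped to the image, so a box already touching a border
    cannot grow past it on that side.
    """
    x1, y1, x2, y2 = box
    dw = (x2 - x1) * ratio / 2
    dh = (y2 - y1) * ratio / 2
    return (
        max(0.0, x1 - dw),
        max(0.0, y1 - dh),
        min(float(img_w), x2 + dw),
        min(float(img_h), y2 + dh),
    )


# Hypothetical values: center (160, 80), size 40x60, in a 640x360 image.
first_region = box_from_center((160, 80), (40, 60))
second_region = expand_box(first_region, ratio=0.5, img_w=640, img_h=360)
print(first_region)   # (140.0, 50.0, 180.0, 110.0)
print(second_region)  # (130.0, 35.0, 190.0, 125.0)
```

The target image region would then be cropped from the first image using `second_region`.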

In one optional embodiment, the motion detection network includes a feature extraction network and a plurality of motion detection branch networks connected to the feature extraction network. Performing motion detection processing on the target image region using the motion detection network having a plurality of motion detection branches, to obtain the plurality of types of first motion detection results corresponding to the target object, includes: performing feature extraction processing on the target image region using the feature extraction network to obtain a second feature map of the target image region; and performing motion detection processing on the second feature map using each of the plurality of motion detection branch networks, to obtain a first motion detection result corresponding to each motion detection branch network.

In this way, performing motion detection processing on the second feature map of the target image region with each of the plurality of motion detection branch networks makes it possible to detect multiple categories of motion in the target image region of each target object, and thus to obtain a more comprehensive motion detection result for each target object.

In one optional embodiment, performing motion detection processing on the second feature map using each of the plurality of motion detection branch networks, to obtain the first motion detection result corresponding to each motion detection branch network, includes, for each of the plurality of motion detection branch networks: performing motion detection processing on the second feature map using the motion detection branch network, to obtain the probability that the target object performs the category of motion detected by that motion detection branch network; and determining the first motion detection result corresponding to that motion detection branch network based on the probability and a predetermined second threshold.

In a second aspect, embodiments of the present invention further provide a motion recognition apparatus. The motion recognition apparatus includes: an acquisition module configured to acquire a first image; a recognition module configured to recognize a target image region that contains a target object in the first image; a detection module configured to perform motion detection processing on the target image region using a motion detection network having a plurality of motion detection branches, to obtain a plurality of types of first motion detection results corresponding to the target object, wherein different motion detection branches detect different categories of motion; and a determination module configured to determine a second motion detection result for the target object based on the first motion detection results respectively corresponding to the plurality of motion detection branches.

In one possible embodiment, the recognition module is configured to: perform feature extraction processing on the first image to obtain a first feature map of the first image, wherein the first feature map includes feature submaps respectively corresponding to a plurality of feature channels, and different feature submaps contain different features; determine first coordinate information of a center point of the target object in the first feature map based on the features contained in a first feature submap of the plurality of feature submaps, and determine first size information of the target object in the first feature map based on the first coordinate information of the center point in the first feature map and the features contained in a second feature submap of the plurality of feature submaps; and determine the target image region based on the first coordinate information and the first size information.

In one possible embodiment, the recognition module is configured to: perform max pooling on the first feature submap according to a preset pooling size and pooling stride, to obtain a plurality of pooling values and a position index corresponding to each of the pooling values, wherein the position index identifies the position of the pooling value in the first feature submap; determine, from the plurality of pooling values, target pooling values belonging to the center point based on each pooling value and a first threshold; and determine the first coordinate information of the center point in the first feature map based on the position index corresponding to the target pooling value.

In one possible embodiment, the recognition module is configured to: determine second coordinate information of the center point in the first image and second size information of the target object in the first image, based on the first coordinate information, the first size information, and a position mapping relationship between first feature points in the first feature map and the pixel points of the first image; and determine the target image region based on the second coordinate information and the second size information.

In one possible embodiment, the recognition module is configured to: determine, from the first image, a first region range containing the target object based on the second coordinate information and the second size information; determine a second region range containing the target object based on the first region range, wherein the second region range encompasses the first region range; and determine the target image region from the first image based on the second region range.

In one possible embodiment, the motion detection network includes a feature extraction network and a plurality of motion detection branch networks connected to the feature extraction network, and the detection module is configured to: perform feature extraction processing on the target image region using the feature extraction network to obtain a second feature map of the target image region; and perform motion detection processing on the second feature map using each of the plurality of motion detection branch networks, to obtain a first motion detection result corresponding to each motion detection branch network.

In one possible embodiment, the detection module is configured to, for each of the plurality of motion detection branch networks: perform motion detection processing on the second feature map using the motion detection branch network, to obtain the probability that the target object performs the category of motion detected by that motion detection branch network; and determine the first motion detection result corresponding to that motion detection branch network based on the probability and a predetermined second threshold.

In a third aspect, optional embodiments of the present invention further provide computer equipment including a processor and a memory. The memory stores machine-readable instructions executable by the processor, and the processor is configured to execute the machine-readable instructions stored in the memory. When executed by the processor, the machine-readable instructions perform the steps of the first aspect or of any possible embodiment of the first aspect.

In a fourth aspect, optional embodiments of the present invention further provide a computer-readable storage medium storing a computer program which, when executed, performs the steps of the first aspect or of any possible embodiment of the first aspect.

In a fifth aspect, optional embodiments of the present invention further provide a computer program that causes a computer to perform the steps of the first aspect or of any possible embodiment of the first aspect.

To make the above objects, features, and advantages of the present invention easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.

To describe the technical solutions of the embodiments of the present invention more clearly, the drawings required for the embodiments are briefly introduced below. The drawings are incorporated into and constitute a part of this specification; they show embodiments consistent with the present invention and, together with the specification, serve to explain its technical solutions. The following drawings show only some embodiments of the present invention and therefore should not be regarded as limiting its scope; those skilled in the art can obtain other related drawings from them without creative effort.
The drawings show, in order: a flowchart of the motion recognition method provided by an embodiment of the present invention; a flowchart of a specific method, provided by an embodiment of the present invention, for recognizing a target image region that contains a target object in a first image; a schematic diagram of the structure of the motion recognition network provided by an embodiment of the present invention; a schematic diagram of the motion recognition apparatus provided by an embodiment of the present invention; and a schematic diagram of the computer equipment provided by an embodiment of the present invention.

To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are, of course, only some of the embodiments of the present invention, not all of them. The components of the embodiments described and shown in the drawings can generally be arranged and designed in a variety of different configurations. Accordingly, the following detailed description of the embodiments provided in the drawings is not intended to limit the claimed scope of the present invention, but merely represents selected embodiments. All other embodiments obtained by those skilled in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Research shows that computer-vision-based recognition of student motions in the classroom relies mainly on human body detection, tracking, and motion classification to analyze classroom behavior such as standing up, raising a hand, and lying face down on the desk. When student motions are classified and recognized, the model is generally designed as a multi-class classifier. For example, for the three motions of standing up, raising a hand, and lying face down on the desk, a three-class neural network predicts the probability that the student performs each of the three motions, and the motion with the highest probability is then taken as the motion the student performed. In a real classroom, however, a student may perform several motions at the same time, for example raising a hand while standing up, or raising a hand while lying face down on the desk. Because current detection methods cannot detect multiple motions that a student performs simultaneously, they cannot fully recognize student motions in the classroom.
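As a non-limiting illustration of the limitation described above, the following sketch shows a three-class classifier that selects only the highest-probability motion. The logit values and action names are hypothetical, and the use of softmax is an assumption, being a common choice for multi-class classification; the disclosure does not name the function used by the conventional approach.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


ACTIONS = ["stand_up", "raise_hand", "lie_on_desk"]

# Hypothetical scores for a student who stands up AND raises a hand.
logits = [2.0, 2.1, -3.0]
probs = softmax(logits)

# The multi-class design reports only the single most probable motion.
predicted = ACTIONS[probs.index(max(probs))]
print(predicted)  # raise_hand
```

Because the probabilities compete for a total of 1, the classifier reports "raise_hand" and loses "stand_up", even though the two scores are nearly equal.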

Based on the above research, embodiments of the present invention provide a motion recognition method that recognizes a student's motions using a motion detection network having a plurality of motion detection branches, where different branches can detect different categories of motion. A single detection pass therefore yields a detection result for each of the various motions the student performs, so that the student's motions can be recognized comprehensively and accurately.

The drawbacks of the above solutions were all identified by the inventors through practice and careful study. Accordingly, both the process of discovering the above problem and the solution that the present invention proposes for it below should be regarded as the inventors' contribution to the present invention.

Note that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further defined or explained in subsequent drawings.

To facilitate understanding of this embodiment, the motion recognition method disclosed by the embodiments of the present invention is first described in detail. The method is generally executed by a motion recognition apparatus, which includes, for example, a terminal device, a server, or another processing device. The terminal device may be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, or the like. In some possible embodiments, the motion recognition method may be implemented by a processor invoking computer-readable instructions stored in a memory.

The motion recognition method provided by the embodiments of the present invention is described below, taking the motion recognition apparatus as the executing entity by way of example. Note that the method is not limited to recognizing student motions in a classroom; it is also applicable to other motion detection scenarios in which several motions may be performed at the same time.

FIG. 1 is a flowchart of the motion recognition method provided by an embodiment of the present invention. The method includes:
step S101: acquiring a first image;
step S102: recognizing a target image region that contains a target object in the first image;
step S103: performing motion detection processing on the target image region using a motion detection network having a plurality of motion detection branches, to obtain a plurality of types of first motion detection results corresponding to the target object, wherein different motion detection branches detect different categories of motion; and
step S104: determining a second motion detection result for the target object based on the first motion detection results respectively corresponding to the plurality of motion detection branches.

In the motion recognition method provided by the embodiments of the present invention, the target object includes, for example, any one of a person, an animal, a mechanical device, a vehicle, a robot, and the like.

The motion detection network is, for example, a neural network model. By way of example, training a neural network yields a neural network model that can detect the motions performed by a target object contained in the first image. The neural network model includes a plurality of motion detection branches; a motion detection branch, also called a detection head, is a branch network within the motion detection network. Each detection head obtains the probability that the target object performs one category of motion, and different detection heads detect different categories of motion. The first motion detection result corresponding to each motion detection branch indicates whether the target object performs the category of motion detected by that branch. The second motion detection result indicates whether the target object performs each of the categories of motion detected by the plurality of motion detection branches.
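As a non-limiting illustration, the relationship between detection heads, first motion detection results, and the second motion detection result might be sketched as follows. The toy linear heads with a sigmoid output, the feature vector, the weights, and the threshold value of 0.5 are all hypothetical; in practice each head would be a trained branch network operating on the feature map produced by a shared feature extraction network.

```python
import math


def sigmoid(x):
    """Map a raw score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def make_head(weights, bias):
    """Return a toy detection head: a linear score followed by a sigmoid."""
    def head(features):
        return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
    return head


def detect_actions(features, branches, threshold):
    """Run every head on the same features and combine the results.

    first:  per-branch yes/no results (first motion detection results)
    second: all categories judged present (second motion detection result)
    """
    first = {name: head(features) >= threshold for name, head in branches.items()}
    second = [name for name, present in first.items() if present]
    return first, second


# Hypothetical shared features and one head per motion category.
features = [1.0, 0.0, 1.0]
branches = {
    "stand_up": make_head([4.0, 0.0, 0.0], -2.0),
    "raise_hand": make_head([0.0, 0.0, 4.0], -2.0),
    "lie_on_desk": make_head([0.0, 4.0, 0.0], -2.0),
}
first, second = detect_actions(features, branches, threshold=0.5)
print(first)   # {'stand_up': True, 'raise_hand': True, 'lie_on_desk': False}
print(second)  # ['stand_up', 'raise_hand']
```

Because each head produces its own probability rather than competing for a shared total, two motions performed at the same time can both be reported.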

In an embodiment of the present invention, after the first image is acquired, the target image region of the target object contained in the first image is recognized; motion detection processing is performed on the target image region using a motion detection network having a plurality of motion detection branches to obtain a plurality of types of first motion detection results corresponding to the target object; and a second motion detection result of the target object is determined based on the first motion detection results respectively corresponding to the plurality of motion detection branches. A single detection pass thus yields a detection result for each of the various motions made by a student, so that the student's motions can be recognized comprehensively and accurately.

In the embodiments of the present invention, S101 to S104 above are described in detail, taking the application of the motion recognition method to motion detection of students as an example.

I: In S101 above, the way the first image is acquired differs depending on the application scene.

Illustratively, when the method is applied to a classroom scene, a motion recognition device may be installed in the classroom where the teacher teaches. The motion recognition device is, for example, a terminal device. The motion recognition device may acquire, in real time, the first image of students attending class through a camera installed in the classroom; alternatively, the motion recognition device is itself equipped with a camera and acquires the first image of students attending class through its own camera.

In an online class scene, the motion recognition device is, for example, a teacher-side terminal, a student-side terminal, or a server. When the motion recognition device is the teacher-side terminal, a camera is connected to the student-side terminal; the student-side terminal acquires a first image containing the student through the camera and sends the first image to the teacher-side terminal, which receives it and detects the student's motions based on it. When the motion recognition device is the student-side terminal, a camera is connected to the student-side terminal; the student-side terminal acquires a first image containing the student through that camera, detects the student's motions based on the first image, and then sends the detection result to the teacher-side terminal, so that the teacher can learn of the student's motions in real time through the teacher-side terminal. When the motion recognition device is a server, the server receives the first image sent from the student-side terminal, detects the student's motions based on the first image, and sends the detection result to the teacher-side terminal.

II: In S102 above, the acquired first image contains background information in addition to the target object, and this background information may interfere to some extent with the second motion detection result of the target object. Therefore, the target image region of the target object contained in the first image is detected first, and motion detection of the target object is then performed based on the target image region.

In some optional embodiments, as shown in FIG. 2, an embodiment of the present invention provides a specific method for recognizing the target image region containing the target object in the first image. The method includes the following steps.

In S201, feature extraction processing is performed on the first image to obtain a first feature map of the first image. The first feature map includes feature submaps respectively corresponding to a plurality of feature channels, and different feature submaps contain different features.

In some optional embodiments, a convolutional neural network may be used to perform feature extraction processing on the first image to obtain the first feature map of the first image.

Illustratively, the first feature map is composed of the feature submaps of a plurality of channels; stacking the feature submaps yields the first feature map.

In S202, first coordinate information of the center point of the target object in the first feature map is determined based on the features contained in a first feature submap among the plurality of feature submaps, and first size information of the target object in the first feature map is determined based on the first coordinate information of the center point in the first feature map and the features contained in a second feature submap among the plurality of feature submaps.

Illustratively, among the plurality of feature submaps that make up the first feature map, the features contained in the feature submap of the i-th channel (that is, the first feature submap above) characterize whether each first feature point in the first feature map is the center point of a target object. The first feature submap may be activated with a sigmoid activation function, which converts the feature value, in the first feature submap, of each first feature point in the first feature map into a value between a1 and a2. Illustratively, a1 is, for example, 0 and a2 is, for example, 1.

Here, for a given first feature point, after its feature value in the first feature submap has been converted into a value between 0 and 1, the closer that value is to 1, the higher the probability that the point is the center point of a target object.

Further, based on the values obtained by converting the feature value of each first feature point in the first feature submap into the range from 0 to 1, the first feature point in the first feature map corresponding to the center point of each target object can be determined, and the first coordinate information of the determined first feature point is taken as the first coordinate information of the center point of the target object in the first feature map.
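The activation step can be illustrated with a minimal sketch. The raw values in `raw_center_submap` are invented for illustration, and picking the single highest score assumes that the image contains exactly one target; the text above does not prescribe how the center point is selected from the activated values.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw feature value to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical raw values of the first feature submap (3 rows x 4 columns).
raw_center_submap = [
    [-4.0, -3.0, -2.0, -4.0],
    [-3.0, 2.5, -1.0, -3.0],
    [-4.0, -2.0, -3.0, -4.0],
]

# Activate every first feature point.
center_scores = [[sigmoid(v) for v in row] for row in raw_center_submap]

# Assumption: a single target, so the highest score marks its center point.
positions = [(r, c) for r in range(3) for c in range(4)]
center = max(positions, key=lambda rc: center_scores[rc[0]][rc[1]])
print(center)  # (row, column) of the most likely center point
```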

In another possible embodiment, in the actual prediction process, after the feature value in the first feature submap of each first feature point in the first feature map is converted into a value between 0 and 1, first feature points located close to one another may also have values that are close to one another. So that a unique center point can be determined for each target object, embodiments of the present invention may further adopt the following method to determine the first coordinate information of the center point of the target object in the first feature map.

Max pooling is performed on the first feature submap according to a preset pooling size and pooling stride to obtain a plurality of pooling values and a position index corresponding to each pooling value, where the position index identifies the position of the pooling value in the first feature submap. Based on each pooling value and a first threshold, the target pooling values belonging to a center point are determined from the plurality of pooling values. Based on the position index corresponding to each target pooling value, the first coordinate information of the center point in the first feature map is determined.

Illustratively, 3×3 max pooling with a stride of 1 may be performed on the first feature submap. During pooling, for the feature values of each group of 3×3 first feature points in the first feature submap, the maximum response value among those 3×3 first feature points and the position index of that maximum response value in the first feature map are determined. The number of maximum response values depends on the size of the first feature map. For example, if the size of the first feature map is 80×60×3, max pooling of the first feature submap yields 80×60 maximum response values in total, and for each maximum response value there may be at least one other maximum response value with the same position index. Maximum response values with the same position index are then merged to obtain M maximum response values and the position index corresponding to each of them. Each of the M maximum response values is compared with the first threshold, and a maximum response value greater than the first threshold is determined to be a target pooling value. The position index corresponding to a target pooling value is the first coordinate information of the center point of a target object in the first feature map.
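The pooling, merging, and thresholding steps can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function names, the sample values, and the handling of border windows (out-of-range positions are simply ignored so that one pooling value is produced per feature point) are assumptions.

```python
def max_pool_with_index(submap, size=3):
    """Stride-1 max pooling: one (max value, (row, col) of the max) per window."""
    h, w = len(submap), len(submap[0])
    half = size // 2
    results = []
    for r in range(h):
        for c in range(w):
            # Collect the window around (r, c); border windows are cropped.
            window = [
                (submap[rr][cc], (rr, cc))
                for rr in range(max(0, r - half), min(h, r + half + 1))
                for cc in range(max(0, c - half), min(w, c + half + 1))
            ]
            results.append(max(window))
    return results

def find_centers(submap, first_threshold, size=3):
    pooled = max_pool_with_index(submap, size)
    # Merge pooling values that share the same position index.
    merged = {index: value for value, index in pooled}
    # Keep only the positions whose pooling value exceeds the first threshold.
    return sorted(i for i, v in merged.items() if v > first_threshold)

# Hypothetical activated first feature submap (4 rows x 5 columns).
scores = [
    [0.1, 0.2, 0.1, 0.1, 0.1],
    [0.2, 0.9, 0.3, 0.1, 0.2],
    [0.1, 0.3, 0.2, 0.4, 0.8],
    [0.1, 0.1, 0.1, 0.3, 0.4],
]
print(find_centers(scores, first_threshold=0.5))
```

With these sample values the 20 windows collapse to six distinct position indexes, of which only the two strong peaks pass the threshold.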

Here, the first feature submap may first be activated, converting the feature value in the first feature submap of each first feature point in the first feature map into a value between 0 and 1, and then max-pooled; alternatively, max pooling may be performed directly on the first feature submap.

When max pooling is performed directly on the first feature submap, each pooling value may be activated with an activation function after pooling, converting it into a value between 0 and 1. The target pooling values belonging to the center point of a target object are then determined from the plurality of pooling values based on the converted pooling values and the first threshold.

Alternatively, when max pooling is performed directly on the first feature submap, the target pooling values belonging to the center point of a target object may be determined from the plurality of pooling values directly, based on the unactivated pooling values and the first threshold. In this case, the value of the first threshold differs from the value used in the examples above that involve activation. The choice among these options can be made according to actual needs.
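Why the threshold value must change when activation is skipped can be seen from the monotonicity of sigmoid: comparing an activated value with a threshold t selects the same points as comparing the raw value with ln(t / (1 - t)). The sketch below checks this on invented pooling values; the threshold 0.7 is likewise only an example.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def logit(p: float) -> float:
    """Inverse of sigmoid: the raw value whose activation equals p."""
    return math.log(p / (1.0 - p))

# Hypothetical pooling values taken directly from the unactivated submap.
raw_pooling_values = [-2.0, 0.3, 1.2, 3.0]

prob_threshold = 0.7                   # first threshold used after activation
raw_threshold = logit(prob_threshold)  # equivalent first threshold without activation

kept_after_activation = [v for v in raw_pooling_values if sigmoid(v) > prob_threshold]
kept_without_activation = [v for v in raw_pooling_values if v > raw_threshold]
print(kept_after_activation, kept_without_activation)
```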

In another example, among the plurality of feature submaps that make up the first feature map, the features contained in the feature submaps of the j-th and k-th channels (that is, the second feature submap above) characterize the first size information, in the first feature map, of the target object in the first image.

Illustratively, the feature value, in the feature submap of the j-th channel, of each first feature point in the first feature map characterizes the length value in the first size information corresponding to that first feature point, and the feature value of each first feature point in the feature submap of the k-th channel characterizes the width value in the first size information corresponding to that first feature point.

For example, when there are three feature submaps, i is, for example, 0, j is, for example, 1, and k is, for example, 2. The specific values of i, j and k are set according to the actual neural network processing.

After the first coordinate information of the center point in the first feature map is obtained, the feature value, in the second feature submap, of the first feature point characterizing the center point is read from the second feature submap based on the first coordinate information, and the read feature value is taken as the first size information of the target object in the first feature map.
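Reading the size at the center point amounts to indexing channels j and k at the center coordinates. The sketch below assumes a channel-first nested list layout (`feature_map[channel][row][col]`) and the example channel assignment i = 0, j = 1, k = 2; the numeric values are invented.

```python
def read_size(feature_map, center, j=1, k=2):
    """Return (length, width) stored at the center point in channels j and k."""
    row, col = center
    length = feature_map[j][row][col]
    width = feature_map[k][row][col]
    return length, width

# Hypothetical 3-channel first feature map of spatial size 2 x 2.
feature_map = [
    [[0.1, 0.9], [0.2, 0.1]],   # channel i = 0: center point scores
    [[3.0, 6.0], [2.0, 1.0]],   # channel j = 1: length values
    [[4.0, 10.0], [5.0, 2.0]],  # channel k = 2: width values
]

size = read_size(feature_map, center=(0, 1))
print(size)
```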

In S203, the target image region is determined based on the first coordinate information and the first size information.

In some optional embodiments, after feature extraction processing is performed on the first image to obtain the first feature map of the first image, embodiments of the present invention may further determine the target image region as follows. A position mapping relationship between each pixel in the first image and the first feature points in the first feature map is generated. Based on the first coordinate information, the first size information, and the position mapping relationship, the second coordinate information of the center point in the first image is determined (from the position mapping relationship and the first coordinate information), and the second size information of the target object in the first image is determined (from the first size information). The target image region is then determined based on the second coordinate information of the center point in the first image and the second size information of the target object in the first image.
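The text does not specify the form of the position mapping relationship. As a minimal sketch, assume it is a uniform scaling by a hypothetical downsampling factor `stride` (for instance, a 320×240 image mapped to an 80×60 feature map gives a stride of 4), and assume the size information is a horizontal and a vertical extent. The region is then the axis-aligned box around the mapped center.

```python
def to_image_box(center_xy, size_wh, stride):
    """Map a center point and size from feature-map units to an image-space box."""
    cx, cy = center_xy
    w, h = size_wh
    # Second coordinate information: center point in the first image.
    cx_img, cy_img = cx * stride, cy * stride
    # Second size information: target size in the first image.
    w_img, h_img = w * stride, h * stride
    # Box as (left, top, right, bottom).
    return (
        cx_img - w_img / 2,
        cy_img - h_img / 2,
        cx_img + w_img / 2,
        cy_img + h_img / 2,
    )

box = to_image_box(center_xy=(20, 15), size_wh=(6, 10), stride=4)
print(box)
```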

When the target image region is determined based on the second coordinate information and the second size information, in one possible embodiment the target image region is determined directly from them. In another possible embodiment, a first region range containing the target object is determined from the first image based on the second coordinate information and the second size information; a second region range containing the target object is determined from the first image based on the first region range, where the second region range is larger than and encompasses the first region range; and the target image region is determined from the first image based on the second region range.

Illustratively, in an embodiment of the present invention, based on the center point and the four vertices of the first region range, each vertex is moved along the direction from the center point to that vertex (that is, each vertex moves away from the center point and from the other vertices), so that the region range is expanded from the first region range. The positions of the vertices after the move are the positions of the four vertices of the second region range, from which the second region range is obtained. The moving distances of the vertices may be the same or different; that is, when the region range is expanded from the first region range, the expansion width or expansion size of the region around each vertex may be the same or different, which is not limited here.
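One simple way to realize this expansion is to scale the half-width and half-height of the first region range about its center, which moves every vertex the same distance along the center-to-vertex direction. The scale factor and the clipping to the image bounds in the sketch below are assumptions added for illustration; the text allows unequal moving distances as well.

```python
def expand_region(box, scale, image_w, image_h):
    """Expand (left, top, right, bottom) about its center, clipped to the image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) / 2 * scale
    half_h = (y2 - y1) / 2 * scale
    return (
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(float(image_w), cx + half_w),
        min(float(image_h), cy + half_h),
    )

first_region = (68.0, 40.0, 92.0, 80.0)
second_region = expand_region(first_region, scale=1.5, image_w=320, image_h=240)
print(second_region)
```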

III: In S103 above, the motion detection network includes, for example, a feature extraction network and a plurality of motion detection branch networks connected to the feature extraction network. Each motion detection branch network corresponds to one motion detection branch, and different motion detection branch networks detect different categories of motion.

When motion detection processing is performed on the target image region using a motion detection network having a plurality of motion detection branches to obtain a plurality of types of first motion detection results corresponding to the target object, some optional embodiments may adopt, for example, the following process.

The feature extraction network is used to perform feature extraction processing on the target image region to obtain a second feature map of the target image region, and the plurality of motion detection branch networks are each used to perform motion detection processing on the second feature map to obtain the first motion detection result corresponding to each motion detection branch network.

In some optional embodiments, performing motion detection processing on the second feature map with each of the plurality of motion detection branch networks to obtain the corresponding first motion detection results may adopt, for example, the following process. For each of the plurality of motion detection branch networks, the motion detection branch network performs motion detection processing on the second feature map to obtain the probability that the target object performs the category of motion detected by that branch network, and the first motion detection result corresponding to that branch network is determined based on the probability and a predetermined second threshold.

Illustratively, as shown in FIG. 3, an embodiment of the present invention provides an example of a specific configuration of the motion detection network. When the motion recognition method provided by the embodiments of the present invention is applied to recognizing the motions of students in a classroom, there are three motion detection branch networks, A, B and C: motion detection branch network A detects the motion category of standing up, motion detection branch network B detects the motion category of raising a hand, and motion detection branch network C detects the motion category of lying face down on the desk. After the first image is acquired and the target image region of each student in the first image is determined, feature extraction network M performs feature extraction processing on the target image region corresponding to each student to obtain the second feature map corresponding to that student. Motion detection branch network A performs motion detection processing on the second feature map to obtain the probability that the student stands up, and the first motion detection result for standing up is determined based on that probability and the corresponding second threshold. For example, if the probability of standing up obtained by motion detection branch network A is greater than the corresponding second threshold, it is determined that the student has stood up.

Similarly, motion detection branch network B performs motion detection processing on the second feature map to obtain the probability that the student raises a hand, and the first motion detection result for raising a hand is determined based on that probability and the corresponding second threshold. Motion detection branch network C performs motion detection processing on the second feature map to obtain the probability that the student lies face down on the desk, and the first motion detection result for lying face down on the desk is determined based on that probability and the corresponding second threshold.

Finally, the student's final second motion detection result is determined based on the first motion detection results respectively corresponding to standing up, raising a hand, and lying face down on the desk.

For example, if the first motion detection result obtained by motion detection branch network A is that the student has not stood up, the first motion detection result obtained by motion detection branch network B is that the student has raised a hand, and the first motion detection result obtained by motion detection branch network C is that the student is not lying face down on the desk, then the corresponding second motion detection result is that the student has not stood up, has raised a hand, and is not lying face down on the desk.
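The per-branch decision and its combination can be sketched as follows. The probabilities, the threshold values, and the dictionary keys are invented for illustration; the second motion detection result is represented here simply as the collection of the three first motion detection results.

```python
def first_results(probabilities, second_thresholds):
    """One boolean per branch: True if the probability exceeds that branch's threshold."""
    return {
        name: probabilities[name] > second_thresholds[name]
        for name in probabilities
    }

# Hypothetical outputs of branch networks A, B and C for one student.
probabilities = {"stand_up": 0.08, "raise_hand": 0.91, "lie_on_desk": 0.12}
# Hypothetical per-branch second thresholds (they need not be equal).
second_thresholds = {"stand_up": 0.5, "raise_hand": 0.6, "lie_on_desk": 0.5}

second_result = first_results(probabilities, second_thresholds)
print(second_result)
```

Because each branch makes an independent yes/no decision, several motions can be reported for the same student at once.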

Note that the second thresholds corresponding to different motion detection branch networks may be the same or different and may be set according to actual needs.

Illustratively, the target image region input to the feature extraction network has an image size of 112×112. The feature extraction network downsamples the target image region four times to obtain a second feature map of size 7×7; the downsampling may be performed, for example, by applying four successive convolution operations with a stride of 2 to the target image region. The 7×7 second feature map is then input to each of the motion detection branch networks. Each motion detection branch network first performs convolution processing on the second feature map, then performs average pooling on the convolution result to obtain one-dimensional data, and finally activates the one-dimensional data with sigmoid to obtain the probability corresponding to that branch network.
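The size arithmetic is 112 → 56 → 28 → 14 → 7, since each stride-2 convolution halves the spatial size. The sketch below checks this and illustrates the tail of a branch network under two assumptions that the text does not state: the convolutions use padding that preserves the halving, and the branch's convolution outputs a single channel, so that average pooling yields one value.

```python
import math

def downsampled_size(size, stride=2, times=4):
    """Spatial size after repeated strided convolutions (assumes 'same'-style padding)."""
    for _ in range(times):
        size = math.ceil(size / stride)
    return size

def branch_probability(conv_output):
    """Average-pool a single-channel 2D map to one value, then apply sigmoid."""
    values = [v for row in conv_output for v in row]
    mean = sum(values) / len(values)
    return 1.0 / (1.0 + math.exp(-mean))

print(downsampled_size(112))  # spatial size of the second feature map

# Hypothetical 7 x 7 convolution output of one branch network.
conv_output = [[0.0] * 7 for _ in range(7)]
print(branch_probability(conv_output))
```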

In the related art, before a neural network is used to detect the motion of a target object contained in an image, the neural network is generally trained with sample images from a plurality of image acquisition devices. Sample images from different image acquisition devices contain different image features because of differences in shooting parameters, and the neural network can learn these different features during training, which strengthens its generalization ability when it is used to detect the motion of a target object contained in an image. When such a neural network performs motion detection processing on an image, it outputs the probability that the target object in the image performs a certain motion; this probability is compared with a preset probability threshold, and the motion detection result corresponding to the target object is determined based on the comparison result. However, because different cameras have different image acquisition parameters, the images they acquire differ in quality, and images of different quality also differ in the features they contain. As a result, applying the same probability threshold as a uniform criterion to images acquired by different cameras does not necessarily give the best motion detection results; some images are misjudged, and detection accuracy decreases.

To solve the above problem, embodiments of the present invention further provide a specific method for determining the second threshold. The method includes: using the motion detection network to classify each of a plurality of second images related to the first image to obtain a classification prediction probability for each second image; and determining the second threshold based on the classification prediction probabilities respectively corresponding to the plurality of second images and the pre-labeled actual classification results respectively corresponding to the plurality of second images.

Here, the first image and the second image being related includes at least one of the following:
the similarity between the shooting parameters of the first image and the second image is greater than a preset similarity threshold;
the first image and the plurality of second images are acquired by the same image acquisition device.

In this way, the second threshold is obtained based on the classification results of a plurality of second images related to the first image. Because the first image and the second images are related, when the above first classification threshold is used as one of the criteria while the first image is being classified, the motion detection result of the second image can be obtained with higher accuracy, which improves the accuracy of the classification result.

The first image and the second image being associated includes, for example, at least one of the following.

(1) The similarity between the imaging parameters of the first image and the second image is greater than a preset similarity threshold.

For example, the imaging parameters of each image can be expressed as a parameter vector, and the similarity between the imaging parameters of different images can be characterized by the vector distance between their parameter vectors. When the vector distance between the parameter vectors of two images is smaller than a preset distance threshold, the similarity of their imaging parameters is regarded as greater than the preset similarity threshold.
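The distance test described above can be sketched as follows. This is only an illustration: the choice of Euclidean distance, the function name `imaging_params_similar`, and the idea that the parameters have already been normalized to comparable scales are all assumptions, since the text does not specify a distance measure.

```python
import math


def imaging_params_similar(params_a, params_b, distance_threshold):
    """Return True when two parameter vectors are closer than the threshold.

    A small vector distance stands in for "similarity greater than the
    preset similarity threshold". Euclidean distance is an assumption.
    """
    if len(params_a) != len(params_b):
        raise ValueError("parameter vectors must have the same length")
    distance = math.dist(params_a, params_b)  # Euclidean distance
    return distance < distance_threshold
```

For instance, two hypothetical normalized vectors `[0.0, 0.0]` and `[3.0, 4.0]` are a distance of 5 apart, so they count as similar under a distance threshold of 6 but not under a threshold of 5.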

(2) The first image and the plurality of second images are acquired by the same image acquisition device.

Before an image acquisition device is put into use, the first probability threshold of that device can be determined through steps S101 to S102 above; once the device is in use, the acquired second images are classified based on the determined first probability threshold.

The second threshold needs to be determined such that, when the classification results of the second images are judged against it, the accuracy of the judgments reaches a preset accuracy threshold.

By way of example, the second threshold may be determined as follows.

A plurality of candidate thresholds is determined. For each candidate threshold, the corresponding prediction accuracy rate is determined based on the classification prediction probabilities and the actual classification results of the plurality of second images. The second threshold is then selected from the candidate thresholds based on their prediction accuracy rates.

By way of example, the plurality of candidate thresholds can be determined within the value range of the second threshold based on that value range and a preset stride.

By way of example, after the motion detection network classifies a second image, the classification output can be passed through, for example, a sigmoid activation function so that its values fall within the interval from 0 to 1; the activated output then characterizes the classification prediction probability of the second image. Accordingly, the value range of the second threshold is [0, 1]. With a stride of 0.05, the candidate thresholds may be 0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, and 1.
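A minimal sketch of generating candidate thresholds from a value range and a stride is given below. The function name `candidate_thresholds` and the rounding to ten decimal places (used only to hide floating-point noise such as 0.15000000000000002) are illustrative assumptions.

```python
import math


def candidate_thresholds(low=0.0, high=1.0, stride=0.05):
    """List the candidate thresholds low, low + stride, ..., up to high."""
    if stride <= 0 or high < low:
        raise ValueError("stride must be positive and high must not be below low")
    # The small epsilon keeps the last step when the division is inexact.
    steps = math.floor((high - low) / stride + 1e-9)
    return [round(low + k * stride, 10) for k in range(steps + 1)]
```

With the defaults this yields the 21 values from 0 to 1 listed above; with a stride of 0.001 it yields 1001 values.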

Note that the value range of the second threshold can be determined according to the actual situation, and the stride can likewise be chosen as needed, for example 0.01 or 0.02; this embodiment places no limitation on it.

By way of example, the motion detection network classifies a second image as either showing the target object performing a given action or showing the target object not performing that action. Suppose that, after the network classifies a plurality of second images, the classification prediction probability of the n-th second image is score_n. Suppose further that the value range of the second threshold is [0, 1] and the stride is 0.001. The possible values of the second threshold, thrd = 0 + 0.001 × k with k ∈ [0, 1000], are traversed with this stride. For the p-th traversal, the candidate threshold is thrd_p = 0 + 0.001 × p. Under this candidate threshold, if score_n is greater than thrd_p, the predicted classification result of that second image is that the corresponding action was performed; otherwise, the predicted classification result is that the corresponding action was not performed.

Next, based on the predicted classification results of the n second images and their actual classification results, the following counts are obtained:
TP: the number of second images in which the action was actually performed and which are predicted as performing the action under the candidate threshold thrd_p;
TN: the number of second images in which the action was actually performed and which are predicted as not performing the action under the candidate threshold thrd_p;
FP: the number of second images in which the action was not actually performed and which are predicted as performing the action under the candidate threshold thrd_p; and
FN: the number of second images in which the action was not actually performed and which are predicted as not performing the action under the candidate threshold thrd_p.

Next, the prediction accuracy rate F corresponding to the candidate threshold thrd_p is obtained based on formulas (1) to (3) below.

After the prediction accuracy rates of all candidate thresholds are obtained, the candidate threshold with the largest prediction accuracy rate F is determined as the second threshold.
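The traversal described above can be sketched as follows. Formulas (1) to (3) are not reproduced in this text, so the sketch assumes that F is the usual F-score built from precision and recall; that choice, and all function names, are assumptions rather than the method stated in the source. The variable names follow the standard confusion-matrix convention, in which an action that was performed but missed is a false negative; this differs from the TN and FN labels used in the list above.

```python
def confusion_counts(scores, labels, threshold):
    """Count (tp, fp, fn, tn) for one candidate threshold.

    scores: classification prediction probabilities, one per second image.
    labels: True if the action was actually performed in that image.
    """
    tp = fp = fn = tn = 0
    for score, performed in zip(scores, labels):
        predicted = score > threshold  # "greater than" as in the text
        if performed and predicted:
            tp += 1
        elif performed:
            fn += 1  # performed, but predicted as not performed
        elif predicted:
            fp += 1  # not performed, but predicted as performed
        else:
            tn += 1
    return tp, fp, fn, tn


def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall (assumed form of F)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_threshold(scores, labels, candidates):
    """Return (best_threshold, best_f); ties keep the earliest candidate."""
    best_threshold, best_f = None, -1.0
    for threshold in candidates:
        tp, fp, fn, _ = confusion_counts(scores, labels, threshold)
        f = f_score(tp, fp, fn)
        if f > best_f:
            best_threshold, best_f = threshold, f
    return best_threshold, best_f
```

With the made-up scores `[0.9, 0.8, 0.3, 0.2]`, labels `[True, True, False, False]`, and candidates `[0.0, 0.25, 0.5, 0.75, 1.0]`, the sweep selects 0.5, the first candidate that separates the two groups perfectly.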

In another embodiment of the present invention, in order to determine the second threshold more accurately, the value range of the second threshold may first be divided into a plurality of value intervals. For each value interval, the corresponding prediction accuracy rate is determined based on the classification prediction probabilities and the actual classification results of the plurality of second images. A target value interval is then selected from the value intervals based on their prediction accuracy rates, a plurality of candidate thresholds is determined within the target value interval, and the prediction accuracy rate of each candidate threshold is determined according to the process above. This reduces the amount of computation needed to determine the second threshold, saving computing resources and computing time.
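One way to picture this two-stage search is the sketch below. The text does not say how a value interval is scored, so scoring each interval at its midpoint is an assumption, as is the use of plain accuracy as a stand-in for the prediction accuracy rate; the function names are hypothetical.

```python
def accuracy_at(scores, labels, threshold):
    """Fraction of images whose thresholded prediction matches the label."""
    correct = sum((s > threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(scores)


def coarse_to_fine_threshold(scores, labels, low=0.0, high=1.0,
                             num_intervals=10, fine_steps=10):
    """Pick a target interval first, then search candidates inside it."""
    width = (high - low) / num_intervals
    intervals = [(low + i * width, low + (i + 1) * width)
                 for i in range(num_intervals)]
    # Coarse pass: score each interval by its midpoint (an assumption).
    target = max(intervals,
                 key=lambda iv: accuracy_at(scores, labels, (iv[0] + iv[1]) / 2))
    # Fine pass: evenly spaced candidates inside the target interval only.
    step = (target[1] - target[0]) / fine_steps
    candidates = [target[0] + k * step for k in range(fine_steps + 1)]
    return max(candidates, key=lambda t: accuracy_at(scores, labels, t))
```

With the defaults this evaluates 10 interval midpoints plus 11 fine candidates, instead of the 101 evaluations a flat sweep at the same 0.01 resolution would need. The trade-off is that a narrow optimum lying between coarse midpoints can be missed.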

Embodiments of the present invention may also determine the second threshold by a stepwise approximation method.

Those skilled in the art will understand that, in the specific embodiments of the above method, the order in which the steps are described does not imply a strict execution order and places no limitation on the implementation process; the specific execution order of the steps should be determined by their functions and possible internal logic.

Based on the same inventive concept, embodiments of the present invention further provide a motion recognition device corresponding to the motion recognition method. Because the device solves the problem on the same principle as the motion recognition method described above, its implementation can refer to the implementation of the method, and repeated description is omitted.

FIG. 4 is a schematic diagram of a motion recognition device provided by an embodiment of the present invention. The device includes an acquisition module 41, a recognition module 42, a detection module 43, and a determination module 44, wherein:
the acquisition module 41 is configured to acquire a first image;
the recognition module 42 is configured to recognize a target image region containing a target object in the first image;
the detection module 43 is configured to perform motion detection processing on the target image region using a motion detection network having a plurality of motion detection branches to obtain a plurality of types of first motion detection results corresponding to the target object, wherein different motion detection branches detect different categories of motion; and
the determination module 44 is configured to determine a second motion detection result of the target object based on the first motion detection results respectively corresponding to the plurality of motion detection branches.

In one possible embodiment, the recognition module 42 is configured to: perform feature extraction processing on the first image to obtain a first feature map of the first image, the first feature map including feature submaps respectively corresponding to a plurality of feature channels, with different feature submaps containing different features; determine first coordinate information of the center point of the target object in the first feature map based on the features contained in a first feature submap of the plurality of feature submaps; determine first size information of the target object in the first feature map based on the first coordinate information of the center point and the features contained in a second feature submap of the plurality of feature submaps; and determine the target image region based on the first coordinate information and the first size information.

In one possible embodiment, the recognition module 42 is configured to: perform max pooling on the first feature submap according to a preset pooling size and pooling stride to obtain a plurality of pooling values and a position index corresponding to each pooling value, the position index identifying the position of that pooling value in the first feature submap; determine, from the plurality of pooling values, a target pooling value belonging to the center point based on each pooling value and a first threshold; and determine the first coordinate information of the center point in the first feature map based on the position index corresponding to the target pooling value.
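The pooling step above can be illustrated with a small pure-Python sketch over a 2D list. Treating a pooling value as a center point when it is strictly greater than the first threshold, and removing duplicate positions when pooling windows overlap, are assumptions; the function names are hypothetical.

```python
def max_pool_with_indices(feature, pool_size, stride):
    """Return a list of (pooling value, (row, col)) for each pooling window."""
    height, width = len(feature), len(feature[0])
    pooled = []
    for top in range(0, height - pool_size + 1, stride):
        for left in range(0, width - pool_size + 1, stride):
            best_value, best_pos = feature[top][left], (top, left)
            for r in range(top, top + pool_size):
                for c in range(left, left + pool_size):
                    if feature[r][c] > best_value:
                        best_value, best_pos = feature[r][c], (r, c)
            pooled.append((best_value, best_pos))
    return pooled


def center_points(feature, pool_size, stride, first_threshold):
    """Positions whose pooling value exceeds the first threshold."""
    centers = []
    for value, pos in max_pool_with_indices(feature, pool_size, stride):
        # Overlapping windows can report the same peak twice; keep it once.
        if value > first_threshold and pos not in centers:
            centers.append(pos)
    return centers
```

On a 4×4 submap with peaks of 0.9 at (1, 1) and 0.8 at (3, 3), a 2×2 pooling window with stride 2 and a first threshold of 0.5 returns exactly those two positions.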

In one possible embodiment, the recognition module 42 is configured to: determine second coordinate information of the center point in the first image and second size information of the target object in the first image based on the first coordinate information, the first size information, and the positional mapping relationship between each first feature point in the first feature map and each pixel point in the first image; and determine the target image region based on the second coordinate information and the second size information.
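As a simple illustration of such a positional mapping, the sketch below assumes the first feature map is a uniformly downsampled version of the first image, so that coordinates and sizes scale by a single factor. The factor `downsample` and the function name are assumptions; the text does not fix the form of the mapping.

```python
def to_image_coords(center_fm, size_fm, downsample):
    """Map a center point and size from feature-map units to image pixels.

    center_fm: (x, y) of the center point in the first feature map.
    size_fm: (width, height) of the target object in the first feature map.
    downsample: assumed ratio of image resolution to feature-map resolution.
    """
    cx, cy = center_fm
    w, h = size_fm
    center_img = (cx * downsample, cy * downsample)
    size_img = (w * downsample, h * downsample)
    return center_img, size_img
```

For example, with a downsampling factor of 4, a center at (10, 6) with size (3, 5) in the feature map corresponds to a center at (40, 24) with size (12, 20) in the image.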

In one possible embodiment, the recognition module 42 is configured to: determine, from the first image, a first region range containing the target object based on the second coordinate information and the second size information; determine a second region range containing the target object based on the first region range, the second region range encompassing the first region range; and determine the target image region from the first image based on the second region range.
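A minimal sketch of deriving the two region ranges is shown below. Enlarging the first region by a fixed factor about its center and clipping both regions to the image bounds are assumptions, as are the parameter `scale` and the function name; the text only requires that the second region range encompass the first.

```python
def region_ranges(center, size, image_size, scale=1.2):
    """Return (first, second) boxes as (x_min, y_min, x_max, y_max).

    The second box is the first box enlarged by `scale` about its center,
    with both boxes clipped to the image so the second contains the first.
    """
    if scale < 1.0:
        raise ValueError("scale must be at least 1 so the region grows")
    cx, cy = center
    w, h = size
    img_w, img_h = image_size

    def clipped_box(half_w, half_h):
        return (max(0, cx - half_w), max(0, cy - half_h),
                min(img_w, cx + half_w), min(img_h, cy + half_h))

    first = clipped_box(w / 2, h / 2)
    second = clipped_box(w * scale / 2, h * scale / 2)
    return first, second
```

The target image region would then be cropped from the first image using the second box, which leaves some context around the target object.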

In one possible embodiment, the motion detection network includes a feature extraction network and a plurality of motion detection branch networks connected to the feature extraction network;
the detection module 43 is configured to: perform feature extraction processing on the target image region using the feature extraction network to obtain a second feature map of the target image region; and perform motion detection processing on the second feature map with each of the plurality of motion detection branch networks to obtain the first motion detection result corresponding to each motion detection branch network.

In one possible embodiment, the detection module 43 is configured to, for each of the plurality of motion detection branch networks: perform motion detection processing on the second feature map using that motion detection branch network to obtain the probability that the target object performs the category of motion detected by that branch network; and determine the first motion detection result corresponding to that branch network based on the probability and a predetermined second threshold.
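The per-branch decision can be sketched as follows, assuming each branch outputs a raw score that is turned into a probability with a sigmoid and compared with that branch's own second threshold. The action names used in the example and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def first_detection_results(branch_outputs, second_thresholds):
    """Map each branch's raw output to a first motion detection result.

    branch_outputs: {action category: raw score from that branch}.
    second_thresholds: {action category: second threshold for that branch}.
    Returns {action category: True if the action is judged as performed}.
    """
    results = {}
    for action, raw_score in branch_outputs.items():
        probability = sigmoid(raw_score)
        results[action] = probability > second_thresholds[action]
    return results
```

For example, raw scores of 2.0 and -1.0 for two hypothetical categories give probabilities of about 0.88 and 0.27, so with thresholds of 0.6 and 0.5 only the first category is reported as performed.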

For the processing flow of each module in the device and the interaction flow between modules, refer to the relevant description in the method embodiments above; details are not repeated here.

In embodiments of the present invention, the acquisition module 41, the recognition module 42, the detection module 43, and the determination module 44 of the motion recognition device can each be implemented in practice by a central processing unit (CPU), a digital signal processor (DSP), a microcontroller unit (MCU), or a field-programmable gate array (FPGA).

Embodiments of the present invention further provide a computer device. FIG. 5 is a schematic structural diagram of a computer device provided by an embodiment of the present invention. The computer device includes a processor 11 and a memory 12. The memory 12 stores machine-readable instructions executable by the processor 11, and when the computer device runs, the processor 11 executes the machine-readable instructions to implement the steps of the motion recognition method described in the embodiments of the present invention.

For the specific execution process of the above instructions, refer to the steps of the motion recognition method described in the embodiments of the present invention; details are not repeated here.

Embodiments of the present invention further provide a computer-readable storage medium storing a computer program that, when executed by a processor, performs the steps of the motion recognition method described in the method embodiments above. The storage medium may be a volatile or non-volatile computer-readable storage medium.

The computer program product of the motion recognition method provided by embodiments of the present invention includes a computer-readable storage medium storing program code. The instructions contained in the program code can be used to perform the steps of the motion recognition method described in the method embodiments above; for details, refer to those embodiments, which are not repeated here.

Embodiments of the present invention further provide a computer program that, when executed by a processor, implements the method of any one of the above embodiments. The computer program product may be implemented in hardware, software, or a combination thereof. In one optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, it is embodied as a software product such as a software development kit (SDK).

The features disclosed in the several method or device embodiments provided by the present invention may be combined arbitrarily, provided they do not conflict, to obtain new method embodiments or device embodiments.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific operating processes of the systems and devices described above can refer to the corresponding processes in the method embodiments and are not repeated here. It should be understood that, in the several embodiments provided by the present invention, the disclosed systems, devices, and methods may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through communication interfaces, devices, or units, and may be electrical, mechanical, or of other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

Finally, it should be noted that the embodiments described above are merely specific embodiments of the present invention, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited to them. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that anyone skilled in the art may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the above embodiments, readily conceive of changes, or make equivalent replacements of some of their technical features. Such modifications, changes, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and all of them shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A motion recognition method comprising:
obtaining a first image;
recognizing a target image region containing a target object in the first image;
performing motion detection processing on the target image region using a motion detection network having a plurality of motion detection branches to obtain a plurality of types of first motion detection results corresponding to the target object, wherein different motion detection branches detect different categories of motion; and
determining a second motion detection result of the target object based on the first motion detection results respectively corresponding to the plurality of motion detection branches.

2. The motion recognition method according to claim 1, wherein recognizing the target image region containing the target object in the first image comprises:
performing feature extraction processing on the first image to obtain a first feature map of the first image, the first feature map including feature submaps respectively corresponding to a plurality of feature channels, with different feature submaps containing different features;
determining first coordinate information of a center point of the target object in the first feature map based on features included in a first feature submap of the plurality of feature submaps;
determining first size information of the target object in the first feature map based on the first coordinate information of the center point and features included in a second feature submap of the plurality of feature submaps; and
determining the target image region based on the first coordinate information and the first size information.

3. The motion recognition method according to claim 2, wherein determining the first coordinate information of the center point of the target object in the first feature map based on the features included in the first feature submap of the plurality of feature submaps comprises:
performing max pooling on the first feature submap according to a preset pooling size and pooling stride to obtain a plurality of pooling values and a position index corresponding to each of the plurality of pooling values, wherein the position index is used to identify the position of the pooling value in the first feature submap;
determining, from the plurality of pooling values, a target pooling value belonging to the center point based on each pooling value and a first threshold; and
determining the first coordinate information of the center point in the first feature map based on the position index corresponding to the target pooling value.

4. The motion recognition method according to claim 2, wherein determining the target image region based on the first coordinate information and the first size information comprises:
determining second coordinate information of the center point in the first image and second size information of the target object in the first image based on the first coordinate information, the first size information, and a positional mapping relationship between each first feature point in the first feature map and each pixel point in the first image; and
determining the target image region based on the second coordinate information and the second size information.

5. The motion recognition method according to claim 4, wherein determining the target image region based on the second coordinate information and the second size information comprises:
determining, from the first image, a first region range containing the target object based on the second coordinate information and the second size information;
determining a second region range containing the target object based on the first region range containing the target object, wherein the second region range encompasses the first region range; and
determining the target image region from the first image based on the second region range.

6. The motion recognition method according to any one of claims 1 to 5, wherein the motion detection network includes a feature extraction network and a plurality of motion detection branch networks connected to the feature extraction network, and
performing motion detection processing on the target image region using the motion detection network having the plurality of motion detection branches to obtain the plurality of types of first motion detection results corresponding to the target object comprises:
performing feature extraction processing on the target image region using the feature extraction network to obtain a second feature map of the target image region; and
performing motion detection processing on the second feature map with each of the plurality of motion detection branch networks to obtain a first motion detection result corresponding to each motion detection branch network.

7. The motion recognition method according to claim 6, wherein performing motion detection processing on the second feature map with each of the plurality of motion detection branch networks to obtain the first motion detection result corresponding to each motion detection branch network comprises:
for each of the plurality of motion detection branch networks, performing motion detection processing on the second feature map using that motion detection branch network to obtain a probability that the target object performs the category of motion detected by that motion detection branch network; and
determining the first motion detection result corresponding to that motion detection branch network based on the probability and a predetermined second threshold.

8. A motion recognition device comprising:
an acquisition module configured to acquire a first image;
a recognition module configured to recognize a target image region containing a target object in the first image;
a detection module configured to perform motion detection processing on the target image region using a motion detection network having a plurality of motion detection branches to obtain a plurality of types of first motion detection results corresponding to the target object, wherein different motion detection branches detect different categories of motion; and
a determination module configured to determine a second motion detection result of the target object based on the first motion detection results respectively corresponding to the plurality of motion detection branches.

9. A computer device comprising a memory and a processor, wherein:
the memory stores machine-readable instructions executable by the processor;
the processor is configured to execute the machine-readable instructions stored in the memory; and
when the machine-readable instructions are executed by the processor, the processor performs the steps of the motion recognition method according to any one of claims 1 to 7.

10. A computer-readable storage medium storing a computer program, wherein the steps of the motion recognition method according to any one of claims 1 to 7 are performed when the computer program is run on a computer device.

11. A computer program that causes a computer to execute the motion recognition method according to any one of claims 1 to 7.