CN113283381A - Human body action detection method suitable for mobile robot platform - Google Patents

Human body action detection method suitable for mobile robot platform

Info

Publication number
CN113283381A
CN113283381A (application CN202110659014.4A)
Authority
CN
China
Prior art keywords
value
order
background environment
target
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110659014.4A
Other languages
Chinese (zh)
Other versions
CN113283381B (en)
Inventor
朱文俊
孙阳
易阳
张梦怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Tech University
Original Assignee
Nanjing Tech University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Tech University filed Critical Nanjing Tech University
Priority to CN202110659014.4A priority Critical patent/CN113283381B/en
Publication of CN113283381A publication Critical patent/CN113283381A/en
Application granted granted Critical
Publication of CN113283381B publication Critical patent/CN113283381B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a human body action detection method suitable for a mobile robot platform, which comprises the following steps: step one, acquire feature quantities: clip the input video and, using a person detector and a backbone network respectively, extract from the key frame N target person features A_1, A_2, …, A_N ∈ R^C and a set of background environment feature maps X ∈ R^(C×H×W), where C denotes channels, H height, and W width; R denotes the real number field. The invention models the high-order interaction relation in the form of the target person-background environment-target person relation (OCOR), infers the indirect relations between multiple target persons and the background environment, and thus localizes actions more accurately and efficiently. The overall design is simple and flexible, makes full use of information from the background environment and other objects, and effectively improves the accuracy of target action detection.

Description

Human body action detection method suitable for mobile robot platform
Technical Field
The invention relates to the technical field of robot application, in particular to a human body action detection method suitable for a mobile robot platform.
Background
As an important branch of video understanding, human action detection technology is being widely applied. At present, mobile robots mostly avoid obstacles passively, relying on lidar, infrared sensing and similar means; once an emergency occurs (for example, a passer-by suddenly appears in the robot's path), the robot brakes abruptly, which greatly shortens the service life of its motors. Meanwhile, in some complex environments, unsafe events such as theft, robbery and people falling down occur from time to time, and relying solely on human judgment of video surveillance suffers from incomplete coverage and low efficiency. To address these problems, a human action perception technology is mounted on the vision platform of the mobile robot, so that the robot can actively avoid obstacles according to human actions and, at the same time, provide a more reliable basis for safety monitoring of the environment.
Video-based human action localization and recognition has long been a challenging high-level task in video understanding. The current state of the art in this field directly models the pairwise interaction between two target objects and then infers their actions. In reality, however, relations between objects are not always pairwise; cues that provide more accurate information often lie in the less obvious interactions between a target and its surroundings, i.e., higher-order relations derived from direct first-order relations. Much prior work on modeling such higher-order interactions adds a pre-trained object detector on top of the original network, which complicates the network structure and limits its use. To solve these problems, the present invention proposes a target person-background environment-target person relation network (OCOR-Net) as its technical core. The network models the high-order interaction relation in the form of the target person-background environment-target person relation (OCOR) and infers the indirect relations between multiple target persons and the background environment, thereby localizing and recognizing actions more accurately and efficiently. Compared with previous approaches, the network only requires the features of the target persons and the background environment as input, the backbone needs no object detector with predefined classes, and the overall design is simpler and more flexible; moreover, information from the background environment and other objects is fully exploited, which effectively improves the accuracy of target action detection.
Disclosure of Invention
The invention aims to overcome the technical defects in the prior art, solve the technical problems and provide a human body motion detection method suitable for a mobile robot platform.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
a human body motion detection method suitable for a mobile robot platform comprises the following steps:
step one, acquire feature quantities: clip the input video and, using a person detector and a backbone network respectively, extract from the key frames N target person features A_1, A_2, …, A_N ∈ R^C and a set of background environment feature maps X ∈ R^(C×H×W), where C denotes channels, H height, and W width; R denotes the real number field;
step two, generate the encoding of the first-order target person-background environment relation: feed the two groups of features acquired in step one into the person-centered relation network OCR-Net to generate the first-order target person-background environment OC relation feature F_i, and generate its encoding through a convolution operation;
step three, infer the high-order relation: feed the first-order target person-background environment OC relation feature F_i into the high-order relation reasoning operator HRRO and, with the support of the target person-background environment feature library OCFB, compute and infer the second-order target person-background environment-target person OCO relation feature F'_i;
step four, detect and recognize the action: after obtaining the final second-order target person-background environment-target person OCO relation feature map F'_i, import F'_i into the action classifier, classify and judge the action of the target person, and output the confidence score of each action class to which it belongs.
In step one, the person detector detects the key frame of the clipped input video and obtains N person objects, generating capture boxes on the key frame; the capture boxes are also copied to the frames adjacent to the key frame. Meanwhile, the backbone network extracts spatio-temporal features from the input video clip and applies average pooling to them, thereby obtaining the background environment feature map X ∈ R^(C×H×W). Maximum spatial pooling is then applied to this background environment feature map, and a region-of-interest (ROI) alignment operation is performed on it together with the N capture boxes obtained before, generating fixed-size region-of-interest candidate boxes and, in turn, the features A_1, A_2, …, A_N ∈ R^C of the N target persons, each of which is a spatio-temporal representation describing an action within a region of interest.
The spatio-temporal features contain pixel information carrying person and object features;
the average pooling operation is as follows: a selection window of 2 × 2 pixels picks four points on the key frame, and the average of their pixel values becomes the resulting pixel value, i.e., the original four pixels are reduced to one pixel whose value is their average; the window traverses the whole picture in this way. The maximum spatial pooling operation is as follows: a selection window of 2 × 2 pixels picks four points on the average-pooled picture, and the largest of the four pixel values becomes the resulting pixel value, i.e., the original four pixels are reduced to one pixel whose value is their maximum; the window traverses the whole picture in this way;
average pooling removes useless information while preserving background information to the greatest extent;
maximum pooling amplifies and extracts the feature texture information.
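To make this feature-acquisition step concrete, the following PyTorch sketch combines temporal average pooling, 2 × 2 maximum spatial pooling and ROI alignment in one plausible order; the tensor shapes, the spatial_scale value and the use of torchvision's roi_align are assumptions made for illustration, not details fixed by the method.

```python
import torch
import torch.nn.functional as nnf
from torchvision.ops import roi_align

# Assumed shapes: a backbone (e.g. I3D) clip feature with C channels,
# T frames and an H x W spatial grid.
C, T, H, W = 832, 8, 16, 22
clip_feat = torch.randn(1, C, T, H, W)

# Average pooling over the temporal axis gives the background
# environment feature map X in R^(C x H x W).
X = clip_feat.mean(dim=2)                                  # (1, C, H, W)

# 2 x 2 maximum spatial pooling, as described above.
X_max = nnf.max_pool2d(X, kernel_size=2)                   # (1, C, H/2, W/2)

# N capture boxes from the person detector on the key frame,
# given as (batch_index, x1, y1, x2, y2) in assumed input-image pixels.
boxes = torch.tensor([[0., 10., 20., 120., 200.],
                      [0., 150., 15., 260., 230.]])

# ROI alignment produces a fixed-size 7 x 7 candidate region per box;
# pooling each region spatially yields one target person feature A_i in R^C.
rois = roi_align(X_max, boxes, output_size=(7, 7), spatial_scale=1.0 / 32)
A = rois.amax(dim=(2, 3))                                  # (N, C) target person features
```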
In step two, the person-centered relation network OCR-Net first copies each target person feature A_1, A_2, …, A_N ∈ R^C and attaches it to every one of the H × W spatial positions of the background environment feature map, forming a series of concatenated relation feature maps, one per target person; the first-order OC relation F_i of each target person i is then generated and encoded by a convolution operation over its concatenated feature map.
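A minimal sketch of this copy-concatenate-convolve step is shown below; the channel sizes and the single 1 × 1 convolution are illustrative assumptions, since the method only specifies that F_i is encoded by a convolution over the concatenated feature map.

```python
import torch
import torch.nn as nn

C, H, W, N = 832, 16, 22, 2
X = torch.randn(1, C, H, W)        # background environment feature map
A = torch.randn(N, C)              # target person features A_1..A_N

# Copy each A_i to every H x W spatial position and concatenate it
# with the background environment features along the channel axis.
A_tiled = A[:, :, None, None].expand(N, C, H, W)            # (N, C, H, W)
X_rep = X.expand(N, C, H, W)                                # (N, C, H, W)
concat = torch.cat([A_tiled, X_rep], dim=1)                 # (N, 2C, H, W)

# A convolution over the concatenated map encodes the first-order
# target person-background environment (OC) relation feature F_i.
oc_conv = nn.Conv2d(2 * C, C, kernel_size=1)
F_oc = oc_conv(concat)                                      # (N, C, H, W)
```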
In step three, the first-order relation between target person A_i ∈ R^C and the background environment at spatial position (x, y) is characterized as F_i(x, y), with i ∈ {1, …, N}, x ∈ [1, H], y ∈ [1, W]; the high-order relation between paired OC relations at the same spatial position is then obtained through learning by the high-order relation reasoning operator HRRO.
The high-order relation between paired OC relations is: two target persons i and j are associated with each other through the same spatial position (x, y); this pairing of F_i(x, y) and F_j(x, y), taken in either order, is recorded and used to evaluate and judge the actions of the two target persons i and j.
The computation of the high-order relation reasoning operator HRRO proceeds as follows:
the set of first-order OC relation feature maps F_1, …, F_N is taken as input and, through two-dimensional convolution operations, the output F'_1, …, F'_N encodes the second-order OCO relations of all target persons;
the two-dimensional convolutions convert F_i into a query value Q_i, a key value K_i and a result value V_i with the same spatial dimensions as F_i; Q_i, K_i and V_i are the three attention terms of the attention weight, and the three attention terms at each spatial position are computed independently;
\tilde{H}_i(x,y) = \sum_{j=1}^{N} \mathrm{softmax}\!\left( \frac{Q_i(x,y) \cdot K_j(x,y)}{\sqrt{d}} \right) V_j(x,y)    (1)
In formula (1), the softmax term is the attention weight generated from the similarity between the query value Q_i(x, y) of target person i and the key value K_j(x, y) of target person j; Q_i(x, y) denotes the query value at position (x, y), K_j(x, y) the key value at (x, y), \tilde{H}_i(x, y) the result at (x, y) obtained before layer normalization and the dropout mechanism are added, and V_j(x, y) the result value at (x, y); d denotes the dimension of the feature map and is set to 512.
Layer normalization and the dropout mechanism are then applied to \tilde{H}_i to obtain H_i, as given by formula (2), which combines the following operations:
ReLU denotes the rectified linear unit, which corrects negative inputs to an output of 0 and passes positive inputs through directly;
Dropout denotes the discarding mechanism used to remove invalid information in the image;
Conv 2D denotes a two-dimensional convolution operation;
Norm denotes a normalization operation, here specifically layer normalization, whose effect is to give the data fed into the same layer the same mean and variance.
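The sketch below shows one way formulas (1) and (2) could be realized, treating HRRO as scaled dot-product attention across the N target persons at each spatial position; the 1 × 1 convolutions producing Q, K and V, the exact composition order of ReLU, Conv2D, Dropout and normalization, and the group-norm stand-in for layer normalization are assumptions made for illustration, not the patented implementation.

```python
import torch
import torch.nn as nn

class HRRO(nn.Module):
    """Sketch of the high-order relation reasoning operator (HRRO)."""

    def __init__(self, channels: int, d: int = 512, p_drop: float = 0.1):
        super().__init__()
        self.q = nn.Conv2d(channels, d, kernel_size=1)   # query values Q_i
        self.k = nn.Conv2d(channels, d, kernel_size=1)   # key values K_i
        self.v = nn.Conv2d(channels, d, kernel_size=1)   # result values V_i
        self.out = nn.Conv2d(d, channels, kernel_size=1)
        self.drop = nn.Dropout(p_drop)
        self.norm = nn.GroupNorm(1, channels)            # stand-in for layer normalization
        self.d = d

    def forward(self, F_oc: torch.Tensor) -> torch.Tensor:
        # F_oc: (N, C, H, W), one first-order OC relation map per target person.
        N, C, H, W = F_oc.shape
        Q = self.q(F_oc).flatten(2)                      # (N, d, H*W)
        K = self.k(F_oc).flatten(2)
        V = self.v(F_oc).flatten(2)

        # Attention over persons j, independently at each spatial position:
        # softmax_j( Q_i(x,y) . K_j(x,y) / sqrt(d) ) * V_j(x,y)   -- formula (1)
        att = torch.einsum('idp,jdp->ijp', Q, K) / self.d ** 0.5  # (N, N, H*W)
        att = att.softmax(dim=1)
        H_tilde = torch.einsum('ijp,jdp->idp', att, V)            # (N, d, H*W)
        H_tilde = H_tilde.reshape(N, self.d, H, W)

        # ReLU, Conv2D, Dropout and normalization, then a residual connection
        # with F_i, yielding the second-order OCO relation feature F'_i.
        H_i = self.norm(self.drop(self.out(torch.relu(H_tilde))))
        return F_oc + H_i                                # F'_i = H_i + F_i

# Illustrative usage with the F_oc map from the OCR-Net sketch:
# F_prime = HRRO(channels=832)(F_oc)
```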
The target person-background environment feature library OCFB is used to store all background environment information at past and future moments.
First, a standalone OCO relation network without any other feature library is pre-trained; this standalone network is then used to extract the first-order OC relation feature F_i of every target person in the video clip and store it in the target person-background environment feature library OCFB. To avoid confusion, these first-order relation features stored in the OCFB are re-denoted L_i.
From a small time window [t-w, t+w] centered on time t, M OC relation features stored in the feature library, L_1, …, L_M, are extracted as the long-term features; the short-term feature is the first-order OC relation feature F_i at time t. w denotes a non-fixed time length chosen so that one frame is taken before and one frame after time t, i.e., 3 frames are taken within [t-w, t+w] in total.
The interaction between the long-term features stored in the target person-background environment feature library OCFB and the short-term feature is computed by formula (3).
The query value Q_i is still computed from the short-term feature F_i, while the key value K_i and the result value V_i are computed from the first-order relation features L_i stored in the OCFB; the specific expression is given by formula (4).
The person detector is Faster R-CNN;
the backbone network is I3D;
a key frame is defined as a frame in the video where a key action of target motion or change occurs;
action categories include watch, talk, stand and walk;
the fixed-size region-of-interest candidate box is generated by uniformly dividing the key frame into 7 × 7 regions.
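As an illustrative final stage, a simple multi-label head over the pooled OCO relation feature F'_i can output one confidence score per action class; the sigmoid head, the hidden width and the pooling to a vector are assumptions, while the four example classes follow the categories listed above.

```python
import torch
import torch.nn as nn

ACTION_CLASSES = ["watch", "talk", "stand", "walk"]   # example categories from above

classifier = nn.Sequential(
    nn.Linear(832, 512),        # 832 = assumed channel width of pooled F'_i
    nn.ReLU(),
    nn.Linear(512, len(ACTION_CLASSES)),
    nn.Sigmoid(),               # independent confidence score per action class
)

F_prime_pooled = torch.randn(2, 832)     # one pooled OCO feature per target person
scores = classifier(F_prime_pooled)      # (N, num_classes) confidence scores
```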
The invention has the beneficial effects that:
the invention provides a human body action detection method suitable for a mobile robot platform. Compared with the prior art, the input of the network only needs the characteristics of a target object and a background environment, the backbone network does not need an object detector with a predefined class, and the whole design is simpler and more flexible; moreover, information of background environment and other objects is fully utilized, and the accuracy of target action detection can be effectively improved.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a diagram of an overall action detection and recognition network framework based on a target person-background environment-target person relationship according to the present invention;
FIG. 3 is a schematic diagram of an object person-background environment-object person relationship network (OCOR-Net) equipped with an object person-background environment feature library (OCFB) according to the present invention;
FIG. 4 is a comparison diagram of attention area division modeled by different relationships in motion detection according to the present invention.
Detailed Description
The following describes a human body motion detection method suitable for a mobile robot platform in detail with reference to the accompanying drawings and specific implementation methods.
As shown in fig. 1 to 3, a human body motion detection method suitable for a mobile robot platform includes the following steps:
step one, acquire feature quantities: clip the input video and, using an existing person detector and a backbone network respectively, extract from the key frames N target person features A_1, A_2, …, A_N ∈ R^C and a set of background environment feature maps X ∈ R^(C×H×W), where C denotes channels, H height, and W width; R denotes the real number field;
step two, generate the encoding of the first-order target person-background environment relation: feed the two groups of features acquired in step one into the person-centered relation network OCR-Net to generate the first-order target person-background environment OC relation feature F_i, and generate its encoding through a convolution operation;
step three, infer the high-order relation: feed the first-order target person-background environment OC relation feature F_i into the high-order relation reasoning operator HRRO and, with the support of the target person-background environment feature library OCFB, compute and infer the second-order target person-background environment-target person OCO relation feature F'_i;
step four, detect and recognize the action: after obtaining the final second-order target person-background environment-target person OCO relation feature map F'_i, import F'_i into the action classifier, classify and judge the action of the target person, and output the confidence score of each action class to which it belongs.
Specifically, in step one, after detecting the key frame of the clipped input video, the person detector obtains N person objects and generates capture boxes on the key frame; the capture boxes are also copied to the frames adjacent to the key frame. Meanwhile, the backbone network extracts spatio-temporal features from the input video clip and applies average pooling to them, thereby obtaining the background environment feature map X ∈ R^(C×H×W). Maximum spatial pooling is then applied to this background environment feature map, and a region-of-interest (ROI) alignment operation is performed on it together with the N capture boxes obtained before, generating fixed-size region-of-interest candidate boxes and, in turn, the features A_1, A_2, …, A_N ∈ R^C of the N target persons, each of which is a spatio-temporal representation describing an action within a region of interest.
Specifically, the human detector is Faster R-CNN;
the backbone network is I3D;
a key frame is defined as a frame in the video where a key action of target motion or change occurs;
action categories include watch, talk, stand and walk;
the fixed-size region-of-interest candidate box is generated by uniformly dividing the key frame into 7 × 7 regions.
Specifically, the spatio-temporal features contain pixel information carrying person and object features;
the average pooling operation is as follows: a selection window of 2 × 2 pixels picks four points on the key frame, and the average of their pixel values becomes the resulting pixel value, i.e., the original four pixels are reduced to one pixel whose value is their average; the window traverses the whole picture in this way. The maximum spatial pooling operation is as follows: a selection window of 2 × 2 pixels picks four points on the average-pooled picture, and the largest of the four pixel values becomes the resulting pixel value, i.e., the original four pixels are reduced to one pixel whose value is their maximum; the window traverses the whole picture in this way;
average pooling removes useless information while preserving background information to the greatest extent;
maximum pooling amplifies and extracts the feature texture information.
Specifically, in step two, the person-centered relation network OCR-Net first copies each target person feature A_1, A_2, …, A_N ∈ R^C and attaches it to every one of the H × W spatial positions of the background environment feature map, forming a series of concatenated relation feature maps, one per target person; the first-order OC relation F_i of each target person i is then generated and encoded by a convolution operation over its concatenated feature map.
Specifically, in step three, the first-order relation between target person A_i ∈ R^C and the background environment at spatial position (x, y) is denoted F_i(x, y), with i ∈ {1, …, N}, x ∈ [1, H], y ∈ [1, W], and the high-order relation between paired OC relations at the same spatial position is then obtained through learning by the high-order relation reasoning operator HRRO.
Since there are a large number of OC relation features F_i(x, y), i ∈ {1, …, N}, x ∈ [1, H], y ∈ [1, W], in a clipped video, the number of possible pairwise combinations is considerable. To make better use of these feature data, a high-order relation reasoning operator (HRRO) is introduced into the network design. The operator can learn the high-order relation between paired OC relations at the same spatial position; for example, two target persons i and j are connected with each other through the same spatial background information (x, y), and the pairing of F_i(x, y) and F_j(x, y), taken in either order, is used to evaluate their actions.
Specifically, the computation of the high-order relation reasoning operator HRRO proceeds as follows:
the set of first-order OC relation feature maps F_1, …, F_N is taken as input and, through two-dimensional convolution operations, the output F'_1, …, F'_N encodes the second-order OCO relations of all target persons;
the two-dimensional convolutions convert F_i into a query value Q_i, a key value K_i and a result value V_i with the same spatial dimensions as F_i; Q_i, K_i and V_i are the three attention terms of the attention weight, and the three attention terms at each spatial position are computed independently;
\tilde{H}_i(x,y) = \sum_{j=1}^{N} \mathrm{softmax}\!\left( \frac{Q_i(x,y) \cdot K_j(x,y)}{\sqrt{d}} \right) V_j(x,y)    (1)
In formula (1), the softmax term is the attention weight generated from the similarity between the query value Q_i(x, y) of target person i and the key value K_j(x, y) of target person j; Q_i(x, y) denotes the query value at position (x, y), K_j(x, y) the key value at (x, y), \tilde{H}_i(x, y) the result at (x, y) obtained before layer normalization and the dropout mechanism are added, and V_j(x, y) the result value at (x, y); d denotes the dimension of the feature map and is set to 512.
Compared with ordinary operations, this convolution-based computation not only aggregates local information more tightly, but also makes the processing of the data more accurate and sensitive.
To obtain better results, layer normalization and a dropout mechanism can also be added; specifically, they are applied to \tilde{H}_i to obtain H_i, as given by formula (2), which combines the following operations:
ReLU denotes the rectified linear unit, which corrects negative inputs to an output of 0 and passes positive inputs through directly;
Dropout denotes the discarding mechanism used to remove invalid information in the image;
Conv 2D denotes a two-dimensional convolution operation;
Norm denotes a normalization operation, here specifically layer normalization, whose effect is to give the data fed into the same layer the same mean and variance.
The OCO relation feature F'_i is obtained by adding H_i and the previously input OC feature F_i through a residual connection.
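A minimal shape-level illustration of this residual addition (values arbitrary, shapes assumed for the example):

```python
import torch

N, C, H, W = 2, 832, 16, 22
F_i = torch.randn(N, C, H, W)    # first-order OC relation features
H_i = torch.randn(N, C, H, W)    # HRRO output after normalization and dropout

F_prime = F_i + H_i              # F'_i = H_i + F_i (residual addition)
assert F_prime.shape == (N, C, H, W)
```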
Specifically, in order to allow the OCO relation to be inferred over any time interval of the imported video, the invention introduces a target person-background environment feature library (OCFB), which is used to store all background environment information at past and future moments.
First, a standalone OCO relation network without any other feature library is pre-trained; this standalone network is then used to extract the first-order OC relation feature F_i of every target person in the video clip and store it in the target person-background environment feature library OCFB. To avoid confusion, these first-order relation features stored in the OCFB are re-denoted L_i.
From a small time window [t-w, t+w] centered on time t, M OC relation features stored in the feature library, L_1, …, L_M, are extracted as the long-term features; the short-term feature is the first-order OC relation feature F_i at time t. w denotes a non-fixed time length chosen so that one frame is taken before and one frame after time t, i.e., 3 frames are taken within [t-w, t+w] in total.
The interaction between the long-term features stored in the target person-background environment feature library OCFB and the short-term feature is computed by formula (3).
The query value Q_i is still computed from the short-term feature F_i, while the key value K_i and the result value V_i are computed from the first-order relation features L_i stored in the OCFB; the specific expression is given by formula (4).
in a preferred embodiment of the present invention, as shown in fig. 4, the result of the model established based on the relationship between the target person and the background environment and the target person can identify the action of listening performed by the person in the lower block and the action of reading performed by the person in the upper block based on the relationship between the background environment and the persons. This is not achievable with models built with other relationships.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (10)

1. A human body motion detection method suitable for a mobile robot platform is characterized by comprising the following steps:
step one, acquire feature quantities: clip the input video and, using a person detector and a backbone network respectively, extract from the key frame N target person features A_1, A_2, …, A_N ∈ R^C and a set of background environment feature maps X ∈ R^(C×H×W), where C denotes channels, H height, and W width; R denotes the real number field;
step two, generate the encoding of the first-order target person-background environment relation: feed the two groups of features acquired in step one into the person-centered relation network OCR-Net to generate the first-order target person-background environment OC relation feature F_i, and generate its encoding through a convolution operation;
step three, infer the high-order relation: feed the first-order target person-background environment OC relation feature F_i into the high-order relation reasoning operator HRRO and, with the support of the target person-background environment feature library OCFB, compute and infer the second-order target person-background environment-target person OCO relation feature F'_i;
Step four, detecting and identifying actions: obtaining the final second-order target character-background environment-target character OCO relational feature mapping
Figure FDA0003114553970000011
Then, will
Figure FDA0003114553970000012
And importing an action classifier, classifying and judging the action of the target person, and outputting confidence scores of the actions belonging to the action classes.
2. The human motion detection method for mobile robot platform according to claim 1,
in step one, the person detector detects the key frame of the clipped input video and obtains N person objects, generating capture boxes on the key frame; the capture boxes are also copied to the frames adjacent to the key frame; meanwhile, the backbone network extracts spatio-temporal features from the input video clip and applies average pooling to them, thereby obtaining the background environment feature map X ∈ R^(C×H×W); maximum spatial pooling is then applied to this background environment feature map, and a region-of-interest (ROI) alignment operation is performed on it together with the N capture boxes obtained before, generating fixed-size region-of-interest candidate boxes and, in turn, the features A_1, A_2, …, A_N ∈ R^C of the N target persons, each of which is a spatio-temporal representation describing an action within a region of interest.
3. The human motion detection method suitable for the mobile robot platform according to claim 2, wherein the spatiotemporal feature quantity comprises pixel information with human and object features;
the average pooling operation is as follows: a selection window of 2 × 2 pixels picks four points on the key frame, and the average of their pixel values becomes the resulting pixel value, i.e., the original four pixels are reduced to one pixel whose value is their average; the window traverses the whole picture in this way; the maximum spatial pooling operation is as follows: a selection window of 2 × 2 pixels picks four points on the average-pooled picture, and the largest of the four pixel values becomes the resulting pixel value, i.e., the original four pixels are reduced to one pixel whose value is their maximum; the window traverses the whole picture in this way;
average pooling removes useless information while preserving background information to the greatest extent;
maximum pooling amplifies and extracts the feature texture information.
4. The human motion detection method suitable for the mobile robot platform according to claim 2,
in step two, the person-centered relation network OCR-Net first copies each target person feature A_1, A_2, …, A_N ∈ R^C and attaches it to every one of the H × W spatial positions of the background environment feature map, forming a series of concatenated relation feature maps, one per target person; the first-order OC relation F_i of each target person i is then generated and encoded by a convolution operation over its concatenated feature map.
5. The human body motion detection method suitable for the mobile robot platform according to claim 4,
in step three, the first-order relation between target person A_i ∈ R^C and the background environment at spatial position (x, y) is characterized as F_i(x, y), with i ∈ {1, …, N}, x ∈ [1, H], y ∈ [1, W], and the high-order relation between paired OC relations at the same spatial position is then obtained through learning by the high-order relation reasoning operator HRRO.
6. The human motion detection method for a mobile robot platform of claim 5,
the high-order relation between paired OC relations is: two target persons i and j are associated with each other through the same spatial position (x, y); this pairing of F_i(x, y) and F_j(x, y), taken in either order, is recorded and used to evaluate and judge the actions of the two target persons i and j.
7. The human motion detection method for a mobile robot platform of claim 6,
the calculation process of the high-order relational inference operator HRRO is as follows:
the set of first-order OC relation feature maps F_1, …, F_N is taken as input and, through two-dimensional convolution operations, the output F'_1, …, F'_N encodes the second-order OCO relations of all target persons;
the two-dimensional convolutions convert F_i into a query value Q_i, a key value K_i and a result value V_i with the same spatial dimensions as F_i; Q_i, K_i and V_i are the three attention terms of the attention weight, and the three attention terms at each spatial position are computed independently;
\tilde{H}_i(x,y) = \sum_{j=1}^{N} \mathrm{softmax}\!\left( \frac{Q_i(x,y) \cdot K_j(x,y)}{\sqrt{d}} \right) V_j(x,y)    (1)
in formula (1), the softmax term is the attention weight generated from the similarity between the query value Q_i(x, y) of target person i and the key value K_j(x, y) of target person j; Q_i(x, y) denotes the query value at position (x, y), K_j(x, y) the key value at (x, y), \tilde{H}_i(x, y) the result at (x, y) obtained before layer normalization and the dropout mechanism are added, and V_j(x, y) the result value at (x, y); d denotes the dimension of the feature map and is set to 512.
8. The human motion detection method for a mobile robotic platform of claim 7,
layer normalization and the dropout mechanism are applied to \tilde{H}_i to obtain H_i, as given by formula (2), which combines the following operations:
ReLU denotes the rectified linear unit, which corrects negative inputs to an output of 0 and passes positive inputs through directly;
Dropout denotes the discarding mechanism used to remove invalid information in the image;
Conv 2D denotes a two-dimensional convolution operation;
Norm denotes a normalization operation, here specifically layer normalization, whose effect is to give the data fed into the same layer the same mean and variance.
9. The human motion detection method for a mobile robotic platform of claim 7,
the target person-background environment feature library OCFB is used to store all background environment information at past and future moments;
first, a standalone OCO relation network without any other feature library is pre-trained; this standalone network is then used to extract the first-order OC relation feature F_i of every target person in the video clip and store it in the target person-background environment feature library OCFB; to avoid confusion, these first-order relation features stored in the OCFB are re-denoted L_i;
from a small time window [t-w, t+w] centered on time t, M OC relation features stored in the feature library, L_1, …, L_M, are extracted as the long-term features; the short-term feature is the first-order OC relation feature F_i at time t; w denotes a non-fixed time length chosen so that one frame is taken before and one frame after time t, i.e., 3 frames are taken within [t-w, t+w] in total;
the interaction between the long-term features stored in the target person-background environment feature library OCFB and the short-term feature is computed by formula (3);
the query value Q_i is still computed from the short-term feature F_i, while the key value K_i and the result value V_i are computed from the first-order relation features L_i stored in the OCFB; the specific expression is given by formula (4).
10. The human motion detection method of claim 2, wherein the person detector is Faster R-CNN;
the backbone network is I3D;
the key frame is defined as a frame in the video where a key action of target motion or change occurs;
action categories include watch, talk, stand and walk;
the fixed-size region-of-interest candidate box is generated by uniformly dividing the key frame into 7 × 7 regions.
CN202110659014.4A 2021-06-15 2021-06-15 Human body action detection method suitable for mobile robot platform Active CN113283381B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110659014.4A CN113283381B (en) 2021-06-15 2021-06-15 Human body action detection method suitable for mobile robot platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110659014.4A CN113283381B (en) 2021-06-15 2021-06-15 Human body action detection method suitable for mobile robot platform

Publications (2)

Publication Number Publication Date
CN113283381A true CN113283381A (en) 2021-08-20
CN113283381B CN113283381B (en) 2024-04-05

Family

ID=77284429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110659014.4A Active CN113283381B (en) 2021-06-15 2021-06-15 Human body action detection method suitable for mobile robot platform

Country Status (1)

Country Link
CN (1) CN113283381B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492581A (en) * 2018-11-09 2019-03-19 中国石油大学(华东) A kind of human motion recognition method based on TP-STG frame
CN110765967A (en) * 2019-10-30 2020-02-07 腾讯科技(深圳)有限公司 Action recognition method based on artificial intelligence and related device
CN111209897A (en) * 2020-03-09 2020-05-29 腾讯科技(深圳)有限公司 Video processing method, device and storage medium
CN112364757A (en) * 2020-11-09 2021-02-12 大连理工大学 Human body action recognition method based on space-time attention mechanism
CN112464875A (en) * 2020-12-09 2021-03-09 南京大学 Method and device for detecting human-object interaction relationship in video
WO2021042547A1 (en) * 2019-09-04 2021-03-11 平安科技(深圳)有限公司 Behavior identification method, device and computer-readable storage medium
WO2021073311A1 (en) * 2019-10-15 2021-04-22 华为技术有限公司 Image recognition method and apparatus, computer-readable storage medium and chip

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492581A (en) * 2018-11-09 2019-03-19 中国石油大学(华东) A kind of human motion recognition method based on TP-STG frame
WO2021042547A1 (en) * 2019-09-04 2021-03-11 平安科技(深圳)有限公司 Behavior identification method, device and computer-readable storage medium
WO2021073311A1 (en) * 2019-10-15 2021-04-22 华为技术有限公司 Image recognition method and apparatus, computer-readable storage medium and chip
CN110765967A (en) * 2019-10-30 2020-02-07 腾讯科技(深圳)有限公司 Action recognition method based on artificial intelligence and related device
CN111209897A (en) * 2020-03-09 2020-05-29 腾讯科技(深圳)有限公司 Video processing method, device and storage medium
CN112364757A (en) * 2020-11-09 2021-02-12 大连理工大学 Human body action recognition method based on space-time attention mechanism
CN112464875A (en) * 2020-12-09 2021-03-09 南京大学 Method and device for detecting human-object interaction relationship in video

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谭论正 et al.: "Human body action recognition based on the pLSA model", Journal of National University of Defense Technology (国防科技大学学报), no. 05

Also Published As

Publication number Publication date
CN113283381B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
US11144786B2 (en) Information processing apparatus, method for controlling information processing apparatus, and storage medium
CN109819208B (en) Intensive population security monitoring management method based on artificial intelligence dynamic monitoring
Vishnu et al. Human fall detection in surveillance videos using fall motion vector modeling
EP3633615A1 (en) Deep learning network and average drift-based automatic vessel tracking method and system
Charfi et al. Optimized spatio-temporal descriptors for real-time fall detection: comparison of support vector machine and Adaboost-based classification
US7831087B2 (en) Method for visual-based recognition of an object
CN113139437B (en) Helmet wearing inspection method based on YOLOv3 algorithm
CN112818925A (en) Urban building and crown identification method
CN113743260B (en) Pedestrian tracking method under condition of dense pedestrian flow of subway platform
CN113516664A (en) Visual SLAM method based on semantic segmentation dynamic points
CN112861785A (en) Shielded pedestrian re-identification method based on example segmentation and image restoration
CN113781519A (en) Target tracking method and target tracking device
CN116758475A (en) Energy station abnormal behavior early warning method based on multi-source image recognition and deep learning
Hermina et al. A Novel Approach to Detect Social Distancing Among People in College Campus
CN113781563B (en) Mobile robot loop detection method based on deep learning
CN113256731A (en) Target detection method and device based on monocular vision
CN112668493A (en) Reloading pedestrian re-identification, positioning and tracking system based on GAN and deep learning
CN112418096A (en) Method and device for detecting falling and robot
CN113283381A (en) Human body action detection method suitable for mobile robot platform
CN116740607A (en) Video processing method and device, electronic equipment and storage medium
Dimas et al. Self-supervised soft obstacle detection for safe navigation of visually impaired people
CN115147921B (en) Multi-domain information fusion-based key region target abnormal behavior detection and positioning method
CN112541403B (en) Indoor personnel falling detection method by utilizing infrared camera
CN116030335A (en) Visual positioning method and system based on indoor building framework constraint
KR20190049100A (en) System and Method for Managing Unexpected Situation in Tunnel

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant