CN113283381A - Human body action detection method suitable for mobile robot platform - Google Patents

Human body action detection method suitable for mobile robot platform

Info

Publication number
CN113283381A
CN113283381A (application CN202110659014.4A)
Authority
CN
China
Prior art keywords
value
order
background environment
target
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110659014.4A
Other languages
Chinese (zh)
Other versions
CN113283381B (en)
Inventor
朱文俊
孙阳
易阳
张梦怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Tech University
Original Assignee
Nanjing Tech University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Tech University filed Critical Nanjing Tech University
Priority to CN202110659014.4A priority Critical patent/CN113283381B/en
Publication of CN113283381A publication Critical patent/CN113283381A/en
Application granted granted Critical
Publication of CN113283381B publication Critical patent/CN113283381B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a human body action detection method suitable for a mobile robot platform, which comprises the following steps: step one, acquire feature quantities: clip the input video and, using a person detector and a backbone network respectively, extract from the key frame N target person features A_1, A_2, …, A_N ∈ R^C and a set of background environment feature maps X ∈ R^(C×H×W), where C denotes channels, H height, and W width; R denotes the real number field. The invention models the high-order interaction relation in the form of the target person-background environment-target person relation (OCOR), infers the indirect relations between multiple target persons and the background environment, and thus localizes actions more accurately and efficiently. The overall design is simple and flexible, makes full use of information from the background environment and other objects, and effectively improves the accuracy of target action detection.

Description

Human body action detection method suitable for mobile robot platform
Technical Field
The invention relates to the technical field of robot application, in particular to a human body action detection method suitable for a mobile robot platform.
Background
As an important branch of video understanding, human action detection technology is being widely applied. At present, mobile robots mostly avoid obstacles passively, relying on lidar, infrared sensing and similar means; once an emergency occurs (for example, a passer-by suddenly appears in the robot's path), the robot brakes abruptly, which greatly shortens the service life of its motors. Meanwhile, in some complex environments, unsafe events such as theft, robbery and people falling down occur from time to time, and relying solely on human judgment of video surveillance suffers from incomplete coverage and low efficiency. To address these problems, a human action perception technology is mounted on the vision platform of the mobile robot, so that the robot can actively avoid obstacles according to human actions and, at the same time, provide a more reliable basis for safety monitoring of the environment.
Video-based human action localization and recognition has long been a challenging high-level task in video understanding. The current state of the art in this field directly models the pairwise interaction between two target objects and then infers their actions. In reality, however, relations between objects are not always pairwise; cues that provide more accurate information often lie in the less obvious interactions between a target and its surroundings, i.e., higher-order relations derived from direct first-order relations. Much prior work on modeling such higher-order interactions adds a pre-trained object detector on top of the original network, which complicates the network structure and limits its use. To solve these problems, the present invention proposes a target person-background environment-target person relation network (OCOR-Net) as its technical core. The network models the high-order interaction relation in the form of the target person-background environment-target person relation (OCOR) and infers the indirect relations between multiple target persons and the background environment, thereby localizing and recognizing actions more accurately and efficiently. Compared with previous approaches, the network only requires the features of the target persons and the background environment as input, the backbone needs no object detector with predefined classes, and the overall design is simpler and more flexible; moreover, information from the background environment and other objects is fully exploited, which effectively improves the accuracy of target action detection.
Disclosure of Invention
The invention aims to overcome the technical defects in the prior art, solve the technical problems and provide a human body motion detection method suitable for a mobile robot platform.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
a human body motion detection method suitable for a mobile robot platform comprises the following steps:
step one, acquire feature quantities: clip the input video and, using a person detector and a backbone network respectively, extract from the key frames N target person features A_1, A_2, …, A_N ∈ R^C and a set of background environment feature maps X ∈ R^(C×H×W), where C denotes channels, H height, and W width; R denotes the real number field;
step two, generate the encoding of the first-order target person-background environment relation: feed the two groups of features acquired in step one into the person-centered relation network OCR-Net to generate the first-order target person-background environment OC relation feature F_i, and generate its encoding through a convolution operation;
step three, infer the high-order relation: feed the first-order target person-background environment OC relation feature F_i into the high-order relation reasoning operator HRRO and, with the support of the target person-background environment feature library OCFB, compute and infer the second-order target person-background environment-target person OCO relation feature F'_i;
step four, detect and recognize the action: after obtaining the final second-order target person-background environment-target person OCO relation feature map F'_i, import F'_i into the action classifier, classify and judge the action of the target person, and output the confidence score of each action class to which it belongs.
In step one, the person detector detects the key frame of the clipped input video and obtains N person objects, generating capture boxes on the key frame; the capture boxes are also copied to the frames adjacent to the key frame. Meanwhile, the backbone network extracts spatio-temporal features from the input video clip and applies average pooling to them, thereby obtaining the background environment feature map X ∈ R^(C×H×W). Maximum spatial pooling is then applied to this background environment feature map, and a region-of-interest (ROI) alignment operation is performed on it together with the N capture boxes obtained before, generating fixed-size region-of-interest candidate boxes and, in turn, the features A_1, A_2, …, A_N ∈ R^C of the N target persons, each of which is a spatio-temporal representation describing an action within a region of interest.
The spatio-temporal features contain pixel information carrying person and object features;
the average pooling operation is as follows: a selection window of 2 × 2 pixels picks four points on the key frame, and the average of their pixel values becomes the resulting pixel value, i.e., the original four pixels are reduced to one pixel whose value is their average; the window traverses the whole picture in this way. The maximum spatial pooling operation is as follows: a selection window of 2 × 2 pixels picks four points on the average-pooled picture, and the largest of the four pixel values becomes the resulting pixel value, i.e., the original four pixels are reduced to one pixel whose value is their maximum; the window traverses the whole picture in this way;
average pooling removes useless information while preserving background information to the greatest extent;
maximum pooling amplifies and extracts the feature texture information.
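To make this feature-acquisition step concrete, the following PyTorch sketch combines temporal average pooling, 2 × 2 maximum spatial pooling and ROI alignment in one plausible order; the tensor shapes, the spatial_scale value and the use of torchvision's roi_align are assumptions made for illustration, not details fixed by the method.

```python
import torch
import torch.nn.functional as nnf
from torchvision.ops import roi_align

# Assumed shapes: a backbone (e.g. I3D) clip feature with C channels,
# T frames and an H x W spatial grid.
C, T, H, W = 832, 8, 16, 22
clip_feat = torch.randn(1, C, T, H, W)

# Average pooling over the temporal axis gives the background
# environment feature map X in R^(C x H x W).
X = clip_feat.mean(dim=2)                                  # (1, C, H, W)

# 2 x 2 maximum spatial pooling, as described above.
X_max = nnf.max_pool2d(X, kernel_size=2)                   # (1, C, H/2, W/2)

# N capture boxes from the person detector on the key frame,
# given as (batch_index, x1, y1, x2, y2) in assumed input-image pixels.
boxes = torch.tensor([[0., 10., 20., 120., 200.],
                      [0., 150., 15., 260., 230.]])

# ROI alignment produces a fixed-size 7 x 7 candidate region per box;
# pooling each region spatially yields one target person feature A_i in R^C.
rois = roi_align(X_max, boxes, output_size=(7, 7), spatial_scale=1.0 / 32)
A = rois.amax(dim=(2, 3))                                  # (N, C) target person features
```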
In step two, the person-centered relation network OCR-Net first copies each target person feature A_1, A_2, …, A_N ∈ R^C and attaches it to every one of the H × W spatial positions of the background environment feature map, forming a series of concatenated relation feature maps, one per target person; the first-order OC relation F_i of each target person i is then generated and encoded by a convolution operation over its concatenated feature map.
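A minimal sketch of this copy-concatenate-convolve step is shown below; the channel sizes and the single 1 × 1 convolution are illustrative assumptions, since the method only specifies that F_i is encoded by a convolution over the concatenated feature map.

```python
import torch
import torch.nn as nn

C, H, W, N = 832, 16, 22, 2
X = torch.randn(1, C, H, W)        # background environment feature map
A = torch.randn(N, C)              # target person features A_1..A_N

# Copy each A_i to every H x W spatial position and concatenate it
# with the background environment features along the channel axis.
A_tiled = A[:, :, None, None].expand(N, C, H, W)            # (N, C, H, W)
X_rep = X.expand(N, C, H, W)                                # (N, C, H, W)
concat = torch.cat([A_tiled, X_rep], dim=1)                 # (N, 2C, H, W)

# A convolution over the concatenated map encodes the first-order
# target person-background environment (OC) relation feature F_i.
oc_conv = nn.Conv2d(2 * C, C, kernel_size=1)
F_oc = oc_conv(concat)                                      # (N, C, H, W)
```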
In step three, the first-order relation between target person A_i ∈ R^C and the background environment at spatial position (x, y) is characterized as F_i(x, y), with i ∈ {1, …, N}, x ∈ [1, H], y ∈ [1, W]; the high-order relation between paired OC relations at the same spatial position is then obtained through learning by the high-order relation reasoning operator HRRO.
The high-order relation between paired OC relations is: two target persons i and j are associated with each other through the same spatial position (x, y); this pairing of F_i(x, y) and F_j(x, y), taken in either order, is recorded and used to evaluate and judge the actions of the two target persons i and j.
The computation of the high-order relation reasoning operator HRRO proceeds as follows:
the set of first-order OC relation feature maps F_1, …, F_N is taken as input and, through two-dimensional convolution operations, the output F'_1, …, F'_N encodes the second-order OCO relations of all target persons;
the two-dimensional convolutions convert F_i into a query value Q_i, a key value K_i and a result value V_i with the same spatial dimensions as F_i; Q_i, K_i and V_i are the three attention terms of the attention weight, and the three attention terms at each spatial position are computed independently;
\tilde{H}_i(x,y) = \sum_{j=1}^{N} \mathrm{softmax}\!\left( \frac{Q_i(x,y) \cdot K_j(x,y)}{\sqrt{d}} \right) V_j(x,y)    (1)
In formula (1), the softmax term is the attention weight generated from the similarity between the query value Q_i(x, y) of target person i and the key value K_j(x, y) of target person j; Q_i(x, y) denotes the query value at position (x, y), K_j(x, y) the key value at (x, y), \tilde{H}_i(x, y) the result at (x, y) obtained before layer normalization and the dropout mechanism are added, and V_j(x, y) the result value at (x, y); d denotes the dimension of the feature map and is set to 512.
Layer normalization and the dropout mechanism are then applied to \tilde{H}_i to obtain H_i, as given by formula (2), which combines the following operations:
ReLU denotes the rectified linear unit, which corrects negative inputs to an output of 0 and passes positive inputs through directly;
Dropout denotes the discarding mechanism used to remove invalid information in the image;
Conv 2D denotes a two-dimensional convolution operation;
Norm denotes a normalization operation, here specifically layer normalization, whose effect is to give the data fed into the same layer the same mean and variance.
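The sketch below shows one way formulas (1) and (2) could be realized, treating HRRO as scaled dot-product attention across the N target persons at each spatial position; the 1 × 1 convolutions producing Q, K and V, the exact composition order of ReLU, Conv2D, Dropout and normalization, and the group-norm stand-in for layer normalization are assumptions made for illustration, not the patented implementation.

```python
import torch
import torch.nn as nn

class HRRO(nn.Module):
    """Sketch of the high-order relation reasoning operator (HRRO)."""

    def __init__(self, channels: int, d: int = 512, p_drop: float = 0.1):
        super().__init__()
        self.q = nn.Conv2d(channels, d, kernel_size=1)   # query values Q_i
        self.k = nn.Conv2d(channels, d, kernel_size=1)   # key values K_i
        self.v = nn.Conv2d(channels, d, kernel_size=1)   # result values V_i
        self.out = nn.Conv2d(d, channels, kernel_size=1)
        self.drop = nn.Dropout(p_drop)
        self.norm = nn.GroupNorm(1, channels)            # stand-in for layer normalization
        self.d = d

    def forward(self, F_oc: torch.Tensor) -> torch.Tensor:
        # F_oc: (N, C, H, W), one first-order OC relation map per target person.
        N, C, H, W = F_oc.shape
        Q = self.q(F_oc).flatten(2)                      # (N, d, H*W)
        K = self.k(F_oc).flatten(2)
        V = self.v(F_oc).flatten(2)

        # Attention over persons j, independently at each spatial position:
        # softmax_j( Q_i(x,y) . K_j(x,y) / sqrt(d) ) * V_j(x,y)   -- formula (1)
        att = torch.einsum('idp,jdp->ijp', Q, K) / self.d ** 0.5  # (N, N, H*W)
        att = att.softmax(dim=1)
        H_tilde = torch.einsum('ijp,jdp->idp', att, V)            # (N, d, H*W)
        H_tilde = H_tilde.reshape(N, self.d, H, W)

        # ReLU, Conv2D, Dropout and normalization, then a residual connection
        # with F_i, yielding the second-order OCO relation feature F'_i.
        H_i = self.norm(self.drop(self.out(torch.relu(H_tilde))))
        return F_oc + H_i                                # F'_i = H_i + F_i

# Illustrative usage with the F_oc map from the OCR-Net sketch:
# F_prime = HRRO(channels=832)(F_oc)
```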
The target person-background environment feature library OCFB is used to store all background environment information at past and future moments.
First, a standalone OCO relation network without any other feature library is pre-trained; this standalone network is then used to extract the first-order OC relation feature F_i of every target person in the video clip and store it in the target person-background environment feature library OCFB. To avoid confusion, these first-order relation features stored in the OCFB are re-denoted L_i.
From a small time window [t-w, t+w] centered on time t, M OC relation features stored in the feature library, L_1, …, L_M, are extracted as the long-term features; the short-term feature is the first-order OC relation feature F_i at time t. w denotes a non-fixed time length chosen so that one frame is taken before and one frame after time t, i.e., 3 frames are taken within [t-w, t+w] in total.
The interaction between the long-term features stored in the target person-background environment feature library OCFB and the short-term feature is computed by formula (3).
The query value Q_i is still computed from the short-term feature F_i, while the key value K_i and the result value V_i are computed from the first-order relation features L_i stored in the OCFB; the specific expression is given by formula (4).
The person detector is Faster R-CNN;
the backbone network is I3D;
a key frame is defined as a frame in the video where a key action of target motion or change occurs;
action categories include watch, talk, stand and walk;
the fixed-size region-of-interest candidate box is generated by uniformly dividing the key frame into 7 × 7 regions.
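As an illustrative final stage, a simple multi-label head over the pooled OCO relation feature F'_i can output one confidence score per action class; the sigmoid head, the hidden width and the pooling to a vector are assumptions, while the four example classes follow the categories listed above.

```python
import torch
import torch.nn as nn

ACTION_CLASSES = ["watch", "talk", "stand", "walk"]   # example categories from above

classifier = nn.Sequential(
    nn.Linear(832, 512),        # 832 = assumed channel width of pooled F'_i
    nn.ReLU(),
    nn.Linear(512, len(ACTION_CLASSES)),
    nn.Sigmoid(),               # independent confidence score per action class
)

F_prime_pooled = torch.randn(2, 832)     # one pooled OCO feature per target person
scores = classifier(F_prime_pooled)      # (N, num_classes) confidence scores
```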
The invention has the beneficial effects that:
the invention provides a human body action detection method suitable for a mobile robot platform. Compared with the prior art, the input of the network only needs the characteristics of a target object and a background environment, the backbone network does not need an object detector with a predefined class, and the whole design is simpler and more flexible; moreover, information of background environment and other objects is fully utilized, and the accuracy of target action detection can be effectively improved.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a diagram of an overall action detection and recognition network framework based on a target person-background environment-target person relationship according to the present invention;
FIG. 3 is a schematic diagram of an object person-background environment-object person relationship network (OCOR-Net) equipped with an object person-background environment feature library (OCFB) according to the present invention;
FIG. 4 is a comparison diagram of attention area division modeled by different relationships in motion detection according to the present invention.
Detailed Description
The following describes a human body motion detection method suitable for a mobile robot platform in detail with reference to the accompanying drawings and specific implementation methods.
As shown in fig. 1 to 3, a human body motion detection method suitable for a mobile robot platform includes the following steps:
step one, acquire feature quantities: clip the input video and, using an existing person detector and a backbone network respectively, extract from the key frames N target person features A_1, A_2, …, A_N ∈ R^C and a set of background environment feature maps X ∈ R^(C×H×W), where C denotes channels, H height, and W width; R denotes the real number field;
step two, generate the encoding of the first-order target person-background environment relation: feed the two groups of features acquired in step one into the person-centered relation network OCR-Net to generate the first-order target person-background environment OC relation feature F_i, and generate its encoding through a convolution operation;
step three, infer the high-order relation: feed the first-order target person-background environment OC relation feature F_i into the high-order relation reasoning operator HRRO and, with the support of the target person-background environment feature library OCFB, compute and infer the second-order target person-background environment-target person OCO relation feature F'_i;
step four, detect and recognize the action: after obtaining the final second-order target person-background environment-target person OCO relation feature map F'_i, import F'_i into the action classifier, classify and judge the action of the target person, and output the confidence score of each action class to which it belongs.
Specifically, in step one, after detecting the key frame of the clipped input video, the person detector obtains N person objects and generates capture boxes on the key frame; the capture boxes are also copied to the frames adjacent to the key frame. Meanwhile, the backbone network extracts spatio-temporal features from the input video clip and applies average pooling to them, thereby obtaining the background environment feature map X ∈ R^(C×H×W). Maximum spatial pooling is then applied to this background environment feature map, and a region-of-interest (ROI) alignment operation is performed on it together with the N capture boxes obtained before, generating fixed-size region-of-interest candidate boxes and, in turn, the features A_1, A_2, …, A_N ∈ R^C of the N target persons, each of which is a spatio-temporal representation describing an action within a region of interest.
Specifically, the human detector is Faster R-CNN;
the backbone network is I3D;
a key frame is defined as a frame in the video where a key action of target motion or change occurs;
action categories include watch, talk, stand and walk;
the fixed-size region-of-interest candidate box is generated by uniformly dividing the key frame into 7 × 7 regions.
Specifically, the spatio-temporal features contain pixel information carrying person and object features;
the average pooling operation is as follows: a selection window of 2 × 2 pixels picks four points on the key frame, and the average of their pixel values becomes the resulting pixel value, i.e., the original four pixels are reduced to one pixel whose value is their average; the window traverses the whole picture in this way. The maximum spatial pooling operation is as follows: a selection window of 2 × 2 pixels picks four points on the average-pooled picture, and the largest of the four pixel values becomes the resulting pixel value, i.e., the original four pixels are reduced to one pixel whose value is their maximum; the window traverses the whole picture in this way;
average pooling removes useless information while preserving background information to the greatest extent;
maximum pooling amplifies and extracts the feature texture information.
Specifically, in step two, the person-centered relation network OCR-Net first copies each target person feature A_1, A_2, …, A_N ∈ R^C and attaches it to every one of the H × W spatial positions of the background environment feature map, forming a series of concatenated relation feature maps, one per target person; the first-order OC relation F_i of each target person i is then generated and encoded by a convolution operation over its concatenated feature map.
Specifically, in step three, the first-order relation between target person A_i ∈ R^C and the background environment at spatial position (x, y) is denoted F_i(x, y), with i ∈ {1, …, N}, x ∈ [1, H], y ∈ [1, W], and the high-order relation between paired OC relations at the same spatial position is then obtained through learning by the high-order relation reasoning operator HRRO.
Since there are a large number of OC relation features F_i(x, y), i ∈ {1, …, N}, x ∈ [1, H], y ∈ [1, W], in a clipped video, the number of possible pairwise combinations is considerable. To make better use of these feature data, a high-order relation reasoning operator (HRRO) is introduced into the network design. The operator can learn the high-order relation between paired OC relations at the same spatial position; for example, two target persons i and j are connected with each other through the same spatial background information (x, y), and the pairing of F_i(x, y) and F_j(x, y), taken in either order, is used to evaluate their actions.
Specifically, the computation of the high-order relation reasoning operator HRRO proceeds as follows:
the set of first-order OC relation feature maps F_1, …, F_N is taken as input and, through two-dimensional convolution operations, the output F'_1, …, F'_N encodes the second-order OCO relations of all target persons;
the two-dimensional convolutions convert F_i into a query value Q_i, a key value K_i and a result value V_i with the same spatial dimensions as F_i; Q_i, K_i and V_i are the three attention terms of the attention weight, and the three attention terms at each spatial position are computed independently;
\tilde{H}_i(x,y) = \sum_{j=1}^{N} \mathrm{softmax}\!\left( \frac{Q_i(x,y) \cdot K_j(x,y)}{\sqrt{d}} \right) V_j(x,y)    (1)
In formula (1), the softmax term is the attention weight generated from the similarity between the query value Q_i(x, y) of target person i and the key value K_j(x, y) of target person j; Q_i(x, y) denotes the query value at position (x, y), K_j(x, y) the key value at (x, y), \tilde{H}_i(x, y) the result at (x, y) obtained before layer normalization and the dropout mechanism are added, and V_j(x, y) the result value at (x, y); d denotes the dimension of the feature map and is set to 512.
Compared with ordinary operations, this convolution-based computation not only aggregates local information more tightly, but also makes the processing of the data more accurate and sensitive.
To obtain better results, layer normalization and a dropout mechanism can also be added; specifically, they are applied to \tilde{H}_i to obtain H_i, as given by formula (2), which combines the following operations:
ReLU denotes the rectified linear unit, which corrects negative inputs to an output of 0 and passes positive inputs through directly;
Dropout denotes the discarding mechanism used to remove invalid information in the image;
Conv 2D denotes a two-dimensional convolution operation;
Norm denotes a normalization operation, here specifically layer normalization, whose effect is to give the data fed into the same layer the same mean and variance.
The OCO relation feature F'_i is obtained by adding H_i and the previously input OC feature F_i through a residual connection.
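A minimal shape-level illustration of this residual addition (values arbitrary, shapes assumed for the example):

```python
import torch

N, C, H, W = 2, 832, 16, 22
F_i = torch.randn(N, C, H, W)    # first-order OC relation features
H_i = torch.randn(N, C, H, W)    # HRRO output after normalization and dropout

F_prime = F_i + H_i              # F'_i = H_i + F_i (residual addition)
assert F_prime.shape == (N, C, H, W)
```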
Specifically, in order to allow the OCO relation to be inferred over any time interval of the imported video, the invention introduces a target person-background environment feature library (OCFB), which is used to store all background environment information at past and future moments.
First, a standalone OCO relation network without any other feature library is pre-trained; this standalone network is then used to extract the first-order OC relation feature F_i of every target person in the video clip and store it in the target person-background environment feature library OCFB. To avoid confusion, these first-order relation features stored in the OCFB are re-denoted L_i.
From a small time window [t-w, t+w] centered on time t, M OC relation features stored in the feature library, L_1, …, L_M, are extracted as the long-term features; the short-term feature is the first-order OC relation feature F_i at time t. w denotes a non-fixed time length chosen so that one frame is taken before and one frame after time t, i.e., 3 frames are taken within [t-w, t+w] in total.
The interaction between the long-term features stored in the target person-background environment feature library OCFB and the short-term feature is computed by formula (3).
The query value Q_i is still computed from the short-term feature F_i, while the key value K_i and the result value V_i are computed from the first-order relation features L_i stored in the OCFB; the specific expression is given by formula (4).
in a preferred embodiment of the present invention, as shown in fig. 4, the result of the model established based on the relationship between the target person and the background environment and the target person can identify the action of listening performed by the person in the lower block and the action of reading performed by the person in the upper block based on the relationship between the background environment and the persons. This is not achievable with models built with other relationships.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (10)

1. A human body motion detection method suitable for a mobile robot platform is characterized by comprising the following steps:
step one, acquire feature quantities: clip the input video and, using a person detector and a backbone network respectively, extract from the key frame N target person features A_1, A_2, …, A_N ∈ R^C and a set of background environment feature maps X ∈ R^(C×H×W), where C denotes channels, H height, and W width; R denotes the real number field;
step two, generate the encoding of the first-order target person-background environment relation: feed the two groups of features acquired in step one into the person-centered relation network OCR-Net to generate the first-order target person-background environment OC relation feature F_i, and generate its encoding through a convolution operation;
step three, infer the high-order relation: feed the first-order target person-background environment OC relation feature F_i into the high-order relation reasoning operator HRRO and, with the support of the target person-background environment feature library OCFB, compute and infer the second-order target person-background environment-target person OCO relation feature F'_i;
Step four, detecting and identifying actions: obtaining the final second-order target character-background environment-target character OCO relational feature mapping
Figure FDA0003114553970000011
Then, will
Figure FDA0003114553970000012
And importing an action classifier, classifying and judging the action of the target person, and outputting confidence scores of the actions belonging to the action classes.
2. The human motion detection method for mobile robot platform according to claim 1,
in step one, the person detector detects the key frame of the clipped input video and obtains N person objects, generating capture boxes on the key frame; the capture boxes are also copied to the frames adjacent to the key frame; meanwhile, the backbone network extracts spatio-temporal features from the input video clip and applies average pooling to them, thereby obtaining the background environment feature map X ∈ R^(C×H×W); maximum spatial pooling is then applied to this background environment feature map, and a region-of-interest (ROI) alignment operation is performed on it together with the N capture boxes obtained before, generating fixed-size region-of-interest candidate boxes and, in turn, the features A_1, A_2, …, A_N ∈ R^C of the N target persons, each of which is a spatio-temporal representation describing an action within a region of interest.
3. The human motion detection method suitable for the mobile robot platform according to claim 2, wherein the spatiotemporal feature quantity comprises pixel information with human and object features;
the average pooling operation is as follows: a selection window of 2 × 2 pixels picks four points on the key frame, and the average of their pixel values becomes the resulting pixel value, i.e., the original four pixels are reduced to one pixel whose value is their average; the window traverses the whole picture in this way; the maximum spatial pooling operation is as follows: a selection window of 2 × 2 pixels picks four points on the average-pooled picture, and the largest of the four pixel values becomes the resulting pixel value, i.e., the original four pixels are reduced to one pixel whose value is their maximum; the window traverses the whole picture in this way;
average pooling removes useless information while preserving background information to the greatest extent;
maximum pooling amplifies and extracts the feature texture information.
4. The human motion detection method suitable for the mobile robot platform according to claim 2,
in step two, the person-centered relation network OCR-Net first copies each target person feature A_1, A_2, …, A_N ∈ R^C and attaches it to every one of the H × W spatial positions of the background environment feature map, forming a series of concatenated relation feature maps, one per target person; the first-order OC relation F_i of each target person i is then generated and encoded by a convolution operation over its concatenated feature map.
5. The human body motion detection method suitable for the mobile robot platform according to claim 4,
in step three, the first-order relation between target person A_i ∈ R^C and the background environment at spatial position (x, y) is characterized as F_i(x, y), with i ∈ {1, …, N}, x ∈ [1, H], y ∈ [1, W], and the high-order relation between paired OC relations at the same spatial position is then obtained through learning by the high-order relation reasoning operator HRRO.
6. The human motion detection method for a mobile robot platform of claim 5,
the high-order relation between paired OC relations is: two target persons i and j are associated with each other through the same spatial position (x, y); this pairing of F_i(x, y) and F_j(x, y), taken in either order, is recorded and used to evaluate and judge the actions of the two target persons i and j.
7. The human motion detection method for a mobile robot platform of claim 6,
the calculation process of the high-order relational inference operator HRRO is as follows:
the set of first-order OC relation feature maps F_1, …, F_N is taken as input and, through two-dimensional convolution operations, the output F'_1, …, F'_N encodes the second-order OCO relations of all target persons;
the two-dimensional convolutions convert F_i into a query value Q_i, a key value K_i and a result value V_i with the same spatial dimensions as F_i; Q_i, K_i and V_i are the three attention terms of the attention weight, and the three attention terms at each spatial position are computed independently;
\tilde{H}_i(x,y) = \sum_{j=1}^{N} \mathrm{softmax}\!\left( \frac{Q_i(x,y) \cdot K_j(x,y)}{\sqrt{d}} \right) V_j(x,y)    (1)
in formula (1), the softmax term is the attention weight generated from the similarity between the query value Q_i(x, y) of target person i and the key value K_j(x, y) of target person j; Q_i(x, y) denotes the query value at position (x, y), K_j(x, y) the key value at (x, y), \tilde{H}_i(x, y) the result at (x, y) obtained before layer normalization and the dropout mechanism are added, and V_j(x, y) the result value at (x, y); d denotes the dimension of the feature map and is set to 512.
8. The human motion detection method for a mobile robotic platform of claim 7,
layer normalization and the dropout mechanism are applied to \tilde{H}_i to obtain H_i, as given by formula (2), which combines the following operations:
ReLU denotes the rectified linear unit, which corrects negative inputs to an output of 0 and passes positive inputs through directly;
Dropout denotes the discarding mechanism used to remove invalid information in the image;
Conv 2D denotes a two-dimensional convolution operation;
Norm denotes a normalization operation, here specifically layer normalization, whose effect is to give the data fed into the same layer the same mean and variance.
9. The human motion detection method for a mobile robotic platform of claim 7,
the target person-background environment feature library OCFB is used to store all background environment information at past and future moments;
first, a standalone OCO relation network without any other feature library is pre-trained; this standalone network is then used to extract the first-order OC relation feature F_i of every target person in the video clip and store it in the target person-background environment feature library OCFB; to avoid confusion, these first-order relation features stored in the OCFB are re-denoted L_i;
from a small time window [t-w, t+w] centered on time t, M OC relation features stored in the feature library, L_1, …, L_M, are extracted as the long-term features; the short-term feature is the first-order OC relation feature F_i at time t; w denotes a non-fixed time length chosen so that one frame is taken before and one frame after time t, i.e., 3 frames are taken within [t-w, t+w] in total;
the interaction between the long-term features stored in the target person-background environment feature library OCFB and the short-term feature is computed by formula (3);
the query value Q_i is still computed from the short-term feature F_i, while the key value K_i and the result value V_i are computed from the first-order relation features L_i stored in the OCFB; the specific expression is given by formula (4).
10. The human motion detection method of claim 2, wherein the person detector is Faster R-CNN;
the backbone network is I3D;
the key frame is defined as a frame in the video where a key action of target motion or change occurs;
action categories include watch, talk, stand and walk;
the fixed-size region-of-interest candidate box is generated by uniformly dividing the key frame into 7 × 7 regions.
CN202110659014.4A 2021-06-15 2021-06-15 Human body action detection method suitable for mobile robot platform Active CN113283381B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110659014.4A CN113283381B (en) 2021-06-15 2021-06-15 Human body action detection method suitable for mobile robot platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110659014.4A CN113283381B (en) 2021-06-15 2021-06-15 Human body action detection method suitable for mobile robot platform

Publications (2)

Publication Number Publication Date
CN113283381A true CN113283381A (en) 2021-08-20
CN113283381B CN113283381B (en) 2024-04-05

Family

ID=77284429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110659014.4A Active CN113283381B (en) 2021-06-15 2021-06-15 Human body action detection method suitable for mobile robot platform

Country Status (1)

Country Link
CN (1) CN113283381B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492581A (en) * 2018-11-09 2019-03-19 中国石油大学(华东) A kind of human motion recognition method based on TP-STG frame
CN110765967A (en) * 2019-10-30 2020-02-07 腾讯科技(深圳)有限公司 Action recognition method based on artificial intelligence and related device
CN111209897A (en) * 2020-03-09 2020-05-29 腾讯科技(深圳)有限公司 Video processing method, device and storage medium
CN112364757A (en) * 2020-11-09 2021-02-12 大连理工大学 Human body action recognition method based on space-time attention mechanism
CN112464875A (en) * 2020-12-09 2021-03-09 南京大学 Method and device for detecting human-object interaction relationship in video
WO2021042547A1 (en) * 2019-09-04 2021-03-11 平安科技(深圳)有限公司 Behavior identification method, device and computer-readable storage medium
WO2021073311A1 (en) * 2019-10-15 2021-04-22 华为技术有限公司 Image recognition method and apparatus, computer-readable storage medium and chip

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492581A (en) * 2018-11-09 2019-03-19 中国石油大学(华东) A kind of human motion recognition method based on TP-STG frame
WO2021042547A1 (en) * 2019-09-04 2021-03-11 平安科技(深圳)有限公司 Behavior identification method, device and computer-readable storage medium
WO2021073311A1 (en) * 2019-10-15 2021-04-22 华为技术有限公司 Image recognition method and apparatus, computer-readable storage medium and chip
CN110765967A (en) * 2019-10-30 2020-02-07 腾讯科技(深圳)有限公司 Action recognition method based on artificial intelligence and related device
CN111209897A (en) * 2020-03-09 2020-05-29 腾讯科技(深圳)有限公司 Video processing method, device and storage medium
CN112364757A (en) * 2020-11-09 2021-02-12 大连理工大学 Human body action recognition method based on space-time attention mechanism
CN112464875A (en) * 2020-12-09 2021-03-09 南京大学 Method and device for detecting human-object interaction relationship in video

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谭论正 et al.: "Human body action recognition based on the pLSA model", Journal of National University of Defense Technology (国防科技大学学报), no. 05

Also Published As

Publication number Publication date
CN113283381B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
US11144786B2 (en) Information processing apparatus, method for controlling information processing apparatus, and storage medium
CN109819208B (en) Intensive population security monitoring management method based on artificial intelligence dynamic monitoring
Vishnu et al. Human fall detection in surveillance videos using fall motion vector modeling
EP3633615A1 (en) Deep learning network and average drift-based automatic vessel tracking method and system
Charfi et al. Optimized spatio-temporal descriptors for real-time fall detection: comparison of support vector machine and Adaboost-based classification
US7831087B2 (en) Method for visual-based recognition of an object
CN113139437B (en) Helmet wearing inspection method based on YOLOv3 algorithm
CN112818925A (en) Urban building and crown identification method
CN113743260B (en) Pedestrian tracking method under condition of dense pedestrian flow of subway platform
CN113516664A (en) Visual SLAM method based on semantic segmentation dynamic points
CN112861785A (en) Shielded pedestrian re-identification method based on example segmentation and image restoration
CN113781519A (en) Target tracking method and target tracking device
CN116758475A (en) Energy station abnormal behavior early warning method based on multi-source image recognition and deep learning
Hermina et al. A Novel Approach to Detect Social Distancing Among People in College Campus
CN113781563B (en) Mobile robot loop detection method based on deep learning
CN113256731A (en) Target detection method and device based on monocular vision
CN112668493A (en) Reloading pedestrian re-identification, positioning and tracking system based on GAN and deep learning
CN112418096A (en) Method and device for detecting falling and robot
CN113283381A (en) Human body action detection method suitable for mobile robot platform
CN116740607A (en) Video processing method and device, electronic equipment and storage medium
Dimas et al. Self-supervised soft obstacle detection for safe navigation of visually impaired people
CN115147921B (en) Multi-domain information fusion-based key region target abnormal behavior detection and positioning method
CN112541403B (en) Indoor personnel falling detection method by utilizing infrared camera
CN116030335A (en) Visual positioning method and system based on indoor building framework constraint
KR20190049100A (en) System and Method for Managing Unexpected Situation in Tunnel

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant