CN113283381A - Human body action detection method suitable for mobile robot platform - Google Patents
- Publication number
- CN113283381A (application number CN202110659014.4A / CN202110659014A)
- Authority
- CN
- China
- Prior art keywords
- value
- order
- background environment
- target
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
Abstract
The invention provides a human body action detection method suitable for a mobile robot platform, which comprises the following steps. Step one, acquiring feature quantities: clip the input video, and use a person detector and a backbone network respectively to extract N target-person features A_1, A_2, …, A_N ∈ R^C and a set of background-environment feature maps X ∈ R^(C×H×W) from the key frame, where C denotes the channel dimension, H the height, W the width, and R the real number field. The invention models higher-order interactions in the form of target person-background environment-target person relations (OCOR), infers the indirect relations between multiple target persons and the background environment, and thereby localizes actions more accurately and efficiently. The overall design is simple and flexible, makes full use of information from the background environment and other objects, and effectively improves the accuracy of target action detection.
Description
Technical Field
The invention relates to the technical field of robot application, in particular to a human body action detection method suitable for a mobile robot platform.
Background
As an important branch of the field of video understanding, human action detection technology is finding increasingly wide application. At present, mobile robots mostly avoid obstacles passively, relying on lidar, infrared sensing and similar means; once an emergency occurs (for example, a passerby suddenly appears in the robot's path), the mobile robot brakes abruptly, which greatly shortens the service life of the robot's motor. Meanwhile, in some complex environments, unsafe events such as theft, robbery and people falling down occur from time to time, and judgment that relies only on human video monitoring suffers from incomplete coverage and low efficiency. To address these problems, a human action perception technology is mounted on the visual platform of the mobile robot, so that the robot can actively avoid obstacles according to human actions, while also providing a more reliable basis for judgments in environmental safety monitoring.
Video-based human action localization and recognition has long been a relatively challenging high-level task in video understanding. The newer techniques in this field directly model the pairwise interaction between two target objects and then infer their actions, but in reality the relations between objects are not always pairwise; clues that provide more accurate information often lie in the less obvious interactions between a target and its surrounding objects (i.e., higher-order relations derived from direct first-order relations). Much prior work has tried to model such higher-order interactions, but most of it requires adding a pre-trained object detector on top of the original network, which makes the network structure more complex and more limited in use. To solve the above problems, the present invention proposes a target person-background environment-target person relation network (OCOR-Net) as its technical core. The network models higher-order interactions in the form of target person-background environment-target person (OCOR) relations and infers the indirect relations between multiple target persons and the background environment, thereby localizing and recognizing actions more accurately and efficiently. Compared with previous approaches, the network's input requires only the features of the target objects and the background environment, the backbone network needs no object detector with predefined classes, and the overall design is simpler and more flexible; moreover, information from the background environment and other objects is fully utilized, effectively improving the accuracy of target action detection.
Disclosure of Invention
The invention aims to overcome the above technical defects in the prior art and provide a human body motion detection method suitable for a mobile robot platform.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
a human body motion detection method suitable for a mobile robot platform comprises the following steps:
step one, acquiring feature quantities: clip the input video, and use a person detector and a backbone network to extract N target-person features A_1, A_2, …, A_N ∈ R^C and a set of background-environment feature maps X ∈ R^(C×H×W) from the key frames, where C denotes the channel dimension, H the height, W the width, and R the real number field;
step two, generating an encoding of the first-order target person-background environment relation: import the two groups of features acquired in step one into the person-centered relation network OCR-Net to generate the first-order target person-background environment (OC) relation features F_i, which are encoded through a convolution operation;
step three, inferring the higher-order relation: import the first-order target person-background environment OC relation features F_i into the higher-order relation reasoning operator HRRO, and, with the support of the target person-background environment feature bank OCFB, compute and infer the second-order target person-background environment-target person (OCO) relation features F'_i;
step four, detecting and recognizing actions: after obtaining the final second-order target person-background environment-target person OCO relation feature maps F'_i, import them into an action classifier, classify and judge the actions of the target persons, and output confidence scores for each action class to which the actions belong.
In the first step, the person detector detects the key frame of the clipped input video and obtains N person targets, generating capture boxes on the key frame that are also copied to the frames adjacent to the key frame; meanwhile, the backbone network extracts spatio-temporal feature quantities from the input video clip and performs an average pooling operation on them, thereby obtaining the background-environment feature map X ∈ R^(C×H×W); a maximum spatial pooling operation is then performed on the obtained background-environment feature map, and, combined with the N capture boxes obtained before, a region of interest (ROI) alignment operation is applied to it, generating region-of-interest candidate boxes of fixed size and further producing the series of features A_1, A_2, …, A_N ∈ R^C of the N target persons, each of which represents a spatio-temporal representation describing a region of interest or an action.
The spatio-temporal feature quantities comprise pixel information with human and object features;
the average pooling operation is as follows: slide a 2 x 2 selection window over the key frame, take the average of the four pixel values inside the window as the resulting pixel value, i.e. reduce the original four pixels to one pixel whose value is their average, and traverse the whole picture with the window in this manner; the maximum spatial pooling operation is as follows: slide a 2 x 2 selection window over the average-pooled picture, take the maximum of the four pixel values inside the window as the resulting pixel value, i.e. reduce the original four pixels to one pixel whose value is their maximum, and traverse the whole picture with the window in this manner;
the average pooling removes useless information while preserving background information to the greatest extent;
the maximum pooling amplifies and extracts feature texture information.
In the second step, the person-centered relation network OCR-Net first replicates each target-person feature A_1, A_2, …, A_N ∈ R^C and concatenates it onto each of the H x W spatial positions of the background-environment features, forming a series of concatenated relation feature maps; the first-order OC relation feature F_i of each target person i is then obtained by passing these maps through a convolution operation, which produces the relation encoding.
In the third step, the first-order relation between target person A_i ∈ R^C and the background environment at spatial position (x, y) is denoted F_i^(x,y), with i ∈ {1, …, N}, x ∈ [1, H], y ∈ [1, W]; the higher-order relation between paired OC relations at the same spatial position is then obtained through learning with the higher-order relation reasoning operator HRRO.
The higher-order relation between paired OC relations is as follows: two target persons i and j are associated with each other through the same spatial position (x, y), which is recorded as an OCO relation and used for evaluating and judging the actions of the two target persons i and j.
The calculation process of the higher-order relation reasoning operator HRRO is as follows:
a set of first-order OC relation feature maps F_i is taken as the input quantity, and through a two-dimensional convolution operation the output result is the encoding of the second-order OCO relations of all target persons;
the two-dimensional convolution operation converts F_i into a query value Q_i, a key value K_i, and a result value V_i having the same spatial dimensions as F_i; Q_i, K_i and V_i are the three attention quantities from which the attention weights are formed, and the three attention scores at each spatial position are computed independently:
H̃_i^(x,y) = Σ_j softmax_j( Q_i^(x,y) · K_j^(x,y) / √d ) V_j^(x,y)    (1)
In formula (1), softmax_j( Q_i^(x,y) · K_j^(x,y) / √d ) denotes the attention weight generated by passing the similarity between the query value Q_i of target person i and the key value K_j of target person j through the softmax function; Q_i^(x,y) denotes the query value at spatial position (x, y), K_j^(x,y) the key value at (x, y), V_j^(x,y) the result value at (x, y), and H̃_i^(x,y) the result obtained at (x, y) before layer normalization and the dropout mechanism are added; d denotes the dimension of the feature map and is set to 512.
H_i = Norm( Dropout( ReLU( Conv2D( H̃_i ) ) ) )    (2)
In formula (2), ReLU denotes the rectified linear unit activation, which removes invalid information from the image as follows: for negative inputs the rectified output is 0, while positive inputs are passed through directly; Dropout denotes the discarding mechanism;
Conv2D denotes a two-dimensional convolution operation;
Norm denotes a normalization operation, here specifically layer normalization, whose effect is to give the data input to the same layer the same mean and variance.
The target person-background environment feature bank OCFB is used for storing background-environment information at all past and future moments;
firstly, a standalone OCO relation network without any feature bank is pre-trained, and this standalone network is then used to extract the first-order OC relation feature F_i of each target person in the video clip and store it into the target person-background environment feature bank OCFB; to avoid confusion, these first-order relation features stored in the OCFB are redefined as L_i;
M OC relation features stored in the feature bank are extracted from a small time window [t-w, t+w] centered on time t, namely the long-term features, while the short-term feature is the first-order OC relation feature F_i at time t; w denotes a non-fixed time length chosen so that one frame of picture is taken before and one after time t, with 3 frames of pictures taken in total within [t-w, t+w];
the interaction relation between the long-term features stored in the OCFB and the short-term feature is computed by formula (3):
F'_i = Σ_m softmax_m( Q_i · K_m / √d ) V_m,  m ∈ {1, …, M}    (3)
the query value Q_i is still computed from the short-term feature F_i, while the key values K_m and result values V_m are computed from the first-order relation features L_m stored in the OCFB; the specific formula is given in (4):
Q_i = f_q(F_i),  K_m = f_k(L_m),  V_m = f_v(L_m)    (4)
where f_q, f_k and f_v denote the learned query, key and value mappings.
the human detector is fast R-CNN;
the backbone network is I3D;
the definition of the key frame is: the method comprises the steps of (1) indicating a frame where a key action of target motion or change is located in a video;
action categories include watch, talk, stand and walk;
the fixed-size region-of-interest candidate box is generated by uniformly dividing the key frame into 7 × 7 regions.
The invention has the beneficial effects that:
the invention provides a human body action detection method suitable for a mobile robot platform. Compared with the prior art, the input of the network only needs the characteristics of a target object and a background environment, the backbone network does not need an object detector with a predefined class, and the whole design is simpler and more flexible; moreover, information of background environment and other objects is fully utilized, and the accuracy of target action detection can be effectively improved.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a diagram of an overall action detection and recognition network framework based on a target person-background environment-target person relationship according to the present invention;
FIG. 3 is a schematic diagram of an object person-background environment-object person relationship network (OCOR-Net) equipped with an object person-background environment feature library (OCFB) according to the present invention;
FIG. 4 is a comparison diagram of attention area division modeled by different relationships in motion detection according to the present invention.
Detailed Description
The following describes a human body motion detection method suitable for a mobile robot platform in detail with reference to the accompanying drawings and specific implementation methods.
As shown in fig. 1 to 3, a human body motion detection method suitable for a mobile robot platform includes the following steps:
step one, acquiring feature quantities: clip the input video, and use the existing person detector and backbone network respectively to extract N target-person features A_1, A_2, …, A_N ∈ R^C and a set of background-environment feature maps X ∈ R^(C×H×W) from the key frames, where C denotes the channel dimension, H the height, W the width, and R the real number field;
step two, generating an encoding of the first-order target person-background environment relation: import the two groups of features acquired in step one into the person-centered relation network OCR-Net to generate the first-order target person-background environment (OC) relation features F_i, which are encoded through a convolution operation;
step three, inferring the higher-order relation: import the first-order target person-background environment OC relation features F_i into the higher-order relation reasoning operator HRRO, and, with the support of the target person-background environment feature bank OCFB, compute and infer the second-order target person-background environment-target person (OCO) relation features F'_i;
step four, detecting and recognizing actions: after obtaining the final second-order target person-background environment-target person OCO relation feature maps F'_i, import them into an action classifier, classify and judge the actions of the target persons, and output confidence scores for each action class to which the actions belong.
Specifically, in the first step, the person detector detects the key frame of the clipped input video and obtains N person targets, generating capture boxes on the key frame that are also copied to the frames adjacent to the key frame; meanwhile, the backbone network extracts spatio-temporal feature quantities from the input video clip and performs an average pooling operation on them, thereby obtaining the background-environment feature map X ∈ R^(C×H×W); a maximum spatial pooling operation is then performed on the obtained background-environment feature map, and, combined with the N capture boxes obtained before, a region of interest (ROI) alignment operation is applied to it, generating region-of-interest candidate boxes of fixed size and further producing the series of features A_1, A_2, …, A_N ∈ R^C of the N target persons, each of which represents a spatio-temporal representation describing a region of interest or an action.
Specifically, the human detector is Faster R-CNN;
the backbone network is I3D;
the key frame is defined as the frame in the video in which the key action of the target's motion or change is located;
action categories include watch, talk, stand and walk;
the fixed-size region-of-interest candidate box is generated by uniformly dividing the key frame into 7 × 7 regions.
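The fixed-size candidate generation described above can be illustrated with a minimal NumPy sketch. This is a simplification, not the patent's implementation: ROI Align's bilinear sampling is replaced by hard binning of a capture box into a 7 x 7 grid, and the function name roi_pool and the toy feature map are illustrative assumptions.

```python
import numpy as np

def roi_pool(fmap, box, out=7):
    """Crop a (C, H, W) background feature map to a person's capture box
    and max-pool it into a fixed out x out grid of bins."""
    C, H, W = fmap.shape
    y0, x0, y1, x1 = box
    crop = fmap[:, y0:y1, x0:x1]
    ys = np.linspace(0, crop.shape[1], out + 1).astype(int)
    xs = np.linspace(0, crop.shape[2], out + 1).astype(int)
    pooled = np.zeros((C, out, out))
    for i in range(out):
        for j in range(out):
            # each bin covers a sub-rectangle of the crop (at least 1 pixel)
            cell = crop[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                           xs[j]:max(xs[j + 1], xs[j] + 1)]
            pooled[:, i, j] = cell.max(axis=(1, 2))
    return pooled

fmap = np.arange(2 * 20 * 20, dtype=float).reshape(2, 20, 20)
pooled = roi_pool(fmap, (2, 2, 16, 16))   # one capture box -> 7 x 7 candidate
```

In the patent's pipeline this fixed-size map would then be reduced to the per-person feature vector A_i ∈ R^C.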
Specifically, the spatio-temporal feature quantity includes pixel information having human and object features;
the operation method of the average pooling comprises the following steps: selecting four points on the key frame by using a selected frame with 2 x 2 pixel points, then taking the average value of the pixel values of the four points as the pixel result value after processing, namely reducing the original four pixel points into one pixel point, wherein the value of the pixel point is the average value of the four pixel points, and traversing the whole picture by the selected frame according to the process; the operation method of the maximum space pooling comprises the following steps: selecting four points on the average pooled picture by using a selected frame with 2 x 2 pixel points, then taking the value with the maximum pixel value in the four points as a pixel result value after processing, namely reducing the original four pixel points into one pixel point, wherein the value of the pixel point is the maximum value of the four pixel points, and traversing the whole picture by using the selected frame according to the process;
the average pooling can remove useless information and simultaneously reserve background information to the maximum extent;
and amplifying and extracting the characteristic texture information if the maximum pooling is carried out.
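The two 2 x 2 pooling operations described above can be sketched in a few lines of NumPy; the function name pool2x2 and the toy 4 x 4 frame are illustrative, not from the patent.

```python
import numpy as np

def pool2x2(img, mode="avg"):
    """2x2 non-overlapping pooling: every 2x2 block of pixels is reduced
    to one pixel (the block average or maximum), halving both dimensions."""
    h, w = img.shape
    blocks = img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    if mode == "avg":
        return blocks.mean(axis=(1, 3))
    return blocks.max(axis=(1, 3))

frame = np.array([[1., 2., 5., 6.],
                  [3., 4., 7., 8.],
                  [0., 0., 1., 1.],
                  [0., 4., 1., 3.]])
avg = pool2x2(frame, "avg")   # background-preserving average pooling
mx = pool2x2(avg, "max")      # texture-amplifying max pooling on the result
```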
Specifically, in step two, the person-centered relation network OCR-Net first replicates each target-person feature A_1, A_2, …, A_N ∈ R^C and concatenates it onto each of the H x W spatial positions of the background-environment features, forming a series of concatenated relation feature maps; the first-order OC relation feature F_i of each target person i is then obtained by passing these maps through a convolution operation, which produces the relation encoding.
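The replicate-and-concatenate step of OCR-Net can be sketched as follows, a minimal NumPy illustration with toy dimensions; the variable names are assumptions, and the subsequent convolution that produces F_i is omitted.

```python
import numpy as np

C, H, W, N = 4, 3, 3, 2
rng = np.random.default_rng(0)
X = rng.random((C, H, W))   # background-environment feature map
A = rng.random((N, C))      # N target-person features A_i

# Replicate each person feature over every H x W spatial position and
# concatenate it with the background features along the channel axis,
# giving one (2C, H, W) person-centered relation map per target person.
F = np.stack([
    np.concatenate(
        [np.broadcast_to(A[i][:, None, None], (C, H, W)), X], axis=0)
    for i in range(N)
])
```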
Specifically, in step three, the first-order relation between target person A_i ∈ R^C and the background environment at spatial position (x, y) is denoted F_i^(x,y), with i ∈ {1, …, N}, x ∈ [1, H], y ∈ [1, W]; the higher-order relation between paired OC relations at the same spatial position is then obtained through learning with the higher-order relation reasoning operator HRRO.
Since there are a large number of OC relation features F_i^(x,y), i ∈ {1, …, N}, x ∈ [1, H], y ∈ [1, W], in a clipped video, the number of possible pairwise combinations is quite large; to make better use of this feature data, the higher-order relation reasoning operator (HRRO) is introduced into the network design. This operator can learn the higher-order relation between paired OC relations at the same spatial position: for example, two target persons i and j are connected to each other through the same spatial background information at (x, y), which can be recorded as an OCO relation and used for evaluating their actions.
Specifically, the calculation process of the higher-order relation reasoning operator HRRO is as follows:
a set of first-order OC relation feature maps F_i is taken as the input quantity, and through a two-dimensional convolution operation the output result is the encoding of the second-order OCO relations of all target persons;
the two-dimensional convolution operation converts F_i into a query value Q_i, a key value K_i, and a result value V_i having the same spatial dimensions as F_i; Q_i, K_i and V_i are the three attention quantities from which the attention weights are formed, and the three attention scores at each spatial position are computed independently:
H̃_i^(x,y) = Σ_j softmax_j( Q_i^(x,y) · K_j^(x,y) / √d ) V_j^(x,y)    (1)
In formula (1), softmax_j( Q_i^(x,y) · K_j^(x,y) / √d ) denotes the attention weight generated by passing the similarity between the query value Q_i of target person i and the key value K_j of target person j through the softmax function; Q_i^(x,y) denotes the query value at spatial position (x, y), K_j^(x,y) the key value at (x, y), V_j^(x,y) the result value at (x, y), and H̃_i^(x,y) the result obtained at (x, y) before layer normalization and the dropout mechanism are added; d denotes the dimension of the feature map and is set to 512.
Compared with ordinary operations, this convolution-based calculation not only aggregates local information more tightly, but also makes the data processing more accurate and sensitive.
To obtain better results, layer normalization and a dropout mechanism can also be added; specifically, applying these operations to H̃_i yields H_i:
H_i = Norm( Dropout( ReLU( Conv2D( H̃_i ) ) ) )    (2)
In formula (2), ReLU denotes the rectified linear unit activation, which removes invalid information from the image as follows: for negative inputs the rectified output is 0, while positive inputs are passed through directly; Dropout denotes the discarding mechanism;
Conv2D denotes a two-dimensional convolution operation;
Norm denotes a normalization operation, here specifically layer normalization, whose effect is to give the data input to the same layer the same mean and variance.
The OCO relation feature F'_i is obtained by adding H_i and the previously input OC feature F_i through a residual connection.
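A minimal NumPy sketch of the HRRO computation, under stated simplifications: the 1 x 1 two-dimensional convolutions are modeled as random per-position linear maps, the ReLU/layer-normalization/dropout of formula (2) are omitted, and only the per-position attention of formula (1) plus the residual connection is shown. All names are illustrative.

```python
import numpy as np

def hrro(F, seed=0):
    # F: (N, C, H, W) -- one first-order OC relation map per target person.
    N, C, H, W = F.shape
    rng = np.random.default_rng(seed)
    # 1x1 two-dimensional convolutions == independent linear maps over channels
    Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
    Q = np.einsum('dc,nchw->ndhw', Wq, F)   # query values Q_i
    K = np.einsum('dc,nchw->ndhw', Wk, F)   # key values K_i
    V = np.einsum('dc,nchw->ndhw', Wv, F)   # result values V_i
    # similarity of person i's query and person j's key at every (x, y)
    S = np.einsum('idhw,jdhw->ijhw', Q, K) / np.sqrt(C)
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)             # softmax over persons j
    H_tilde = np.einsum('ijhw,jdhw->idhw', A, V)  # attention output of (1)
    return F + H_tilde                            # residual connection: F'_i

F = np.arange(3 * 4 * 2 * 2, dtype=float).reshape(3, 4, 2, 2) / 10.0
F_prime = hrro(F)   # second-order OCO relation features for N = 3 persons
```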
Specifically, in order to allow the inference process of the OCO relations to proceed over any time interval of the imported video, the invention introduces a target person-background environment feature bank (OCFB), which is used for storing background-environment information at all past and future moments;
firstly, a standalone OCO relation network without any feature bank is pre-trained, and this standalone network is then used to extract the first-order OC relation feature F_i of each target person in the video clip and store it into the target person-background environment feature bank OCFB; to avoid confusion, these first-order relation features stored in the OCFB are redefined as L_i;
M OC relation features stored in the feature bank are extracted from a small time window [t-w, t+w] centered on time t, namely the long-term features, while the short-term feature is the first-order OC relation feature F_i at time t; w denotes a non-fixed time length chosen so that one frame of picture is taken before and one after time t, with 3 frames of pictures taken in total within [t-w, t+w];
the interaction relation between the long-term features stored in the OCFB and the short-term feature is computed by formula (3):
F'_i = Σ_m softmax_m( Q_i · K_m / √d ) V_m,  m ∈ {1, …, M}    (3)
the query value Q_i is still computed from the short-term feature F_i, while the key values K_m and result values V_m are computed from the first-order relation features L_m stored in the OCFB; the specific formula is given in (4):
Q_i = f_q(F_i),  K_m = f_k(L_m),  V_m = f_v(L_m)    (4)
where f_q, f_k and f_v denote the learned query, key and value mappings.
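The long-term/short-term interaction just described can be sketched as follows. This is an illustrative simplification, not the patent's implementation: features are flattened to vectors, and the random matrices Wq, Wk, Wv stand in for the learned query/key/value mappings, with the query taken from the short-term feature F_i and the keys and values from the bank entries L_m.

```python
import numpy as np

def ocfb_attention(F_i, L, d=8, seed=0):
    # F_i: (C,) short-term OC feature at time t (flattened for illustration)
    # L:   (M, C) long-term features L_m read out of the feature bank
    C = F_i.shape[0]
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((d, C)) / np.sqrt(C) for _ in range(3))
    q = Wq @ F_i            # query from the short-term feature
    K = L @ Wk.T            # keys from the stored long-term features
    V = L @ Wv.T            # values from the stored long-term features
    s = K @ q / np.sqrt(d)  # similarity with each of the M bank entries
    w = np.exp(s - s.max())
    w /= w.sum()            # softmax over the M bank entries
    return w @ V            # long/short-term interaction feature

F_i = np.ones(16)                              # short-term feature at time t
L = np.random.default_rng(2).random((3, 16))   # M = 3 stored features L_m
out = ocfb_attention(F_i, L)
```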
in a preferred embodiment of the present invention, as shown in fig. 4, the result of the model established based on the relationship between the target person and the background environment and the target person can identify the action of listening performed by the person in the lower block and the action of reading performed by the person in the upper block based on the relationship between the background environment and the persons. This is not achievable with models built with other relationships.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.
Claims (10)
1. A human body motion detection method suitable for a mobile robot platform is characterized by comprising the following steps:
step one, acquiring feature quantities: clip the input video, and use a person detector and a backbone network respectively to extract N target person features A_1, A_2, …, A_N ∈ R^C and a background environment feature map X ∈ R^(C×H×W) from the key frame, where C denotes the channel dimension, H the height, and W the width; R denotes the real number domain;
step two, generating the code of the first-order target person-background environment relationship: import the two groups of features acquired in step one into the person-centered relational network OCR-Net to generate the first-order target person-background environment (OC) relational feature F_i, whose code is generated through a convolution operation;
step three, inferring the high-order relationship: import the first-order target person-background environment OC relational feature F_i into the high-order relational reasoning operator HRRO, and compute and infer the second-order target person-background environment-target person (OCO) relational feature F'_i with the support of the target person-background environment feature library OCFB;
step four, action detection and recognition: after the final second-order target person-background environment-target person OCO relational feature map is obtained, import it into an action classifier, classify and judge the action of the target person, and output a confidence score for each action class to which the action may belong.
2. The human motion detection method for mobile robot platform according to claim 1,
in the first step, after the person detector detects the key frame of the clipped input video, N person objects are obtained and capture boxes are generated on the key frame; the capture boxes are also copied to the frames adjacent to the key frame. Meanwhile, the backbone network extracts spatio-temporal feature quantities from the input video clip and performs an average pooling operation on them, thereby obtaining the background environment feature map X ∈ R^(C×H×W). A maximum spatial pooling operation is then performed on the background environment feature map, and a region-of-interest (ROI) alignment operation is performed on it in combination with the N capture boxes obtained before, generating fixed-size region-of-interest candidate boxes and further producing the features A_1, A_2, …, A_N ∈ R^C of the N target persons, each of which is a spatio-temporal representation describing the action in a region of interest.
3. The human motion detection method suitable for the mobile robot platform according to claim 2, wherein the spatiotemporal feature quantity comprises pixel information with human and object features;
the average pooling operation is as follows: a selection box of 2 × 2 pixels selects four points on the key frame, and the average of the four pixel values is taken as the processed pixel value; that is, the original four pixels are reduced to one pixel whose value is their average, and the selection box traverses the whole picture in this manner. The maximum spatial pooling operation is as follows: a selection box of 2 × 2 pixels selects four points on the average-pooled picture, and the largest of the four pixel values is taken as the processed pixel value; that is, the original four pixels are reduced to one pixel whose value is their maximum, and the selection box traverses the whole picture in this manner;
the average pooling removes useless information while preserving background information to the maximum extent;
the maximum pooling amplifies and extracts feature texture information.
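The two 2 × 2 pooling operations described in claim 3 can be sketched as follows (a minimal NumPy sketch; the function name pool2x2 is illustrative):

```python
import numpy as np

def pool2x2(img, mode="avg"):
    """Pool a 2-D image with a 2x2 selection box and stride 2.

    'avg' reduces four pixels to their mean (average pooling);
    'max' reduces four pixels to their maximum (maximum spatial pooling).
    """
    h, w = img.shape
    # Traverse the whole picture with non-overlapping 2x2 blocks.
    blocks = img[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2)
    if mode == "avg":
        return blocks.mean(axis=(1, 3))  # four pixels -> their average
    return blocks.max(axis=(1, 3))       # four pixels -> their maximum

frame = np.array([[1., 2., 5., 6.],
                  [3., 4., 7., 8.],
                  [0., 0., 1., 1.],
                  [0., 4., 1., 3.]])
avg = pool2x2(frame, "avg")   # [[2.5, 6.5], [1.0, 1.5]]
mx = pool2x2(frame, "max")    # [[4.0, 8.0], [4.0, 3.0]]
```

In the claimed pipeline the maximum pooling is applied to the already average-pooled picture; here both are applied to the same toy frame only to show the two reduction rules side by side.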
4. The human motion detection method suitable for the mobile robot platform according to claim 2,
in the second step, each target person feature A_1, A_2, …, A_N ∈ R^C is first copied by the person-centered relational network OCR-Net and concatenated to each of the H × W spatial positions of the background environment feature map, forming a series of paired relational feature maps; the first-order OC relational feature F_i of each target person i is then obtained through a convolution operation, and the generated code is output.
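The pairing step of claim 4 can be sketched as follows, assuming a 1 × 1 convolution as the encoding operation (the weight matrix Wconv, the channel sizes, and N = 3 are illustrative choices, not values fixed by the patent):

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
X = rng.standard_normal((C, H, W))               # background environment feature map
A = [rng.standard_normal(C) for _ in range(3)]   # N = 3 target person features

C_out = 6
Wconv = rng.standard_normal((C_out, 2 * C))      # hypothetical 1x1 conv weights, 2C -> C_out

def first_order_oc(A_i, X, Wconv):
    """Copy A_i to every (h, w) position, concatenate with X along the
    channel axis, then encode with a 1x1 convolution -> F_i of shape (C_out, H, W)."""
    C, H, W_ = X.shape
    tiled = np.broadcast_to(A_i[:, None, None], (C, H, W_))
    paired = np.concatenate([tiled, X], axis=0)   # (2C, H, W) paired relational map
    return np.einsum('oc,chw->ohw', Wconv, paired)  # 1x1 convolution as a matmul

F = [first_order_oc(a, X, Wconv) for a in A]
print(F[0].shape)  # (6, 4, 4)
```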
5. The human body motion detection method suitable for the mobile robot platform according to claim 4,
in the third step, the first-order relationship between the target person A_i ∈ R^C and the background environment at spatial position (x, y) is characterized as the feature F_i^(x,y); the high-order relationship between paired OC relations at the same spatial position is then obtained through learning by the high-order relational reasoning operator HRRO.
6. The human motion detection method for a mobile robot platform of claim 5,
7. The human motion detection method for a mobile robot platform of claim 6,
the calculation process of the high-order relational inference operator HRRO is as follows:
a set of first-order OC relational feature maps F_i is taken as the input quantity, and through a two-dimensional convolution operation the second-order OCO relational codes of all target persons are output;
the two-dimensional convolution operation converts F_i into a query value Q_i, a key value K_i, and a result value V_i having the same spatial dimensions as F_i; Q_i, K_i and V_i yield the three attention scores for the attention weight, and the three attention scores at each spatial position are calculated independently;
in formula (1), the attention weight is generated by applying the softmax function to the similarity between the query value Q_i of target person i and the key value K_j of target person j; Q_i^(x,y) denotes the query value at spatial position (x, y), K_j^(x,y) denotes the key value at (x, y), V_j^(x,y) denotes the result value at (x, y), and H_i^(x,y) denotes the result obtained at (x, y) before layer normalization and the dropout mechanism are applied; d denotes the dimension of the feature map and is set to 512.
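Formula (1) itself is not reproduced in the text, but the description matches scaled dot-product attention across persons at each spatial position. A minimal sketch under that reading follows; using identity Q/K/V projections in place of the learned two-dimensional convolutions is an assumption made to keep the sketch short:

```python
import numpy as np

def hrro_attention(F, d=None):
    """At each spatial position (x, y), person i's first-order OC feature
    attends over all persons j: softmax(Q_i . K_j / sqrt(d)) weights V_j,
    producing H_i before layer normalization / dropout are applied.

    F: (N, C, H, W) stack of first-order OC relation features.
    """
    N, C, H, W = F.shape
    d = d or C
    Q = K = V = F  # stand-ins for the learned 2-D conv projections
    # scores[i, j, x, y] = Q_i(x,y) . K_j(x,y) / sqrt(d)
    scores = np.einsum('ichw,jchw->ijhw', Q, K) / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)        # softmax over persons j
    return np.einsum('ijhw,jchw->ichw', attn, V)   # H_i(x,y)

F = np.random.default_rng(1).standard_normal((3, 8, 4, 4))
H_out = hrro_attention(F)
print(H_out.shape)  # (3, 8, 4, 4)
```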
8. The human motion detection method for a mobile robotic platform of claim 7,
in formula (2), ReLU denotes the Rectified Linear Unit correction, which operates as follows: negative input values are corrected to output 0, while positive values are output directly; Dropout denotes a discarding mechanism used to remove invalid information in the image;
conv 2D represents a two-dimensional convolution operation;
Norm denotes a normalization operation, here specifically layer normalization, whose effect is to give the data input to the same layer the same mean and variance.
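The ReLU correction rule and the layer-normalization effect described in claim 8 can be checked numerically (a small NumPy sketch; the composition in formula (2) itself is not shown in the text and is not assumed here):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
relu = np.maximum(x, 0.0)   # negatives corrected to 0, positives pass through
# relu -> [0.0, 0.0, 0.0, 1.5, 3.0]

def layer_norm(v, eps=1e-5):
    """Layer normalization: give the inputs of one layer the same
    mean (0) and variance (1), as described in claim 8."""
    return (v - v.mean()) / np.sqrt(v.var() + eps)

y = layer_norm(relu)   # y now has (approximately) zero mean and unit variance
```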
9. The human motion detection method for a mobile robotic platform of claim 7,
the target person-background environment feature library OCFB is used for storing all background environment information at past and future moments;
first, an independent OCO relational network without any other feature library is pre-trained; this independent network is then used to extract the first-order OC relational feature F_i of each target person in the video clip and store it in the target person-background environment feature library OCFB. To avoid confusion, the first-order relational features stored in the OCFB are redenoted L_i;
the M OC relational features stored in the feature library are extracted from a small time window [t-w, t+w] centered on time t; these are the long-term features, while the short-term feature is the first-order OC relational feature F_i at time t. Here w denotes a non-fixed time length chosen such that one frame of picture is taken before and after time t, so that [t-w, t+w] covers 3 frames in total;
the interaction relationship between the long-term features stored in the target person-background environment feature library OCFB and the short-term feature is calculated by formula (3):
the query value Q_i is still calculated from the short-term feature F_i, while the key value K_i and the result value V_i are calculated from the first-order relational features L_i stored in the OCFB; the concrete formula is shown in (4):
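The feature-library mechanism of claim 9 can be sketched as a per-timestamp store that serves the window [t-w, t+w] (class and method names here are illustrative, not defined by the patent):

```python
import numpy as np

class OCFB:
    """Minimal sketch of the target person-background environment feature
    library: stores first-order OC features L_i keyed by timestamp, then
    serves the window [t - w, t + w] as the long-term features."""
    def __init__(self):
        self.bank = {}            # timestamp -> list of OC relation features

    def store(self, t, feats):
        self.bank.setdefault(t, []).extend(feats)

    def window(self, t, w=1):
        out = []
        for s in range(t - w, t + w + 1):   # w = 1 -> 3 frames in total
            out.extend(self.bank.get(s, []))
        return out

bank = OCFB()
for t in range(5):
    bank.store(t, [np.full(4, float(t))])   # one dummy OC feature per frame

L = bank.window(t=2, w=1)   # long-term features from frames 1, 2, 3
print(len(L))  # 3
```

In formula (4) the short-term feature F_i at time t would supply the query, while the features returned by `window` supply the keys and values.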
10. the human motion detection method of claim 2, wherein the human detector is Faster R-CNN;
the backbone network is I3D;
the key frame is defined as the frame in the video in which a key action of the target's movement or change is located;
action categories include watch, talk, stand and walk;
the fixed-size region-of-interest candidate box is generated by uniformly dividing the key frame into 7 × 7 regions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110659014.4A CN113283381B (en) | 2021-06-15 | 2021-06-15 | Human body action detection method suitable for mobile robot platform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113283381A true CN113283381A (en) | 2021-08-20 |
CN113283381B CN113283381B (en) | 2024-04-05 |
Family
ID=77284429
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110659014.4A Active CN113283381B (en) | 2021-06-15 | 2021-06-15 | Human body action detection method suitable for mobile robot platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113283381B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109492581A (en) * | 2018-11-09 | 2019-03-19 | 中国石油大学(华东) | A kind of human motion recognition method based on TP-STG frame |
CN110765967A (en) * | 2019-10-30 | 2020-02-07 | 腾讯科技(深圳)有限公司 | Action recognition method based on artificial intelligence and related device |
CN111209897A (en) * | 2020-03-09 | 2020-05-29 | 腾讯科技(深圳)有限公司 | Video processing method, device and storage medium |
CN112364757A (en) * | 2020-11-09 | 2021-02-12 | 大连理工大学 | Human body action recognition method based on space-time attention mechanism |
CN112464875A (en) * | 2020-12-09 | 2021-03-09 | 南京大学 | Method and device for detecting human-object interaction relationship in video |
WO2021042547A1 (en) * | 2019-09-04 | 2021-03-11 | 平安科技(深圳)有限公司 | Behavior identification method, device and computer-readable storage medium |
WO2021073311A1 (en) * | 2019-10-15 | 2021-04-22 | 华为技术有限公司 | Image recognition method and apparatus, computer-readable storage medium and chip |
Non-Patent Citations (1)
Title |
---|
谭论正 et al.: "Human body action recognition based on the pLSA model", Journal of National University of Defense Technology, no. 05 *
Also Published As
Publication number | Publication date |
---|---|
CN113283381B (en) | 2024-04-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||