CN111523421A - Multi-user behavior detection method and system based on deep learning and fusion of various interaction information


Info

Publication number
CN111523421A
Authority
CN
China
Prior art keywords
interaction
human
module
video
modeling
Prior art date
Legal status
Granted
Application number
CN202010289689.XA
Other languages
Chinese (zh)
Other versions
CN111523421B (en)
Inventor
汤佳俊
夏锦
牟芯志
庞博
卢策吾
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202010289689.XA
Publication of CN111523421A
Application granted
Publication of CN111523421B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A behavior detection network is trained by constructing a labeled video library as a sample set; the trained network then processes the video to be detected, and the behavior of the target person in each region is detected according to the final output vector. The invention fully accounts for the complexity of human behavior: in addition to a person's own motion, it fuses that person's interactions with other people, with objects and with long-term memory information, which effectively improves the accuracy of video behavior detection.

Description

Multi-user behavior detection method and system based on deep learning and fusion of various interaction information
Technical Field
The invention relates to a technology in the field of artificial-intelligence video recognition, in particular to a multi-person behavior detection method and system based on deep learning that fuses multiple kinds of interaction information.
Background
Computer vision aims to handle various visual tasks with computer programs and often involves multimedia such as images and videos. The convolutional neural network is a deep learning technique widely applied to computer vision tasks: by training the filter parameters of image convolution operations, it obtains deep, robust and fairly general representations. These representations are high-dimensional vectors or matrices and can be used for behavior detection or classification, that is, detecting where people appear in a video and judging the behavior of each person.
Existing behavior detection techniques generally detect human bounding boxes, extract a representation of the video through a three-dimensional convolutional neural network, extract each person's region representation from the video representation by linear interpolation according to the person's bounding box, and finally make the judgment from that region representation. The drawback of this approach is that it only considers the motion of the single person inside the bounding box and does not use the interaction information between that person and other people or objects, so more complex interactive behaviors, such as opening a door, watching television or talking with another person, cannot be detected accurately.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a multi-user behavior detection method and system based on deep learning and fusion of various interaction information.
The invention is realized by the following technical scheme:
the invention relates to a multi-person behavior detection method based on deep learning and fusion of multiple kinds of interaction information: a behavior detection network is trained by constructing a labeled video library as a sample set, the trained network processes the video to be detected, and the behavior of the target person in each region is detected according to the final output vector.
The labeled video library is obtained as follows: the videos in the sample set are labeled at equal intervals, the video size is normalized, and the videos are cut into segments according to the labeled frames, for example: for each labeled frame, taking that frame as the middle frame, the 32 frames before and after it are extracted to obtain a corresponding 64-frame segment.
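Purely as an illustration of this segmentation step, a minimal sketch follows; the helper name build_clip, the use of OpenCV for resizing, the 256 × 464 target size (taken from the embodiment below) and the clamping at video boundaries are all assumptions rather than part of the claimed method.

import numpy as np
import cv2  # OpenCV, used here only to resize frames

def build_clip(frames, key_index, length=64, height=256, width=464):
    # frames: decoded RGB frames of one long video; key_index: index of a labeled frame
    start = key_index - length // 2
    clip = []
    for i in range(start, start + length):
        i = min(max(i, 0), len(frames) - 1)            # clamp at the video boundaries (assumption)
        clip.append(cv2.resize(frames[i], (width, height)))
    return np.stack(clip).astype(np.float32)            # one 64 x 256 x 464 x 3 segment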
The content labeled at equal intervals comprises: the bounding box of each person in the frame and the behaviors that each person performs within the 1.5-second interval before and after the frame.
The bounding boxes are obtained with a range of mature image object detection algorithms such as, but not limited to, Faster R-CNN and YOLO; for each labeled frame, the bounding boxes and the categories of the various objects appearing in the frame are detected at the same time.
The behavior detection network comprises: a three-dimensional convolutional neural network for extracting the video representation, a representation extraction module with a memory pool, a multi-interaction-relationship modeling fusion network, three fully connected layers and a sigmoid regression layer, wherein: the three-dimensional convolutional neural network extracts the video representation from the input video clip and outputs it to the representation extraction module; the representation extraction module uses RoIAlign to linearly interpolate each bounding box region on the video representation and obtains the region representations of people and objects by pooling, and at the same time obtains the memory representation through the memory pool; the multi-interaction-relationship modeling fusion network performs modeling and fusion on the region representations of people and objects and the memory representation to obtain robust behavior representations; and the prediction probability of each category is obtained through the fully connected layers and the sigmoid regression layer.
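For orientation only, the following PyTorch-style sketch shows how these pieces could be wired together; every class and argument name here is invented for illustration, and the 1024-dimensional features and 80 output classes are taken from the embodiment rather than fixed by the invention.

import torch
import torch.nn as nn

class BehaviorDetectionNet(nn.Module):
    def __init__(self, backbone3d, feat_extractor, interaction_fusion,
                 feat_dim=1024, num_classes=80):
        super().__init__()
        self.backbone3d = backbone3d          # 3D CNN extracting the video representation
        self.feat_extractor = feat_extractor  # RoIAlign + pooling + memory pool
        self.fusion = interaction_fusion      # multi-interaction-relationship modeling fusion network
        self.head = nn.Sequential(            # two hidden layers and an output layer
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_classes))

    def forward(self, clip, person_boxes, object_boxes, clip_id):
        video_feat = self.backbone3d(clip)                          # video representation
        person_feat, object_feat, memory_feat = self.feat_extractor(
            video_feat, person_boxes, object_boxes, clip_id)        # region and memory representations
        behavior_feat = self.fusion(person_feat, object_feat, memory_feat)
        return torch.sigmoid(self.head(behavior_feat))              # per-person class probabilities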
The three-dimensional convolutional neural network adopts, but is not limited to, a commonly used video representation extraction network such as an I3D network, a SlowFast network or a C3D network.
Depending on the content of each bounding box region, the representation extraction module obtains either the region representation of a person or the region representation of an object.
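A possible sketch of this region extraction with torchvision's RoIAlign is given below; the temporal average pooling applied before the 2D RoIAlign, the 7 × 7 sampling size, the max pooling, and the assumed input width of 464 used to derive the spatial scale are illustrative choices, not requirements of the invention.

import torch
from torchvision.ops import roi_align

def region_features(video_feat, boxes, input_width=464, out_size=7):
    # video_feat: 1 x C x T x H x W backbone output; boxes: K x 4 boxes in input-image coordinates
    feat2d = video_feat.mean(dim=2)                        # average over time -> 1 x C x H x W
    scale = feat2d.shape[-1] / float(input_width)          # feature-map scale w.r.t. the input width
    rois = roi_align(feat2d, [boxes], output_size=out_size,
                     spatial_scale=scale, aligned=True)    # K x C x 7 x 7 interpolated regions
    return rois.amax(dim=(2, 3))                           # pool each region into one C-dim vector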
From the person region representations among the region representations of people and objects in each video segment, the memory pool obtains the memory representation by concatenating the person region representations of the historical segments of the current segment.
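The memory pool can be pictured as a small store keyed by video and segment index, as in the sketch below; the window of 30 historical segments, the fixed budget of 5 persons per segment and the zero-padding of missing segments are values taken from the embodiment and are used here only for illustration.

import torch

class MemoryPool:
    def __init__(self, feat_dim=2304, persons_per_clip=5, window=30):
        self.store = {}                       # (video_id, segment_index) -> person region features
        self.feat_dim = feat_dim
        self.persons_per_clip = persons_per_clip
        self.window = window

    def update(self, video_id, seg_idx, person_feat):
        self.store[(video_id, seg_idx)] = person_feat.detach()   # replace any older entry

    def read(self, video_id, seg_idx):
        chunks = []
        for i in range(seg_idx - self.window, seg_idx):
            feat = self.store.get((video_id, i))
            if feat is None:
                feat = torch.zeros(0, self.feat_dim)              # missing history -> zeros below
            keep = feat[:self.persons_per_clip]
            pad = torch.zeros(self.persons_per_clip - keep.size(0), self.feat_dim)
            chunks.append(torch.cat([keep, pad], dim=0))
        return torch.cat(chunks, dim=0)        # e.g. 30 x 5 = 150 rows of 2304-dimensional features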
The multi-interaction-relationship modeling fusion network comprises: two human-human interaction modeling modules that receive the person region representations, two person-object interaction modeling modules that receive the person region representations and the object region representations, and two person-memory interaction modeling modules that receive the person region representations and the memory representation, wherein: the first human-human interaction modeling module, the first person-object interaction modeling module, the first person-memory interaction modeling module, the second human-human interaction modeling module, the second person-object interaction modeling module and the second person-memory interaction modeling module are connected in sequence and pass on progressively enhanced person region representations; each interaction modeling module models one of the interaction relationships among human-human interaction, person-object interaction and person-memory interaction, fuses it into the person region representations and passes the result to the next module; the finally output person region representation comprehensively fuses the human-human, person-object and person-memory interaction relationships, i.e., it is the finally output robust behavior representation.
The human-human interaction is: the interaction between different persons in the same video segment.
The person-object interaction is: the interaction between a person and an object in the same video segment.
The person-memory interaction is: the interaction between a person in the current segment and the persons in longer-history neighboring segments.
The modeling means that:
f(Q, K) = softmax((Q·W_Q)(K·W_K1)^T / √d) · (K·W_K2) · W_O
wherein: Q and K are the two input representations, W_Q, W_K1, W_K2 and W_O are the weights of fully connected layers, and d is the dimension of K·W_K1.
According to the input representation K, the module handles different interaction relationships: K can be the person region representations, the object region representations or the memory representation, and the corresponding modeling module accordingly handles human-human interaction, person-object interaction or person-memory interaction and outputs a representation fused with that type of interaction information; when the six modules are connected in series, the output of the preceding modeling module is used as the Q input of the next one, so that several different interaction relationships are finally fused.
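A compact PyTorch-style sketch of one interaction modeling module and of the serial chain of six such modules follows; the exact attention form is inferred from the definitions of Q, K, W_Q, W_K1, W_K2, W_O and d given above, and the residual addition used here to fuse each interaction back into the person representations is an assumption.

import math
import torch
import torch.nn as nn

class InteractionBlock(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k1 = nn.Linear(dim, dim, bias=False)
        self.w_k2 = nn.Linear(dim, dim, bias=False)
        self.w_o = nn.Linear(dim, dim, bias=False)

    def forward(self, q, k):
        # q: person representations (N x d); k: person, object or memory representations (N' x d)
        k1 = self.w_k1(k)
        attn = torch.softmax(self.w_q(q) @ k1.t() / math.sqrt(k1.size(-1)), dim=-1)
        return self.w_o(attn @ self.w_k2(k))   # person representation enhanced with this interaction

class MultiInteractionFusion(nn.Module):
    # six blocks in series (person-person, person-object, person-memory, repeated twice);
    # each block's output is fed, via an assumed residual sum, as the Q of the next block
    def __init__(self, dim=1024):
        super().__init__()
        self.blocks = nn.ModuleList([InteractionBlock(dim) for _ in range(6)])

    def forward(self, person, objects, memory):
        keys = [person, objects, memory, person, objects, memory]
        q = person
        for block, k in zip(self.blocks, keys):
            q = q + block(q, k)                # fuse this interaction into the person representations
        return q                               # robust behavior representation, N x dim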
The three fully-connected layers include two hidden layers and an output layer.
The sigmoid regression layer comprises a sigmoid function and a cross-entropy loss function; the output vector of the output layer is passed through the sigmoid function to obtain the prediction probability of each category, and the cross-entropy loss function is used to train the whole network.
The training is as follows: the samples in the sample set, the corresponding object bounding boxes, and the person region representations of neighboring video segments stored in the memory pool of the representation extraction module are taken as the input of the behavior detection network; the network parameters are adjusted with the cross-entropy loss function combined with the back-propagation (BP) algorithm, and the person region representations of the current video segment are updated into the memory pool.
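A minimal sketch of one such training step is shown below, where model stands for the behavior detection network sketched earlier (returning sigmoid probabilities) and the choice of optimizer is an assumption; in the real system the optimizer would cover all of the trainable parameters, not only the classification head.

import torch
import torch.nn as nn

def train_step(model, optimizer, clip, person_boxes, object_boxes, clip_id, labels):
    # labels: N x C multi-hot targets for the N persons annotated in the clip
    probs = model(clip, person_boxes, object_boxes, clip_id)          # N x C sigmoid probabilities
    loss = nn.functional.binary_cross_entropy(probs, labels.float())  # cross-entropy loss
    optimizer.zero_grad()
    loss.backward()                                                   # back-propagation (BP) of the loss
    optimizer.step()                                                  # adjust the network parameters
    return loss.item()

# one assumed way to drive it:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)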
The processing of the video to be detected comprises: inputting the video to be detected into the object detection algorithm and the trained behavior detection network, and obtaining the final prediction probability of each behavior from the sigmoid regression layer.
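To make the final decision step concrete, a small sketch of turning the per-person probability matrix into behavior judgments follows; the 0.5 threshold is merely the value selected in the embodiment and the function name is invented.

import torch

def decide_behaviors(probs, threshold=0.5):
    # probs: N x C matrix, one row of class probabilities per detected person
    hits = probs > threshold                   # a behavior is judged to occur above the threshold
    return [torch.nonzero(row, as_tuple=True)[0].tolist() for row in hits]  # class indices per person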
Technical effects
The invention solves the technical problem of detecting the behavior of every person appearing in a long video, namely: for the people appearing in a given frame of the video, the bounding box of each person and the behaviors each person performs in a short time before and after that frame must be given. Compared with the prior art, the invention fully accounts for the complexity of human behavior: besides a person's own motion, it fuses that person's interactions with other people, with objects and with long-term memory information, which effectively improves the accuracy of video behavior detection.
Drawings
FIG. 1 is a flow chart of the network training of the present invention;
FIG. 2 is a flow chart of testing a video under test according to the present invention;
FIG. 3 is a schematic diagram of an interactive modeling module of the present invention;
in the figure: n represents the number of people in the video clip, and N' represents the number of interactive objects in the video clip, namely the number of regional representations of people or the number of regional representations of objects or the number of all people in the memory representations;
FIG. 4 is a schematic diagram of a multiple interaction relationship modeling fusion network according to the present invention;
in the figure, each small rectangle represents an interaction modeling module, the input at the left side is Q, the input at the lower side is K, and different interactions are modeled according to different K.
Detailed Description
The embodiment relates to a multi-person behavior detection system based on deep learning and fusion of multiple kinds of interaction information, which comprises: a training sample acquisition module, an object detection module and a behavior detection network module fusing multiple interactions, wherein: the samples from the training sample acquisition module and the object detection boxes from the object detection module serve as the input of the behavior detection network module; through training, the behavior detection network learns to model the region representations and memory representations of people and objects from their bounding box regions and performs multi-label classification on these representations; the object detection module detects the people and objects in the video to be tested, and the behavior detection network module then performs test inference according to the detection results to obtain the judgment of each person's behavior in the video.
The behavior detection network comprises: a three-dimensional convolutional neural network for extracting the video representation, a representation extraction module, a multi-interaction-relationship modeling fusion network, three fully connected layers and a sigmoid regression layer, wherein: the three-dimensional convolutional neural network extracts the video representation from the input video clip and outputs it to the representation extraction module; the representation extraction module uses RoIAlign to linearly interpolate each bounding box region on the video representation and obtains the region representations of people and objects by pooling, while the memory pool in the representation extraction module provides the memory representation; the multi-interaction-relationship modeling fusion network performs the corresponding modeling and fusion on the region representations of people and objects and the memory representation to obtain robust behavior representations; and the prediction probability of each category is obtained through the fully connected layers and the sigmoid regression layer.
As shown in fig. 1, the behavior detection network specifically implements training through the following steps:
step 1, initializing a three-dimensional convolution neural network, and initializing by using weights pre-trained on other video behavior classification data sets.
In this embodiment, a SlowFast network is adopted as the specific structure of the three-dimensional convolutional neural network; the network is first pre-trained on the Kinetics behavior classification dataset, and the pre-trained weights are used to initialize the three-dimensional convolutional structure. The remaining parameters of the behavior detection network are initialized with small random numbers.
Other scenarios may use certain behavioral classification datasets for pre-training, such as Kinetics, UCF, HMDB, etc.
Step 2, initializing a memory pool which is arranged in the representation extraction module and used for providing a long-term memory representation of a video clip: in the initial phase of training, initialization is performed using a vector of all zeros.
And 3, data processing and reading:
step 3.1: long videos of different scenes are collected and labeled at one second intervals. After a certain frame is labeled, labeling is performed on the next frame which is one second away from the certain frame, and so on. The content of the annotation includes the bounding box of all people on the image of the frame and the behavior category of each person occurring within 1.5 seconds before and after the frame. In this embodiment, an ava (atomic Visual action) data set is used as the data set for verifying the validity of the method of the present invention.
Step 3.2: for each labeled frame, an object detection algorithm is run on the frame to detect the common object classes appearing in it, excluding people. In this embodiment, the Faster R-CNN algorithm is adopted as the object detection module.
Step 3.3: for each frame with labels, a video clip of 64 frames before and after the frame is extracted and normalized to 256 × 464 (height × width), and the video clip input into the behavior detection network is a tensor of 64 × 256 × 464 × 3, where 3 is an RGB color channel.
Step 3.4: and randomly disordering all video clips in the sample set to increase the randomness during training. The sample set contains multiple long videos, so the video segments used in training may come from different long videos. A video segment is randomly drawn from the sample set for training in each iteration.
Step 3.5: and inputting the video clip and the boundary box of the person on the corresponding intermediate frame, the behavior category of the person and the boundary box of the object detected by the detection algorithm into the behavior detection network.
Step 4, training iteration:
step 4.1: inputting the video clips randomly selected from the sample set into a three-dimensional convolution network to obtain a tensor characterized by 16 multiplied by 29 multiplied by 2304 (height multiplied by width multiplied by depth) of the whole video clip; and (3) interpolating on the representation of the video segment to obtain a tensor of 7 multiplied by 2304 by using a representation extraction module according to the bounding box of the people and the objects, and further pooling to obtain a vector with 2304 dimensions representing the region representation of each people or the objects.
Step 4.2: the person region representations of the current segment obtained in step 4.1 are updated into the memory pool, with a check: if the memory pool holds no representation for this segment, it is stored directly; otherwise the old representation of this segment in the memory pool is deleted and replaced by the representation extracted in this iteration.
Step 4.3: the representations of historical segments are read from the memory pool to form the memory representation: the memory pool holds person region representations of video clips from different long videos; the person region representations of the 30 video clips that belong to the same long video as the current clip and lie within the 30 seconds before it are read from the memory pool, and all of them are concatenated to form the memory representation of the segment. With 5 person region representations per video clip, 30 × 5 = 150, so the memory representation is a 150 × 2304 tensor.
Step 4.4: the region representations of people, the region representations of objects and the memory representation are input into the multi-interaction modeling fusion network; each module in the network models a different interaction, as shown in fig. 3, specifically:
f(Q, K) = softmax((Q·W_Q)(K·W_K1)^T / √d) · (K·W_K2) · W_O
wherein: Q and K are the two input representations, W_Q, W_K1, W_K2 and W_O are the weights of fully connected layers with dimensions of 1024 × 1024, and d is the dimension of K·W_K1. Each type of representation input into this structure is first reduced to 1024 dimensions by a fully connected layer, and the multi-interaction-relationship modeling fusion network finally outputs the behavior representations of the multiple persons, an N × 1024 tensor, where N is the number of persons in the segment.
Step 4.5: the behavior representations of the multiple persons obtained in step 4.4 are input into the three fully connected layers, namely two hidden layers and an output layer, and the value of the loss function is obtained through the sigmoid regression layer; the weights of the two hidden layers are 1024 × 1024 and the weight of the output layer is 1024 × C, where C is the total number of behavior categories (80 for the AVA dataset). The whole behavior detection network is optimized with the BP algorithm according to the loss function.
The optimized parameters comprise: the parameters of the three-dimensional convolutional neural network, the parameters of each interaction modeling module in the multi-interaction-relationship modeling fusion network, and the parameters of the three fully connected layers.
Step 5: when the optimization of step 4.5 reaches the maximum number of iterations, training terminates; otherwise the procedure returns to step 4.1 to continue the training iterations.
As shown in fig. 2, the test inference comprises the following steps:
step i: and acquiring the video to be detected.
Step ii: segmenting and normalizing the video to be detected: video segments containing 64 frames are continuously extracted from the video to be tested, each segment is normalized to 256 × 464 (height × width), and the starting time interval between the next segment and the previous segment is one second. And (5) sequentially inputting the video clips into the behavior detection network trained in the step 5 according to the time sequence.
Step iii, performing test inference on the video to be tested, specifically comprising:
step a: and reading the processed video segment from the data input module, and running an object detection algorithm on an intermediate frame of the segment to detect people and common objects appearing in the frame. In this embodiment, the fast R-convolutional neural network algorithm is adopted as the object detection module in this embodiment.
Step b: the video segment is input into the three-dimensional convolutional network to obtain its representation, a 16 × 29 × 2304 (height × width × depth) tensor. According to the bounding boxes of people and objects, the representation extraction module uses RoIAlign to interpolate on this representation to obtain a 7 × 2304 tensor per box, which is further pooled into a 2304-dimensional vector as the region representation of each person or object.
Step c: and saving the regional characterization of the people in the segment into a memory pool.
Step d: the person region representations of the 30 video segments tested before this one are read from the memory pool and concatenated to form the memory representation of the current segment. If the representation of some earlier segment does not exist, or does not belong to the same video under test as the current segment, zero vectors are used instead. Assuming 5 person region representations per video segment, 30 × 5 = 150, so the memory representation is a 150 × 2304 tensor.
Step e: and inputting the region representation of the person, the region representation of the object and the memory representation into the multi-interaction relation modeling fusion network. The output obtains the behavior representation of each person in the segment, the dimensionality is N multiplied by 1024, and N represents the number of the persons in the segment.
Step f: the behavior representations pass through the three fully connected layers and the sigmoid function layer to obtain the probability that each person performs each behavior, an N × C matrix, where N is the number of persons in the segment and C is the number of behavior classes (80 in the AVA dataset). Each value in the matrix is a number between 0 and 1 representing the judged probability that a person performs a certain behavior in the segment; when the probability is greater than the threshold, that behavior is judged to occur, otherwise it is not. Several behaviors may occur for the same person at the same time.
Step iv: and when the last segment of the video to be tested is processed, ending or processing other videos to be tested, or returning to the step 3 to continue processing the next video segment.
In this embodiment, verification is performed on the validation videos of the AVA dataset, and the performance evaluation data of the method under the different thresholds of test step f are shown in Table 1. The evaluation criterion is that a detection result is considered correct when its behavior class matches a labeled behavior class and the IoU (Intersection over Union) of the two bounding boxes is greater than 0.5; the IoU is calculated as the ratio of the area of the intersection of the two boxes to the area of their union.
TABLE 1 Test performance evaluation data
Threshold    0.3       0.4       0.5       0.6
Recall       42.45%    33.85%    26.76%    20.82%
Precision    31.61%    42.94%    56.63%    72.84%
The threshold in Table 1 means that a behavior is judged to occur when its predicted probability, obtained while detecting the video under test, is greater than the threshold. The quality of this kind of detection task is judged by two indicators, recall and precision:
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
wherein TP is the number of true positives, FN is the number of false negatives and FP is the number of false positives; recall is thus the proportion of the samples that actually contain a certain behavior which are predicted to contain it, and precision is the proportion of the samples predicted to contain a certain behavior that actually do contain it.
Considering recall and precision together and combining the analysis of the visual results, the threshold selected in this embodiment is 0.5. Under this threshold, the detection performance on all 80 classes of the AVA validation videos is: recall 26.76%, precision 42.94%. The detection performance on the 10 most common classes of the AVA validation videos is: recall 63.64%, precision 76.51%, that is: among 10000 persons whose behaviors are to be detected, the bounding boxes and behaviors of 6364 of them are correctly detected; and among 10000 detections, 7651 have the correct bounding box and behavior.
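For reference, a short sketch of the evaluation rule described above is given here: a detection counts as a true positive when its class matches a label and the box IoU exceeds 0.5, and recall and precision then follow from the TP, FN and FP counts; the corner-coordinate box format and the helper names are assumptions.

def iou(box_a, box_b):
    # boxes given as (x1, y1, x2, y2) corner coordinates
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0   # intersection over union

def recall_precision(tp, fn, fp):
    recall = tp / (tp + fn) if tp + fn else 0.0       # share of labeled behaviors that were detected
    precision = tp / (tp + fp) if tp + fp else 0.0    # share of detections that were correct
    return recall, precision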
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (11)

1. A multi-person behavior detection method based on deep learning and fusion of multiple kinds of interaction information, characterized in that a behavior detection network is trained by constructing a labeled video library as a sample set, the trained network processes the video to be detected, and the behavior of the target person in each region is detected according to the final output vector;
the behavior detection network comprises: a three-dimensional convolutional neural network for extracting the video representation, a representation extraction module with a memory pool, a multi-interaction-relationship modeling fusion network, three fully connected layers and a sigmoid regression layer, wherein: the three-dimensional convolutional neural network extracts the video representation from the input video clip and outputs it to the representation extraction module; the representation extraction module uses RoIAlign to linearly interpolate each bounding box region on the video representation and obtains the region representations of people and objects by pooling, and at the same time obtains the memory representation through the memory pool; the multi-interaction-relationship modeling fusion network performs modeling and fusion on the region representations of people and objects and the memory representation to obtain robust behavior representations; and the prediction probability of each category is obtained through the fully connected layers and the sigmoid regression layer.
2. The method of claim 1, wherein the tagged video library is obtained by: and after the videos in the sample set are labeled at equal intervals, normalizing the sizes of the videos, and cutting the videos into a plurality of segments according to each labeled frame.
3. The method of claim 1, wherein said equally spaced annotations comprise: the bounding box for each person in the frame and the behavior that each person has individually occurred during the 1.5 second time interval before and after the frame.
4. The method of claim 1, wherein the three-dimensional convolutional neural network is selected from the group consisting of an I3D network, a SlowFast network, and a C3D network.
5. The method of claim 1, wherein, from the person region representations among the region representations of people and objects in each video segment, the memory pool obtains the memory representation by concatenating the person region representations of the historical segments of the current segment.
6. The method of claim 1, wherein the multi-interaction-relationship modeling fusion network comprises: two human-human interaction modeling modules that receive the person region representations, two person-object interaction modeling modules that receive the person region representations and the object region representations, and two person-memory interaction modeling modules that receive the person region representations and the memory representation, wherein: the first human-human interaction modeling module, the first person-object interaction modeling module, the first person-memory interaction modeling module, the second human-human interaction modeling module, the second person-object interaction modeling module and the second person-memory interaction modeling module are connected in sequence and pass on progressively enhanced person region representations; each interaction modeling module models one of the interaction relationships among human-human interaction, person-object interaction and person-memory interaction, fuses it into the person region representations and passes the result to the next module; the finally output person region representation comprehensively fuses the human-human, person-object and person-memory interaction relationships, i.e., it is the finally output robust behavior representation;
the human-human interaction is: the interaction between different persons in the same video segment;
the person-object interaction is: the interaction between a person and an object in the same video segment;
the person-memory interaction is: the interaction between a person in the current segment and the persons in longer-history neighboring segments.
7. The method of claim 6, wherein said modeling is by:
f(Q, K) = softmax((Q·W_Q)(K·W_K1)^T / √d) · (K·W_K2) · W_O
wherein: Q and K are the two input representations, W_Q, W_K1, W_K2 and W_O are the weights of fully connected layers, and d is the dimension of K·W_K1;
according to the input representation K, the module handles different interaction relationships: K can be the person region representations, the object region representations or the memory representation, and the corresponding modeling module accordingly handles human-human interaction, person-object interaction or person-memory interaction and outputs a representation fused with that type of interaction information; when the six modules are connected in series, the output of the preceding modeling module is used as the Q input of the next one, so that several different interaction relationships are finally fused.
8. The method of claim 1, wherein the three fully-connected layers include two hidden layers and an output layer.
9. The method as claimed in claim 1, wherein the sigmoid regression layer comprises a sigmoid function and a cross entropy loss function, the output vector of the output layer can obtain the prediction probability of each category through the sigmoid layer, and the cross entropy loss function is used for training the whole network.
10. The method of claim 1, wherein the training is: taking the samples in the sample set, the corresponding object boundary frames and the human region representation of the adjacent video segments in the memory pool arranged in the representation extraction module as the input of the behavior detection network, adjusting network parameters by adopting a cross entropy loss function and combining a back propagation BP algorithm, and updating the human region representation in the video segments into the memory pool.
11. A multi-person behavior detection system according to the method of any one of claims 1 to 10, comprising: a training sample acquisition module, an object detection module and a behavior detection network module fusing multiple interactions, wherein: the samples from the training sample acquisition module and the object detection boxes from the object detection module serve as the input of the behavior detection network module; through training, the behavior detection network learns to model the region representations and memory representations of people and objects from their bounding box regions and performs multi-label classification on these representations; the object detection module detects the people and objects in the video to be tested, and the behavior detection network module then performs test inference according to the detection results to obtain the judgment of each person's behavior in the video.
CN202010289689.XA 2020-04-14 2020-04-14 Multi-person behavior detection method and system based on deep learning fusion of various interaction information Active CN111523421B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010289689.XA CN111523421B (en) 2020-04-14 2020-04-14 Multi-person behavior detection method and system based on deep learning fusion of various interaction information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010289689.XA CN111523421B (en) 2020-04-14 2020-04-14 Multi-person behavior detection method and system based on deep learning fusion of various interaction information

Publications (2)

Publication Number Publication Date
CN111523421A true CN111523421A (en) 2020-08-11
CN111523421B CN111523421B (en) 2023-05-19

Family

ID=71902656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010289689.XA Active CN111523421B (en) 2020-04-14 2020-04-14 Multi-person behavior detection method and system based on deep learning fusion of various interaction information

Country Status (1)

Country Link
CN (1) CN111523421B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183313A (en) * 2020-09-27 2021-01-05 武汉大学 SlowFast-based power operation field action identification method
CN112861848A (en) * 2020-12-18 2021-05-28 上海交通大学 Visual relation detection method and system based on known action conditions
CN113378676A (en) * 2021-06-01 2021-09-10 上海大学 Method for detecting figure interaction in image based on multi-feature fusion
CN114359791A (en) * 2021-12-16 2022-04-15 北京信智文科技有限公司 Group macaque appetite detection method based on Yolo v5 network and SlowFast network
CN114764899A (en) * 2022-04-12 2022-07-19 华南理工大学 Method for predicting next interactive object based on transform first visual angle

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230017072A1 (en) * 2021-07-08 2023-01-19 Google Llc Systems And Methods For Improved Video Understanding


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018137357A1 (en) * 2017-01-24 2018-08-02 北京大学 Target detection performance optimization method
CN107506712A (en) * 2017-08-15 2017-12-22 成都考拉悠然科技有限公司 Method for distinguishing is known in a kind of human behavior based on 3D depth convolutional networks
CN109409307A (en) * 2018-11-02 2019-03-01 深圳龙岗智能视听研究院 A kind of Online Video behavioral value system and method based on space-time contextual analysis
CN110135319A (en) * 2019-05-09 2019-08-16 广州大学 A kind of anomaly detection method and its system
CN110378233A (en) * 2019-06-20 2019-10-25 上海交通大学 A kind of double branch's method for detecting abnormality based on crowd behaviour priori knowledge
CN110598598A (en) * 2019-08-30 2019-12-20 西安理工大学 Double-current convolution neural network human behavior identification method based on finite sample set

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
魏鹏 (Wei Peng): "Two-person interaction behavior recognition based on fusion of RGB and depth information", Master's thesis, Liaoning Shihua University *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183313A (en) * 2020-09-27 2021-01-05 武汉大学 SlowFast-based power operation field action identification method
CN112861848A (en) * 2020-12-18 2021-05-28 上海交通大学 Visual relation detection method and system based on known action conditions
CN112861848B (en) * 2020-12-18 2022-04-08 上海交通大学 Visual relation detection method and system based on known action conditions
CN113378676A (en) * 2021-06-01 2021-09-10 上海大学 Method for detecting figure interaction in image based on multi-feature fusion
CN114359791A (en) * 2021-12-16 2022-04-15 北京信智文科技有限公司 Group macaque appetite detection method based on Yolo v5 network and SlowFast network
CN114359791B (en) * 2021-12-16 2023-08-01 北京信智文科技有限公司 Group macaque appetite detection method based on Yolo v5 network and SlowFast network
CN114764899A (en) * 2022-04-12 2022-07-19 华南理工大学 Method for predicting next interactive object based on transform first visual angle
CN114764899B (en) * 2022-04-12 2024-03-22 华南理工大学 Method for predicting next interaction object based on transformation first view angle

Also Published As

Publication number Publication date
CN111523421B (en) 2023-05-19

Similar Documents

Publication Publication Date Title
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
CN111523421B (en) Multi-person behavior detection method and system based on deep learning fusion of various interaction information
CN110728209B (en) Gesture recognition method and device, electronic equipment and storage medium
CN110717431B (en) Fine-grained visual question and answer method combined with multi-view attention mechanism
US11640714B2 (en) Video panoptic segmentation
CN110929622A (en) Video classification method, model training method, device, equipment and storage medium
US11544510B2 (en) System and method for multi-modal image classification
CN112861917B (en) Weak supervision target detection method based on image attribute learning
CN110378911B (en) Weak supervision image semantic segmentation method based on candidate region and neighborhood classifier
US11501110B2 (en) Descriptor learning method for the detection and location of objects in a video
CN112580458B (en) Facial expression recognition method, device, equipment and storage medium
CN111738054A (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
Gorokhovatskyi et al. Explanation of CNN image classifiers with hiding parts
CN112257665A (en) Image content recognition method, image recognition model training method, and medium
CN113240033B (en) Visual relation detection method and device based on scene graph high-order semantic structure
CN111985333B (en) Behavior detection method based on graph structure information interaction enhancement and electronic device
CN115410059B (en) Remote sensing image part supervision change detection method and device based on contrast loss
CN113537206A (en) Pushed data detection method and device, computer equipment and storage medium
CN116110005A (en) Crowd behavior attribute counting method, system and product
CN113610080B (en) Cross-modal perception-based sensitive image identification method, device, equipment and medium
CN115410131A (en) Method for intelligently classifying short videos
CN114022698A (en) Multi-tag behavior identification method and device based on binary tree structure
CN111242114B (en) Character recognition method and device
CN114239569A (en) Analysis method and device for evaluation text and computer readable storage medium
CN114255377A (en) Differential commodity detection and classification method for intelligent container

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant