CN112131944B - Video behavior recognition method and system - Google Patents

Video behavior recognition method and system

Info

Publication number
CN112131944B
CN112131944B · Application CN202010845486.4A
Authority
CN
China
Prior art keywords
roi
target object
behavior
video
behavior recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010845486.4A
Other languages
Chinese (zh)
Other versions
CN112131944A (en)
Inventor
李岩山
刘燕
谢维信
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University
Priority to CN202010845486.4A
Publication of CN112131944A
Application granted
Publication of CN112131944B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Psychiatry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video behavior recognition method and system that perform multi-stage feature extraction on a video to be recognized; initially detect the ROI of each target object with a deep fully convolutional network; fine-tune the ROIs with a Markov random field to obtain the final ROI set of the target objects; and, based on that final ROI set, perform single-person behavior recognition and group behavior recognition simultaneously. The invention considers both the consistency of timing information within a group and the differences in individual timing information. Single-person behavior recognition based on ROI temporal reasoning helps extract more discriminative single-person behavior features and improves recognition accuracy, while the ROI-matching recurrent convolutional network fuses and propagates the information of individual ROIs in the time domain, providing an effective approach to video behavior recognition.

Description

Video behavior recognition method and system
Technical Field
The invention relates to the technical field of behavior recognition, in particular to a video behavior recognition method and system.
Background
In recent years, behavior recognition algorithms have developed rapidly, and group behavior recognition based on deep learning has achieved good results. Deep-learning methods currently deliver strong recognition performance on group behavior recognition. However, existing studies treat group videos as a whole and neglect individual behavior recognition, which is just as important as group behavior. A group behavior is not a simple superposition of individual behaviors: a specific group behavior is jointly defined by the timing information of individual behaviors and the interactions among individuals. Extracting only group features while ignoring single-person behaviors hinders the generation of single-person features, prevents the timing and context information among individuals from being fully exploited for group behavior features, and fails to meet the structural requirements of single-person behaviors. Existing algorithms suffer seriously from these problems; because single-person timing and context information is not fully considered, recognition accuracy is limited.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the defect of prior-art video behavior recognition methods that single-person timing information and context information are not fully considered, which limits recognition accuracy, and accordingly to provide a video behavior recognition method and system.
In order to achieve the above purpose, the present invention provides the following technical solutions:
In a first aspect, an embodiment of the present invention provides a video behavior recognition method, comprising the steps of:
performing multi-stage feature extraction on the video to be identified;
initially detecting the ROI of the target object using a deep fully convolutional network;
fine-tuning the ROI using a Markov random field to obtain the final ROI set of the target object;
performing single-person behavior recognition and group behavior recognition simultaneously based on the final ROI set of the target object: for single-person behavior recognition, performing temporal reasoning on the ROI time series of the target object and obtaining the single-person behavior prediction through two fully connected layers and a Softmax layer; for group behavior recognition, performing temporal modeling of the group behavior with the ROI-matching recurrent convolutional network to generate the group behavior prediction.
In one embodiment, the process of performing multi-stage feature extraction on the video to be identified includes:
concatenating a plurality of intermediate feature maps of the video to be identified using a multi-stage fully convolutional network to generate dense features;
scaling the dense features by bilinear interpolation to a fixed size $H \times W$, where H is the pixel height and W is the pixel width.
In one embodiment, the process of initially detecting the ROI of the target object using the deep fully convolutional network comprises:
performing target-object detection on the video to be identified with the deep fully convolutional network, taking each target-object area as an ROI, and generating a set of ROI coordinates with corresponding confidence scores;
on the premise that the dense features F are output by the multi-stage feature extraction stage, generating a dense feature map B and a dense feature map P for the single-person target region, where the dense feature map B encodes the ROI coordinates of each person in the scene relative to positions in the image, and the dense feature map P encodes the probability that the part of the image containing the ROI is a target object.
In one embodiment, the process of obtaining the initial feature A of the video to be identified comprises: after multi-stage feature extraction of the video to be identified, the ROI is initially detected using the deep fully convolutional network, and fine tuning of the ROI is performed using a Markov random field to obtain the final ROI set as the initial feature A.
In one embodiment, the process of fine tuning the ROI using a Markov random field to obtain a set of ROIs for a final target object comprises:
converting the dense feature map B into global image coordinates to obtain a dense bounding-box feature map $\hat{B}$, defining a Markov random field on $\hat{B}$, and introducing, for each hypothesized coordinate, two hidden variables with Gaussian potentials: the true coordinate $X_i$ and the assignment $A_i$;
encoding the true coordinates of the detections of the target object as $X_i$, letting $A_i$ assign each detection $b_i$ to its corresponding hypothesized coordinate, and defining the joint distribution of $(X, A)$ as formula (1):
$p(X, A \mid B) \propto \prod_i \mathcal{N}(b_i;\, X_{A_i},\, \sigma)\, p(A_i) \qquad (1)$
wherein σ is a fixed standard-deviation parameter;
generating the bounding-box coordinate prediction of the target-object ROI by modeling with formula (1), each position coordinate $b_i$ on the feature map F belonging to a true detection coordinate j;
computing, by mean-field approximation, the marginal distributions of $A_i$ and $X_i$ through the factorized distribution of formula (2):
$Q(X, A) = \prod_j \mathcal{N}(X_j;\, \mu_j,\, \sigma) \prod_i \mathrm{Cat}(A_i;\, \eta_i) \qquad (2)$
wherein $\mu_j$ and $\eta_i$ are the variational parameters of the Gaussian and categorical distributions, respectively, and Cat denotes a categorical distribution;
minimizing the KL divergence between the factorized distribution of formula (2) and the joint distribution, so that the parameters of the marginal distribution Q(·) perform the fixed-point update of formula (3):
$\eta_{ij}^{(m)} \propto \exp\!\big(-\|b_i - \mu_j^{(m)}\|^2 / 2\sigma^2\big), \qquad \hat{\mu}_j^{(m+1)} = \frac{\sum_i \eta_{ij}^{(m)} b_i}{\sum_i \eta_{ij}^{(m)}} \qquad (3)$
wherein m is the iteration index, the assignments being re-parameterized to give $\eta_{ij}$, starting from an initial value $\mu^{0}$ until formula (3) reaches convergence; to stabilize the variational parameters $\mu_j$ of the Gaussian distribution over the iterations m, performing the smoothed update of formula (4):
$\mu_j^{(m+1)} = \lambda\, \mu_j^{(m)} + (1 - \lambda)\, \hat{\mu}_j^{(m+1)} \qquad (4)$
wherein λ is a damping parameter;
iterating with a preset iteration scheme until all coordinates are allocated, using the number of allocated coordinates as a confidence score, retaining the ROI coordinates whose confidence score is greater than a preset threshold, and obtaining N groups of reliable detection coordinates as the final ROI set.
In one embodiment, for single-person behavior recognition, the process of performing temporal reasoning on the ROI time series of the target object and obtaining the single-person behavior prediction through two fully connected layers and a Softmax layer comprises:
setting a main region r containing the ROI of the target object to be identified, and simultaneously setting two secondary regions S as context clues for reasoning about the behavior of the main region r;
based on the ROI set of each image in the video, computing the scores of the ROI set through two fully connected layers and max pooling, and finally predicting the single-person action label through a Softmax layer;
performing behavior recognition for the target object in the video image I at time t, with the ROI containing the target object at time t as the main region r and the ROIs containing the target object at times t-1 and t-2 as the secondary regions S for reasoning about the behavior of the target object in frame t, where $I_t$ is the image at the current time t and r is the main region of $I_t$ containing the target object; the score of action α of the target object is defined as formula (5):
$\mathrm{score}(\alpha; I_t) = w_\alpha^{\top} \Phi(r; I_t) + \max\!\big(w_\alpha'^{\top} \Phi(s_1; I_{t-1}),\; w_\alpha'^{\top} \Phi(s_2; I_{t-2})\big) \qquad (5)$
wherein $\Phi(r; I_t)$ is the feature vector extracted from the main region r of $I_t$, $\Phi(s_1; I_{t-1})$ and $\Phi(s_2; I_{t-2})$ are the feature vectors extracted from the secondary regions of $I_{t-1}$ and $I_{t-2}$, $s_1$, $s_2$ denote the ROIs of the target object in frames $I_{t-1}$ and $I_{t-2}$, $w_\alpha$ and $w_\alpha'$ respectively denote the weights for the target-object ROI belonging to action α at the current time t and at times t-1, t-2, and max takes the maximum value; the feature extractor Φ(·) and the weights $w_\alpha$, $w_\alpha'$ are obtained by stochastic gradient descent training;
selecting, through max pooling, the largest score among the secondary regions S at times t-1 and t-2, adding it to the score of the main region to obtain the final score, converting the final score into a posterior probability at the Softmax layer, and predicting the single-person action label to obtain the single-person prediction result.
In one embodiment, for group behavior recognition, the process of performing temporal modeling of the group behavior with the ROI-matching recurrent convolutional network to generate the group behavior prediction comprises:
for frame t of the video, given the N detected ROIs $b_t^k$, smoothly extracting fixed-size feature representations $f_t^k$ from the dense feature map $F_t$ by bilinear interpolation;
passing each feature representation $f_t^k$ through a fully connected layer to obtain a more compact embedding $e_t^k \in \mathbb{R}^{D_e}$ as the input of the ROI-matching recurrent network, where $D_e$ is the number of features in the hidden state;
computing the Euclidean distance between individual ROI coordinates at video times t and t-1 and, given the ROI coordinates $b_t$, $b_{t-1}$, finding the closest match as in formula (6), $\pi(k) = \arg\min_j \|b_t^k - b_{t-1}^j\|_2$, then updating the hidden state of the ROI-matching recurrent convolutional network as in formula (7), $h_t^k = \mathrm{GRU}\big(e_t^k,\; h_{t-1}^{\pi(k)}\big)$;
performing max pooling over the hidden representations $h_t^k$ and obtaining the group behavior prediction with a Softmax classifier.
In a second aspect, an embodiment of the present invention provides a video behavior recognition system, comprising:
a feature extraction module, configured to perform multi-stage feature extraction on the video to be identified;
an ROI initial detection module, configured to initially detect the ROI of the target object using a deep fully convolutional network;
an ROI fine-tuning module, configured to fine-tune the ROI using a Markov random field to obtain the final ROI set of the target object;
a behavior recognition module, configured to perform single-person behavior recognition and group behavior recognition simultaneously based on the final ROI set of the target object: for single-person behavior recognition, performing temporal reasoning on the ROI time series of the target object and obtaining the single-person behavior prediction through two fully connected layers and a Softmax layer; for group behavior recognition, performing temporal modeling of the group behavior with the ROI-matching recurrent convolutional network to generate the group behavior prediction.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer instructions for causing a computer to execute the video behavior recognition method of the first aspect of the embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer device, comprising a memory and a processor communicatively connected to each other, the memory storing computer instructions and the processor executing the computer instructions to perform the video behavior recognition method according to the first aspect of the embodiment of the present invention.
The technical scheme of the invention has the following advantages:
the invention provides a video behavior recognition method and a system, which are used for carrying out multistage feature extraction on a video to be recognized; the ROI of the target object is initially detected by using a depth full convolution network; fine tuning the ROI by using a Markov random field to obtain a ROI set of a final target object; and simultaneously carrying out single person behavior recognition and group behavior recognition respectively based on the ROI set of the final target object. The invention not only considers the consistency of the time sequence information in the group, but also considers the difference of the individual time sequence information, and the single person behavior recognition based on the ROI time sequence reasoning is beneficial to better extracting the single person behavior characteristics with discrimination and improving the recognition precision; the ROI matching recursive convolutional network can fuse and propagate information of single ROI in time domain, which is an effective method for solving the problem of video behavior recognition.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a video behavior recognition method according to an embodiment of the present invention;
FIG. 2 is a workflow diagram of one specific example of a video behavior recognition method in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a multi-stage full convolution network extracting multi-stage features according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of detecting and trimming an ROI according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of single person behavior recognition based on ROI timing reasoning in an embodiment of the present invention;
FIG. 6 is a schematic diagram of population behavior recognition based on ROI matching in an embodiment of the present invention;
FIG. 7 is a block diagram showing a specific example of a video behavior recognition system according to an embodiment of the present invention;
fig. 8 is a composition diagram of a specific example of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention are described below clearly and completely with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
Example 1
The embodiment of the invention provides a video behavior recognition method applicable to many video behavior recognition scenarios; typical applications are sports video understanding and automatic analysis of sports tactics. Sports video is important media data with a wide audience and huge application prospects, and it has received wide attention from academia and industry. With the popularity of mobile devices and the internet, demand has moved from direct viewing and simple browsing to diversified uses of sports video, such as highlight summaries, specific event detection, program customization services, and video content editing, all of which rely on understanding and behavior recognition of sports video. In sports such as baseball, football, tennis, and volleyball, behavior recognition covers both a single person performing a series of actions to complete a task, i.e., personal behavior recognition, and multiple persons scattered across a large space working together on a common task, i.e., group behavior recognition. Because group behaviors are not a simple superposition of individual behaviors, and a specific group behavior is jointly defined by the timing information of individual behaviors and the interactions among individuals, extracting only group features while ignoring single-person behaviors hinders the generation of single-person features, prevents the timing and context information among individuals from being fully exploited, and fails to meet the structural requirements of single-person behaviors. Based on this, the embodiment of the invention provides a video behavior recognition method based on perceived regions of interest (Region of Interest, ROI for short); the recognition framework is shown in fig. 1. It considers both the consistency of timing information within the group and the differences in individual timing information, improving the recognition accuracy of single-person and group behaviors simultaneously. As shown in fig. 2, the method specifically includes the following steps:
step S10: and carrying out multistage feature extraction on the video to be identified.
One of the challenges of handling two recognition tasks simultaneously, group behavior recognition and personal behavior recognition, is that features useful for one task may be inefficient for the other. In the group behavior recognition task, single-person behavior detection must infer the behavior type of the target athlete, and further detailed features are needed to distinguish group behaviors; therefore multi-stage features, i.e., features shared among the multiple tasks, must be extracted. To this end, the embodiment of the invention performs multi-stage feature extraction with a multi-stage fully convolutional network (MFCN for short), as shown in fig. 3: a plurality of intermediate feature maps of the video to be identified are concatenated by the multi-stage fully convolutional network to generate dense features, and the dense features are scaled by bilinear interpolation to a fixed size $H \times W$, where H is the pixel height and W is the pixel width.
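By way of an illustrative sketch only (the module name, stage depths, channel counts, input sizes, and the use of PyTorch are assumptions for this example, not details from the patent), the multi-stage extraction can be pictured as follows: intermediate maps from several convolutional stages are rescaled by bilinear interpolation to a common H×W and concatenated into one dense feature tensor.

    import torch
    import torch.nn.functional as F_nn
    from torch import nn

    class MFCN(nn.Module):
        """Sketch of multi-stage feature extraction: intermediate maps of
        several conv stages are resized to a fixed H x W and concatenated."""
        def __init__(self, out_hw=(60, 80)):
            super().__init__()
            self.out_hw = out_hw
            self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
            self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
            self.stage3 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU())

        def forward(self, x):
            f1 = self.stage1(x)
            f2 = self.stage2(f1)
            f3 = self.stage3(f2)
            # Bilinear interpolation scales every intermediate map to the fixed size H x W.
            maps = [F_nn.interpolate(f, size=self.out_hw, mode="bilinear", align_corners=False)
                    for f in (f1, f2, f3)]
            return torch.cat(maps, dim=1)  # dense multi-stage features F

    frames = torch.randn(1, 3, 480, 640)  # one video frame (illustrative size)
    dense = MFCN()(frames)                # -> (1, 448, 60, 80)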
Step S20: and initially detecting the ROI of the target object by using the depth full convolution network.
In the embodiment of the invention, taking volleyball video behavior recognition as an example, target detection must be performed on the input volleyball video with the players as target objects, and each player area is taken as a region of interest, i.e., an ROI. The specific positions of these athletes in the image must be detected, i.e., a set of ROI coordinates with corresponding confidence scores is generated.
The mapping from the dense features F to B and P is a deep fully convolutional network (DFCN), corresponding to the DFCN block in fig. 4, built by stacking two 3×3 convolutional layers containing 512 filters with a shortcut connection. The shortcut, proposed in DenseNet, passes shallow information directly to deep layers, which alleviates the gradient divergence problem in deep models; dividing the network into blocks and limiting the number of output channels of each layer reduces parameters and computational complexity.
Given the output $F_t$ of the multi-stage feature extraction stage (the subscript t is omitted below for brevity), two dense feature maps B and P are produced: B encodes the ROI coordinates of each person in the scene relative to positions in the image, i.e., the region-of-interest coordinates; P encodes the probability that a part of the image contains an athlete, generating a segmentation mask, i.e., whether the ROI is an athlete.
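A minimal sketch of such a detection head, under the stated architecture (two 3×3 convolutions with 512 filters plus a shortcut connection), might look as follows; the 1×1 output heads, channel counts, and the projection used to make the shortcut addable are assumptions made for illustration.

    import torch
    from torch import nn

    class DFCN(nn.Module):
        """Sketch of the dense detection head: two 3x3 conv layers (512 filters)
        with a shortcut connection, then 1x1 heads for the dense box map B
        and the dense probability map P."""
        def __init__(self, in_ch=448):
            super().__init__()
            self.conv1 = nn.Conv2d(in_ch, 512, 3, padding=1)
            self.conv2 = nn.Conv2d(512, 512, 3, padding=1)
            self.skip = nn.Conv2d(in_ch, 512, 1)   # 1x1 projection so the shortcut can be added
            self.box_head = nn.Conv2d(512, 4, 1)   # B: box coordinates per position
            self.prob_head = nn.Conv2d(512, 1, 1)  # P: probability the position is a person

        def forward(self, feats):
            h = torch.relu(self.conv1(feats))
            h = torch.relu(self.conv2(h))
            h = h + self.skip(feats)  # shortcut passes shallow information to deeper layers
            return self.box_head(h), torch.sigmoid(self.prob_head(h))

    B, P = DFCN()(torch.randn(1, 448, 60, 80))  # B: (1, 4, 60, 80), P: (1, 1, 60, 80)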
Step S30: fine-tuning of the ROI is performed using a markov random field to obtain a set of ROIs of the final target object.
The goal of fine tuning the ROI area in the embodiment of the present invention is to remove repeatedly detected bounding boxes and determine the final accurate ROIs. The classical approach to eliminating duplicate ROIs is non-maximum suppression (NMS) over the generated confidence scores. This approach has two drawbacks: first, if the number of ROIs is large, the re-scoring stage can be very expensive; second, the NMS procedure itself is not optimal and is susceptible to greedy decisions. To avoid these disadvantages, this embodiment constructs a Markov random field (MRF) based ROI fine tuning for volleyball video, as shown in fig. 4.
The process of fine tuning the ROI using the Markov random field is specifically as follows:
The dense feature map B is converted into global image coordinates to obtain a dense bounding-box feature map $\hat{B}$, and a Markov random field is defined on $\hat{B}$. For each hypothesized coordinate, two hidden variables with Gaussian potentials are introduced: the true coordinate $X_i$ and the assignment $A_i$.
The true coordinates of the detections of the target object are encoded as $X_i$, $A_i$ assigns each detection $b_i$ to its corresponding hypothesized coordinate, and the joint distribution of $(X, A)$ is defined as formula (1):
$p(X, A \mid B) \propto \prod_i \mathcal{N}(b_i;\, X_{A_i},\, \sigma)\, p(A_i) \qquad (1)$
where σ is a fixed standard-deviation parameter.
Modeling with formula (1) produces bounding-box coordinate predictions for the target-object ROI; each position coordinate $b_i$ on the feature map F belongs to a true detection coordinate j (j may equal i), and at this true coordinate the observation $b_i$ should lie close to $X_j$.
A mean-field approximation computes the marginal distributions of $A_i$ and $X_i$ through the factorized distribution of formula (2):
$Q(X, A) = \prod_j \mathcal{N}(X_j;\, \mu_j,\, \sigma) \prod_i \mathrm{Cat}(A_i;\, \eta_i) \qquad (2)$
where $\mu_j$ and $\eta_i$ are the variational parameters of the Gaussian and categorical distributions, respectively, and Cat denotes a categorical distribution.
Minimizing the KL divergence between the factorized distribution of formula (2) and the joint distribution of formula (1), the parameters of the marginal distribution Q(·) follow the fixed-point update of formula (3):
$\eta_{ij}^{(m)} \propto \exp\!\big(-\|b_i - \mu_j^{(m)}\|^2 / 2\sigma^2\big), \qquad \hat{\mu}_j^{(m+1)} = \frac{\sum_i \eta_{ij}^{(m)} b_i}{\sum_i \eta_{ij}^{(m)}} \qquad (3)$
where m is the iteration index; the assignments are re-parameterized to give $\eta_{ij}$, and iteration starts from an initial value $\mu^{0}$ until formula (3) converges. In practical experiments, estimation of $\hat{B}$ starts from the initialized $\mu^{0}$ and considers only positions whose segmentation probability satisfies $P_i > \rho$, where ρ is a fixed threshold.
To stabilize the variational parameters $\mu_j$ of the Gaussian distribution over the iterations m, a smoothed update is performed using formula (4):
$\mu_j^{(m+1)} = \lambda\, \mu_j^{(m)} + (1 - \lambda)\, \hat{\mu}_j^{(m+1)} \qquad (4)$
where λ is a damping parameter.
The embodiment of the invention uses a simple iteration scheme similar to that used in Hough forests to identify the hypotheses: first the hypothesis to which the most coordinates are assigned is found, then it is decided which positions to remove, and the procedure iterates until all coordinates are assigned. The number of assigned coordinates is used as a confidence score, ROI coordinates whose confidence score exceeds a preset threshold are retained as the final ROI set, and N groups of reliable detections are obtained, with bounding boxes encoded as $b^1, \ldots, b^N$; N depends on the number of athletes and will generally be less than the number of athletes.
Step S40: performing single-person behavior recognition and group behavior recognition simultaneously based on the final ROI set of the target object; for single-person behavior recognition, performing temporal reasoning on the ROI time series of the target object and obtaining the single-person behavior prediction through two fully connected layers and a Softmax layer; for group behavior recognition, performing temporal modeling of the group behavior with the ROI-matching recurrent convolutional network to generate the group behavior prediction.
For single-person behavior recognition, temporal reasoning is performed on the ROI time series of the target object. A main region r is set, containing the ROI of the target object to be identified, and two secondary regions S are set as context clues for reasoning about the behavior of the main region r.
Based on the ROI set of each image in the video, the scores of the ROI set are computed through two fully connected layers and max pooling, and a Softmax layer finally predicts the single-person action label.
Behavior recognition is performed for the target object in the video image I at time t: the ROI containing the target object at time t serves as the main region r, and the ROIs containing the target object at times t-1 and t-2 serve as the secondary regions S for reasoning about the behavior of the target object in frame t. $I_t$ is the image at the current time t, r is the main region of $I_t$ containing the target object, and the score of action α of the target object is defined as formula (5):
$\mathrm{score}(\alpha; I_t) = w_\alpha^{\top} \Phi(r; I_t) + \max\!\big(w_\alpha'^{\top} \Phi(s_1; I_{t-1}),\; w_\alpha'^{\top} \Phi(s_2; I_{t-2})\big) \qquad (5)$
where $\Phi(r; I_t)$ is the feature vector extracted from the main region r of $I_t$, $\Phi(s_1; I_{t-1})$ and $\Phi(s_2; I_{t-2})$ are the feature vectors extracted from the secondary regions of $I_{t-1}$ and $I_{t-2}$, $s_1$, $s_2$ denote the ROIs of the target object in frames $I_{t-1}$ and $I_{t-2}$, $w_\alpha$ and $w_\alpha'$ respectively denote the weights for the target-object ROI belonging to action α at the current time t and at times t-1, t-2, and max takes the maximum value; the feature extractor Φ(·) and the weights $w_\alpha$, $w_\alpha'$ are obtained by stochastic gradient descent training.
Max pooling selects the largest score among the secondary regions S at times t-1 and t-2, which is added to the score of the main region to obtain the final score; the final score is converted into a posterior probability at the Softmax layer, and the single-person action label is predicted to obtain the single-person prediction result.
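The score of formula (5) and its conversion to a posterior is small enough to sketch directly; the feature dimension, the number of actions, and the sharing of one secondary weight across t-1 and t-2 are assumptions made for this example.

    import torch

    def single_person_posterior(phi_r, phi_s1, phi_s2, w, w_sec):
        """Sketch of formula (5). phi_r: (D,) feature of the main region r at time t;
        phi_s1, phi_s2: (D,) features of the secondary regions at t-1 and t-2;
        w, w_sec: (A, D) per-action weights for the primary and secondary terms."""
        primary = w @ phi_r                                        # w_alpha . Phi(r; I_t)
        secondary = torch.maximum(w_sec @ phi_s1, w_sec @ phi_s2)  # max over secondary regions
        return torch.softmax(primary + secondary, dim=0)           # posterior over action labels

    D, A = 512, 9  # assumed feature size and number of single-person actions
    posterior = single_person_posterior(torch.randn(D), torch.randn(D), torch.randn(D),
                                        torch.randn(A, D), torch.randn(A, D))
    label = posterior.argmax()           # predicted single-person action label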
Regarding group behavior recognition, timing information is a crucial feature. Group behavior recognition is therefore based on an ROI-matching recurrent neural network (ROI-match RNN), which can fuse and propagate the information of individual ROIs in the time domain; its network structure is shown in fig. 6. The specific identification process is as follows:
For frame t of the video, given the N detected ROIs $b_t^k$, fixed-size feature representations $f_t^k$ are smoothly extracted from the dense feature map $F_t$ by bilinear interpolation.
Each feature representation $f_t^k$ is passed through a fully connected layer to obtain a more compact embedding $e_t^k \in \mathbb{R}^{D_e}$ as the input of the ROI-matching recurrent network, where $D_e$ is the number of features in the hidden state.
The embodiment of the invention uses a gated recurrent unit (Gated Recurrent Unit, GRU for short) for each ROI in the time series, whose hidden state is denoted $h_t^k$. During training and testing no track assignments are available, so $h_{t-1}^k$ and $e_t^k$ do not necessarily relate to the same person. To solve this problem, the Euclidean distance between individual ROI coordinates at video times t and t-1 is computed, and given the ROI coordinates $b_t$, $b_{t-1}$ the closest match is found as in formula (6):
$\pi(k) = \arg\min_j \|b_t^k - b_{t-1}^j\|_2 \qquad (6)$
and the hidden state of the ROI-matching recurrent convolutional network is updated as in formula (7):
$h_t^k = \mathrm{GRU}\big(e_t^k,\; h_{t-1}^{\pi(k)}\big) \qquad (7)$
Using $e_t$ instead of the bounding-box coordinates $b_t$ makes the model more robust to lost or misassigned detections, so an exact nearest neighbour is not required for the hidden-state update. To obtain the final group-behavior prediction, max pooling is first performed over the hidden representations $h_t^k$, and a SoftMax classifier then yields the group-behavior label.
The video behavior recognition method provided by the embodiment of the invention considers both the consistency of timing information within the group and the differences in individual timing information. Single-person behavior recognition based on ROI temporal reasoning helps extract more discriminative single-person behavior features and improves recognition accuracy, and the ROI-matching recurrent convolutional network can fuse and propagate single-person ROI information in the time domain, providing an effective method for volleyball video behavior recognition.
Example 2
An embodiment of the present invention provides a video behavior recognition system, as shown in fig. 7, including:
the feature extraction module 10 is used for carrying out multi-level feature extraction on the video to be identified. This module performs the method described in step S10 in embodiment 1, and will not be described here.
The ROI initial detection module 20 is configured to initially detect the ROI of the target object by using the depth full convolution network. This module performs the method described in step S20 in embodiment 1, and will not be described here.
An ROI fine-tuning module 30, configured to fine-tune the ROI using a Markov random field to obtain the final ROI set of the target object. This module performs the method described in step S30 in embodiment 1, and will not be described here.
A behavior recognition module 40, configured to perform single-person behavior recognition and group behavior recognition simultaneously based on the final ROI set of the target object; for single-person behavior recognition, performing temporal reasoning on the ROI time series of the target object and obtaining the single-person behavior prediction through two fully connected layers and a Softmax layer; for group behavior recognition, performing temporal modeling of the group behavior with the ROI-matching recurrent convolutional network to generate the group behavior prediction. This module performs the method described in step S40 in embodiment 1, and will not be described here.
The video behavior recognition system provided by the embodiment of the invention considers both the consistency of timing information within the group and the differences in individual timing information. Single-person behavior recognition based on ROI temporal reasoning helps extract more discriminative single-person behavior features and improves recognition accuracy, and the ROI-matching recurrent convolutional network can fuse and propagate single-person ROI information in the time domain, providing an effective method for volleyball video behavior recognition.
Example 3
Embodiments of the present invention provide a computer device, as shown in fig. 8, which may include a processor 51 and a memory 52, where the processor 51 and the memory 52 may be connected by a bus or otherwise, fig. 8 being an example of a connection via a bus.
The processor 51 may be a central processing unit (Central Processing Unit, CPU). The processor 51 may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or combinations thereof.
The memory 52 serves as a non-transitory computer readable storage medium that may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as corresponding program instructions/modules in embodiments of the present invention. The processor 51 executes various functional applications of the processor and data processing by running non-transitory software programs, instructions, and modules stored in the memory 52, that is, implements the video behavior recognition method in the above-described method embodiment 1.
Memory 52 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created by the processor 51, etc. In addition, memory 52 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 52 may optionally include memory located remotely from processor 51, which may be connected to processor 51 via a network. Examples of such networks include, but are not limited to, the internet, intranets, mobile communication networks, and combinations thereof.
One or more modules are stored in the memory 52 that, when executed by the processor 51, perform the video behavior recognition method of embodiment 1.
The details of the above computer device may be correspondingly understood by referring to the corresponding related descriptions and effects in embodiment 1, and will not be repeated here.
It will be appreciated by those skilled in the art that all or part of the flow of the above embodiment method may be implemented by a computer program instructing related hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the flows of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk (HDD), or a solid-state drive (SSD); the storage medium may also comprise a combination of the above kinds of memories.
It is apparent that the above examples are given by way of illustration only and are not limiting of the embodiments. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. And obvious variations or modifications thereof are contemplated as falling within the scope of the present invention.

Claims (6)

1. A video behavior recognition method, comprising the steps of:
performing multi-stage feature extraction on the video to be identified;
initially detecting the ROI of the target object using a deep fully convolutional network;
fine-tuning the ROI using a Markov random field to obtain the final ROI set of the target object;
performing single-person behavior recognition and group behavior recognition simultaneously based on the final ROI set of the target object; for single-person behavior recognition, performing temporal reasoning on the ROI time series of the target object and obtaining the single-person behavior prediction through two fully connected layers and a Softmax layer; for group behavior recognition, performing temporal modeling of the group behavior with the ROI-matching recurrent convolutional network to generate the group behavior prediction;
the process for identifying the single person behavior, carrying out time sequence reasoning on the ROI time sequence of the target object, and obtaining a prediction result of the single person behavior by accessing two full-connection layers and a Softmax layer comprises the following steps:
setting a main areaWherein the ROI area containing the target object to be identified is provided with two secondary areas simultaneouslyAs contextual clues for reasoning about main region +.>Behavior of (2);
based on the ROI set of each image in the video, calculating the score of the ROI set through two full-connection layers and the maximum pooling, and finally predicting a single action label through a Softmax layer;
for timePerforming behavior recognition on target objects in video image I, < >>ROI comprising the target object as main region +.>Will->The ROI of the target object is included as a secondary region +.>For reasoning->Behavior of frame target object,/->Is the current time +.>Image (S)/(S)>Is->Comprises a main area of the target object, and the action of the target object is->The score of (2) is defined as equation (5):
(5)
wherein ,is from->Middle main area->Extracted feature vector, < > and-> and />Is from->,Middle minor region->Extracted feature vector, < > and->Represents->,/>ROI, & gt of frame target object> and />Respectively represent the current time +.> and />The ROI area of the target object belonging to the action->Weight of->Representing the feature vector +.>Weight-> and />Is obtained by using random gradient descent training;
time pooling by maximaIs->The score of the largest region is selected and added with the score of the main region to obtain the final score, and the final score is converted into posterior probability at a Softmax layerAnd predicting a single action label to obtain a single prediction result.
2. The video behavior recognition method according to claim 1, wherein the process of performing multi-stage feature extraction on the video to be identified comprises:
concatenating a plurality of intermediate feature maps of the video to be identified using a multi-stage fully convolutional network to generate dense features;
scaling the dense features by bilinear interpolation to a fixed size $H \times W$, where H is the pixel height and W is the pixel width.
3. The video behavior recognition method according to claim 2, wherein the process of initially detecting the ROI of the target object using the deep fully convolutional network comprises:
performing target-object detection on the video to be identified with the deep fully convolutional network, taking each target-object area as an ROI, and generating a set of ROI coordinates with corresponding confidence scores;
on the premise that the dense features F are output by the multi-stage feature extraction stage, generating a dense feature map B and a dense feature map P for the single-person target region, wherein the dense feature map B encodes the ROI coordinates of each person in the scene relative to positions in the image, and the dense feature map P encodes the probability that the part of the image containing the ROI is a target object.
4. A video behavior recognition system, comprising:
a feature extraction module, configured to perform multi-stage feature extraction on the video to be identified;
an ROI initial detection module, configured to initially detect the ROI of the target object using a deep fully convolutional network;
an ROI fine-tuning module, configured to fine-tune the ROI using a Markov random field to obtain the final ROI set of the target object;
a behavior recognition module, configured to perform single-person behavior recognition and group behavior recognition simultaneously based on the final ROI set of the target object; for single-person behavior recognition, performing temporal reasoning on the ROI time series of the target object and obtaining the single-person behavior prediction through two fully connected layers and a Softmax layer; for group behavior recognition, performing temporal modeling of the group behavior with the ROI-matching recurrent convolutional network to generate the group behavior prediction;
wherein the process of performing temporal reasoning on the ROI time series of the target object for single-person behavior recognition and obtaining the single-person behavior prediction through two fully connected layers and a Softmax layer comprises the following steps:
setting a main region r containing the ROI of the target object to be identified, and simultaneously setting two secondary regions S as context clues for reasoning about the behavior of the main region r;
based on the ROI set of each image in the video, computing the scores of the ROI set through two fully connected layers and max pooling, and finally predicting the single-person action label through a Softmax layer;
performing behavior recognition for the target object in the video image I at time t, with the ROI containing the target object at time t as the main region r and the ROIs containing the target object at times t-1 and t-2 as the secondary regions S for reasoning about the behavior of the target object in frame t, where $I_t$ is the image at the current time t and r is the main region of $I_t$ containing the target object, the score of action α of the target object being defined as formula (5):
$\mathrm{score}(\alpha; I_t) = w_\alpha^{\top} \Phi(r; I_t) + \max\!\big(w_\alpha'^{\top} \Phi(s_1; I_{t-1}),\; w_\alpha'^{\top} \Phi(s_2; I_{t-2})\big) \qquad (5)$
wherein $\Phi(r; I_t)$ is the feature vector extracted from the main region r of $I_t$, $\Phi(s_1; I_{t-1})$ and $\Phi(s_2; I_{t-2})$ are the feature vectors extracted from the secondary regions of $I_{t-1}$ and $I_{t-2}$, $s_1$, $s_2$ denote the ROIs of the target object in frames $I_{t-1}$ and $I_{t-2}$, $w_\alpha$ and $w_\alpha'$ respectively denote the weights for the target-object ROI belonging to action α at the current time t and at times t-1, t-2, max takes the maximum value, and the feature extractor Φ(·) and the weights $w_\alpha$, $w_\alpha'$ are obtained by stochastic gradient descent training;
selecting, through max pooling, the largest score among the secondary regions S at times t-1 and t-2, adding it to the score of the main region to obtain the final score, converting the final score into a posterior probability at the Softmax layer, and predicting the single-person action label to obtain the single-person prediction result.
5. A computer-readable storage medium storing computer instructions for causing a computer to perform the video behavior recognition method of any one of claims 1-3.
6. A computer device, comprising: a memory and a processor, the memory and the processor being communicatively coupled to each other, the memory storing computer instructions, the processor executing the computer instructions to perform the video behavior recognition method of any one of claims 1-3.
CN202010845486.4A 2020-08-20 2020-08-20 Video behavior recognition method and system Active CN112131944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010845486.4A CN112131944B (en) 2020-08-20 2020-08-20 Video behavior recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010845486.4A CN112131944B (en) 2020-08-20 2020-08-20 Video behavior recognition method and system

Publications (2)

Publication Number Publication Date
CN112131944A (en) 2020-12-25
CN112131944B (en) 2023-10-17

Family

ID=73850455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010845486.4A Active CN112131944B (en) 2020-08-20 2020-08-20 Video behavior recognition method and system

Country Status (1)

Country Link
CN (1) CN112131944B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111838A (en) * 2021-04-25 2021-07-13 上海商汤智能科技有限公司 Behavior recognition method and device, equipment and storage medium
CN114298183B (en) * 2021-12-20 2024-04-05 江西洪都航空工业集团有限责任公司 Intelligent recognition method for flight actions
CN116469155A (en) * 2022-01-11 2023-07-21 北京大学 Complex action recognition method and device based on learnable Markov logic network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017015947A1 (en) * 2015-07-30 2017-02-02 Xiaogang Wang A system and a method for object tracking
CN106529467A (en) * 2016-11-07 2017-03-22 南京邮电大学 Group behavior identification method based on multi-feature fusion
CN110796081A (en) * 2019-10-29 2020-02-14 深圳龙岗智能视听研究院 Group behavior identification method based on relational graph analysis
CN111401174A (en) * 2020-03-07 2020-07-10 北京工业大学 Volleyball group behavior identification method based on multi-mode information fusion

Also Published As

Publication number Publication date
CN112131944A (en) 2020-12-25

Similar Documents

Publication Publication Date Title
WO2020221298A1 (en) Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
Wang et al. Detect globally, refine locally: A novel approach to saliency detection
CN109977262B (en) Method and device for acquiring candidate segments from video and processing equipment
CN110321813B (en) Cross-domain pedestrian re-identification method based on pedestrian segmentation
CN112131944B (en) Video behavior recognition method and system
CN111476302B (en) fast-RCNN target object detection method based on deep reinforcement learning
CN109426805B (en) Method, apparatus and computer program product for object detection
CN111161311A (en) Visual multi-target tracking method and device based on deep learning
EP1934941B1 (en) Bi-directional tracking using trajectory segment analysis
Zhao et al. Closely coupled object detection and segmentation
Esmaeili et al. Fast-at: Fast automatic thumbnail generation using deep neural networks
CN111783576A (en) Pedestrian re-identification method based on improved YOLOv3 network and feature fusion
CN111027493A (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN111222487B (en) Video target behavior identification method and electronic equipment
Gu et al. Multiple stream deep learning model for human action recognition
Zhu et al. A novel recursive Bayesian learning-based method for the efficient and accurate segmentation of video with dynamic background
CN106157330B (en) Visual tracking method based on target joint appearance model
Dai et al. Tan: Temporal aggregation network for dense multi-label action recognition
Vainstein et al. Modeling video activity with dynamic phrases and its application to action recognition in tennis videos
Liu et al. 3d-queryis: A query-based framework for 3d instance segmentation
CN111291785A (en) Target detection method, device, equipment and storage medium
CN108257148B (en) Target suggestion window generation method of specific object and application of target suggestion window generation method in target tracking
Şah et al. Review and evaluation of player detection methods in field sports: Comparing conventional and deep learning based methods
CN113762041A (en) Video classification method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant