CN112131944B - Video behavior recognition method and system - Google Patents

Video behavior recognition method and system

Info

Publication number
CN112131944B
CN112131944B · Application CN202010845486.4A
Authority
CN
China
Prior art keywords
roi
target object
behavior
video
behavior recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010845486.4A
Other languages
Chinese (zh)
Other versions
CN112131944A (en)
Inventor
李岩山
刘燕
谢维信
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University
Priority to CN202010845486.4A
Publication of CN112131944A
Application granted
Publication of CN112131944B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Psychiatry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video behavior recognition method and system that perform multi-stage feature extraction on a video to be recognized; initially detect the ROI of each target object with a deep fully convolutional network; fine-tune the ROIs with a Markov random field to obtain the final ROI set of the target objects; and, based on that final ROI set, perform single-person behavior recognition and group behavior recognition simultaneously. The invention considers both the consistency of timing information within a group and the differences in individual timing information. Single-person behavior recognition based on ROI temporal reasoning helps extract more discriminative single-person behavior features and improves recognition accuracy, while the ROI-matching recurrent convolutional network fuses and propagates the information of individual ROIs in the time domain, providing an effective approach to video behavior recognition.

Description

Video behavior recognition method and system
Technical Field
The invention relates to the technical field of behavior recognition, in particular to a video behavior recognition method and system.
Background
In recent years, behavior recognition algorithms have developed rapidly, and group behavior recognition based on deep learning has achieved good results. Deep-learning methods currently deliver strong recognition performance on group behavior recognition. However, existing studies treat group videos as a whole and neglect individual behavior recognition, which is just as important as group behavior. A group behavior is not a simple superposition of individual behaviors: a specific group behavior is jointly defined by the timing information of individual behaviors and the interactions among individuals. Extracting only group features while ignoring single-person behaviors hinders the generation of single-person features, prevents the timing and context information among individuals from being fully exploited for group behavior features, and fails to meet the structural requirements of single-person behaviors. Existing algorithms suffer seriously from these problems; because single-person timing and context information is not fully considered, recognition accuracy is limited.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the defect of prior-art video behavior recognition methods that single-person timing information and context information are not fully considered, which limits recognition accuracy, and accordingly to provide a video behavior recognition method and system.
In order to achieve the above purpose, the present invention provides the following technical solutions:
In a first aspect, an embodiment of the present invention provides a video behavior recognition method, comprising the steps of:
performing multi-stage feature extraction on the video to be identified;
initially detecting the ROI of the target object using a deep fully convolutional network;
fine-tuning the ROI using a Markov random field to obtain the final ROI set of the target object;
performing single-person behavior recognition and group behavior recognition simultaneously based on the final ROI set of the target object: for single-person behavior recognition, performing temporal reasoning on the ROI time series of the target object and obtaining the single-person behavior prediction through two fully connected layers and a Softmax layer; for group behavior recognition, performing temporal modeling of the group behavior with the ROI-matching recurrent convolutional network to generate the group behavior prediction.
In one embodiment, the process of performing multi-stage feature extraction on the video to be identified includes:
concatenating a plurality of intermediate feature maps of the video to be identified using a multi-stage fully convolutional network to generate dense features;
scaling the dense features by bilinear interpolation to a fixed size $H \times W$, where H is the pixel height and W is the pixel width.
In one embodiment, the process of initially detecting the ROI of the target object using the deep fully convolutional network comprises:
performing target-object detection on the video to be identified with the deep fully convolutional network, taking each target-object area as an ROI, and generating a set of ROI coordinates with corresponding confidence scores;
on the premise that the dense features F are output by the multi-stage feature extraction stage, generating a dense feature map B and a dense feature map P for the single-person target region, where the dense feature map B encodes the ROI coordinates of each person in the scene relative to positions in the image, and the dense feature map P encodes the probability that the part of the image containing the ROI is a target object.
In one embodiment, the process of obtaining the initial feature A of the video to be identified comprises: after multi-stage feature extraction of the video to be identified, the ROI is initially detected using the deep fully convolutional network, and fine tuning of the ROI is performed using a Markov random field to obtain the final ROI set as the initial feature A.
In one embodiment, the process of fine tuning the ROI using a Markov random field to obtain a set of ROIs for a final target object comprises:
converting the dense feature map B into global image coordinates to obtain a dense bounding-box feature map $\hat{B}$, defining a Markov random field on $\hat{B}$, and introducing, for each hypothesized coordinate, two hidden variables with Gaussian potentials: the true coordinate $X_i$ and the assignment $A_i$;
encoding the true coordinates of the detections of the target object as $X_i$, letting $A_i$ assign each detection $b_i$ to its corresponding hypothesized coordinate, and defining the joint distribution of $(X, A)$ as formula (1):
$p(X, A \mid B) \propto \prod_i \mathcal{N}(b_i;\, X_{A_i},\, \sigma)\, p(A_i) \qquad (1)$
wherein σ is a fixed standard-deviation parameter;
generating the bounding-box coordinate prediction of the target-object ROI by modeling with formula (1), each position coordinate $b_i$ on the feature map F belonging to a true detection coordinate j;
computing, by mean-field approximation, the marginal distributions of $A_i$ and $X_i$ through the factorized distribution of formula (2):
$Q(X, A) = \prod_j \mathcal{N}(X_j;\, \mu_j,\, \sigma) \prod_i \mathrm{Cat}(A_i;\, \eta_i) \qquad (2)$
wherein $\mu_j$ and $\eta_i$ are the variational parameters of the Gaussian and categorical distributions, respectively, and Cat denotes a categorical distribution;
minimizing the KL divergence between the factorized distribution of formula (2) and the joint distribution, so that the parameters of the marginal distribution Q(·) perform the fixed-point update of formula (3):
$\eta_{ij}^{(m)} \propto \exp\!\big(-\|b_i - \mu_j^{(m)}\|^2 / 2\sigma^2\big), \qquad \hat{\mu}_j^{(m+1)} = \frac{\sum_i \eta_{ij}^{(m)} b_i}{\sum_i \eta_{ij}^{(m)}} \qquad (3)$
wherein m is the iteration index, the assignments being re-parameterized to give $\eta_{ij}$, starting from an initial value $\mu^{0}$ until formula (3) reaches convergence; to stabilize the variational parameters $\mu_j$ of the Gaussian distribution over the iterations m, performing the smoothed update of formula (4):
$\mu_j^{(m+1)} = \lambda\, \mu_j^{(m)} + (1 - \lambda)\, \hat{\mu}_j^{(m+1)} \qquad (4)$
wherein λ is a damping parameter;
iterating with a preset iteration scheme until all coordinates are allocated, using the number of allocated coordinates as a confidence score, retaining the ROI coordinates whose confidence score is greater than a preset threshold, and obtaining N groups of reliable detection coordinates as the final ROI set.
In one embodiment, for single-person behavior recognition, the process of performing temporal reasoning on the ROI time series of the target object and obtaining the single-person behavior prediction through two fully connected layers and a Softmax layer comprises:
setting a main region r containing the ROI of the target object to be identified, and simultaneously setting two secondary regions S as context clues for reasoning about the behavior of the main region r;
based on the ROI set of each image in the video, computing the scores of the ROI set through two fully connected layers and max pooling, and finally predicting the single-person action label through a Softmax layer;
performing behavior recognition for the target object in the video image I at time t, with the ROI containing the target object at time t as the main region r and the ROIs containing the target object at times t-1 and t-2 as the secondary regions S for reasoning about the behavior of the target object in frame t, where $I_t$ is the image at the current time t and r is the main region of $I_t$ containing the target object; the score of action α of the target object is defined as formula (5):
$\mathrm{score}(\alpha; I_t) = w_\alpha^{\top} \Phi(r; I_t) + \max\!\big(w_\alpha'^{\top} \Phi(s_1; I_{t-1}),\; w_\alpha'^{\top} \Phi(s_2; I_{t-2})\big) \qquad (5)$
wherein $\Phi(r; I_t)$ is the feature vector extracted from the main region r of $I_t$, $\Phi(s_1; I_{t-1})$ and $\Phi(s_2; I_{t-2})$ are the feature vectors extracted from the secondary regions of $I_{t-1}$ and $I_{t-2}$, $s_1$, $s_2$ denote the ROIs of the target object in frames $I_{t-1}$ and $I_{t-2}$, $w_\alpha$ and $w_\alpha'$ respectively denote the weights for the target-object ROI belonging to action α at the current time t and at times t-1, t-2, and max takes the maximum value; the feature extractor Φ(·) and the weights $w_\alpha$, $w_\alpha'$ are obtained by stochastic gradient descent training;
selecting, through max pooling, the largest score among the secondary regions S at times t-1 and t-2, adding it to the score of the main region to obtain the final score, converting the final score into a posterior probability at the Softmax layer, and predicting the single-person action label to obtain the single-person prediction result.
In one embodiment, for group behavior recognition, the process of performing temporal modeling of the group behavior with the ROI-matching recurrent convolutional network to generate the group behavior prediction comprises:
for frame t of the video, given the N detected ROIs $b_t^k$, smoothly extracting fixed-size feature representations $f_t^k$ from the dense feature map $F_t$ by bilinear interpolation;
passing each feature representation $f_t^k$ through a fully connected layer to obtain a more compact embedding $e_t^k \in \mathbb{R}^{D_e}$ as the input of the ROI-matching recurrent network, where $D_e$ is the number of features in the hidden state;
computing the Euclidean distance between individual ROI coordinates at video times t and t-1 and, given the ROI coordinates $b_t$, $b_{t-1}$, finding the closest match as in formula (6), $\pi(k) = \arg\min_j \|b_t^k - b_{t-1}^j\|_2$, then updating the hidden state of the ROI-matching recurrent convolutional network as in formula (7), $h_t^k = \mathrm{GRU}\big(e_t^k,\; h_{t-1}^{\pi(k)}\big)$;
performing max pooling over the hidden representations $h_t^k$ and obtaining the group behavior prediction with a Softmax classifier.
In a second aspect, an embodiment of the present invention provides a video behavior recognition system, comprising:
a feature extraction module, configured to perform multi-stage feature extraction on the video to be identified;
an ROI initial detection module, configured to initially detect the ROI of the target object using a deep fully convolutional network;
an ROI fine-tuning module, configured to fine-tune the ROI using a Markov random field to obtain the final ROI set of the target object;
a behavior recognition module, configured to perform single-person behavior recognition and group behavior recognition simultaneously based on the final ROI set of the target object: for single-person behavior recognition, performing temporal reasoning on the ROI time series of the target object and obtaining the single-person behavior prediction through two fully connected layers and a Softmax layer; for group behavior recognition, performing temporal modeling of the group behavior with the ROI-matching recurrent convolutional network to generate the group behavior prediction.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer instructions for causing a computer to execute the video behavior recognition method of the first aspect of the embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer device, comprising a memory and a processor communicatively connected to each other, the memory storing computer instructions and the processor executing the computer instructions to perform the video behavior recognition method according to the first aspect of the embodiment of the present invention.
The technical scheme of the invention has the following advantages:
the invention provides a video behavior recognition method and a system, which are used for carrying out multistage feature extraction on a video to be recognized; the ROI of the target object is initially detected by using a depth full convolution network; fine tuning the ROI by using a Markov random field to obtain a ROI set of a final target object; and simultaneously carrying out single person behavior recognition and group behavior recognition respectively based on the ROI set of the final target object. The invention not only considers the consistency of the time sequence information in the group, but also considers the difference of the individual time sequence information, and the single person behavior recognition based on the ROI time sequence reasoning is beneficial to better extracting the single person behavior characteristics with discrimination and improving the recognition precision; the ROI matching recursive convolutional network can fuse and propagate information of single ROI in time domain, which is an effective method for solving the problem of video behavior recognition.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a video behavior recognition method according to an embodiment of the present invention;
FIG. 2 is a workflow diagram of one specific example of a video behavior recognition method in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a multi-stage full convolution network extracting multi-stage features according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of detecting and trimming an ROI according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of single person behavior recognition based on ROI timing reasoning in an embodiment of the present invention;
FIG. 6 is a schematic diagram of population behavior recognition based on ROI matching in an embodiment of the present invention;
FIG. 7 is a block diagram showing a specific example of a video behavior recognition system according to an embodiment of the present invention;
fig. 8 is a composition diagram of a specific example of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention are described below clearly and completely with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
Example 1
The embodiment of the invention provides a video behavior recognition method applicable to many video behavior recognition scenarios; typical applications are sports video understanding and automatic analysis of sports tactics. Sports video is important media data with a wide audience and huge application prospects, and it has received wide attention from academia and industry. With the popularity of mobile devices and the internet, demand has moved from direct viewing and simple browsing to diversified uses of sports video, such as highlight summaries, specific event detection, program customization services, and video content editing, all of which rely on understanding and behavior recognition of sports video. In sports such as baseball, football, tennis, and volleyball, behavior recognition covers both a single person performing a series of actions to complete a task, i.e., personal behavior recognition, and multiple persons scattered across a large space working together on a common task, i.e., group behavior recognition. Because group behaviors are not a simple superposition of individual behaviors, and a specific group behavior is jointly defined by the timing information of individual behaviors and the interactions among individuals, extracting only group features while ignoring single-person behaviors hinders the generation of single-person features, prevents the timing and context information among individuals from being fully exploited, and fails to meet the structural requirements of single-person behaviors. Based on this, the embodiment of the invention provides a video behavior recognition method based on perceived regions of interest (Region of Interest, ROI for short); the recognition framework is shown in fig. 1. It considers both the consistency of timing information within the group and the differences in individual timing information, improving the recognition accuracy of single-person and group behaviors simultaneously. As shown in fig. 2, the method specifically includes the following steps:
step S10: and carrying out multistage feature extraction on the video to be identified.
One of the challenges of handling two recognition tasks simultaneously, group behavior recognition and personal behavior recognition, is that features useful for one task may be inefficient for the other. In the group behavior recognition task, single-person behavior detection must infer the behavior type of the target athlete, and further detailed features are needed to distinguish group behaviors; therefore multi-stage features, i.e., features shared among the multiple tasks, must be extracted. To this end, the embodiment of the invention performs multi-stage feature extraction with a multi-stage fully convolutional network (MFCN for short), as shown in fig. 3: a plurality of intermediate feature maps of the video to be identified are concatenated by the multi-stage fully convolutional network to generate dense features, and the dense features are scaled by bilinear interpolation to a fixed size $H \times W$, where H is the pixel height and W is the pixel width.
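By way of an illustrative sketch only (the module name, stage depths, channel counts, input sizes, and the use of PyTorch are assumptions for this example, not details from the patent), the multi-stage extraction can be pictured as follows: intermediate maps from several convolutional stages are rescaled by bilinear interpolation to a common H×W and concatenated into one dense feature tensor.

    import torch
    import torch.nn.functional as F_nn
    from torch import nn

    class MFCN(nn.Module):
        """Sketch of multi-stage feature extraction: intermediate maps of
        several conv stages are resized to a fixed H x W and concatenated."""
        def __init__(self, out_hw=(60, 80)):
            super().__init__()
            self.out_hw = out_hw
            self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
            self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
            self.stage3 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU())

        def forward(self, x):
            f1 = self.stage1(x)
            f2 = self.stage2(f1)
            f3 = self.stage3(f2)
            # Bilinear interpolation scales every intermediate map to the fixed size H x W.
            maps = [F_nn.interpolate(f, size=self.out_hw, mode="bilinear", align_corners=False)
                    for f in (f1, f2, f3)]
            return torch.cat(maps, dim=1)  # dense multi-stage features F

    frames = torch.randn(1, 3, 480, 640)  # one video frame (illustrative size)
    dense = MFCN()(frames)                # -> (1, 448, 60, 80)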
Step S20: and initially detecting the ROI of the target object by using the depth full convolution network.
In the embodiment of the invention, taking volleyball video behavior recognition as an example, target detection must be performed on the input volleyball video with the players as target objects, and each player area is taken as a region of interest, i.e., an ROI. The specific positions of these athletes in the image must be detected, i.e., a set of ROI coordinates with corresponding confidence scores is generated.
The mapping from the dense features F to B and P is a deep fully convolutional network (DFCN), corresponding to the DFCN block in fig. 4, built by stacking two 3×3 convolutional layers containing 512 filters with a shortcut connection. The shortcut, proposed in DenseNet, passes shallow information directly to deep layers, which alleviates the gradient divergence problem in deep models; dividing the network into blocks and limiting the number of output channels of each layer reduces parameters and computational complexity.
Given the output $F_t$ of the multi-stage feature extraction stage (the subscript t is omitted below for brevity), two dense feature maps B and P are produced: B encodes the ROI coordinates of each person in the scene relative to positions in the image, i.e., the region-of-interest coordinates; P encodes the probability that a part of the image contains an athlete, generating a segmentation mask, i.e., whether the ROI is an athlete.
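A minimal sketch of such a detection head, under the stated architecture (two 3×3 convolutions with 512 filters plus a shortcut connection), might look as follows; the 1×1 output heads, channel counts, and the projection used to make the shortcut addable are assumptions made for illustration.

    import torch
    from torch import nn

    class DFCN(nn.Module):
        """Sketch of the dense detection head: two 3x3 conv layers (512 filters)
        with a shortcut connection, then 1x1 heads for the dense box map B
        and the dense probability map P."""
        def __init__(self, in_ch=448):
            super().__init__()
            self.conv1 = nn.Conv2d(in_ch, 512, 3, padding=1)
            self.conv2 = nn.Conv2d(512, 512, 3, padding=1)
            self.skip = nn.Conv2d(in_ch, 512, 1)   # 1x1 projection so the shortcut can be added
            self.box_head = nn.Conv2d(512, 4, 1)   # B: box coordinates per position
            self.prob_head = nn.Conv2d(512, 1, 1)  # P: probability the position is a person

        def forward(self, feats):
            h = torch.relu(self.conv1(feats))
            h = torch.relu(self.conv2(h))
            h = h + self.skip(feats)  # shortcut passes shallow information to deeper layers
            return self.box_head(h), torch.sigmoid(self.prob_head(h))

    B, P = DFCN()(torch.randn(1, 448, 60, 80))  # B: (1, 4, 60, 80), P: (1, 1, 60, 80)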
Step S30: fine-tuning of the ROI is performed using a markov random field to obtain a set of ROIs of the final target object.
The goal of fine tuning the ROI area in the embodiment of the present invention is to remove repeatedly detected bounding boxes and determine the final accurate ROIs. The classical approach to eliminating duplicate ROIs is non-maximum suppression (NMS) over the generated confidence scores. This approach has two drawbacks: first, if the number of ROIs is large, the re-scoring stage can be very expensive; second, the NMS procedure itself is not optimal and is susceptible to greedy decisions. To avoid these disadvantages, this embodiment constructs a Markov random field (MRF) based ROI fine tuning for volleyball video, as shown in fig. 4.
The process of fine tuning the ROI using the Markov random field is specifically as follows:
The dense feature map B is converted into global image coordinates to obtain a dense bounding-box feature map $\hat{B}$, and a Markov random field is defined on $\hat{B}$. For each hypothesized coordinate, two hidden variables with Gaussian potentials are introduced: the true coordinate $X_i$ and the assignment $A_i$.
The true coordinates of the detections of the target object are encoded as $X_i$, $A_i$ assigns each detection $b_i$ to its corresponding hypothesized coordinate, and the joint distribution of $(X, A)$ is defined as formula (1):
$p(X, A \mid B) \propto \prod_i \mathcal{N}(b_i;\, X_{A_i},\, \sigma)\, p(A_i) \qquad (1)$
where σ is a fixed standard-deviation parameter.
Modeling with formula (1) produces bounding-box coordinate predictions for the target-object ROI; each position coordinate $b_i$ on the feature map F belongs to a true detection coordinate j (j may equal i), and at this true coordinate the observation $b_i$ should lie close to $X_j$.
A mean-field approximation computes the marginal distributions of $A_i$ and $X_i$ through the factorized distribution of formula (2):
$Q(X, A) = \prod_j \mathcal{N}(X_j;\, \mu_j,\, \sigma) \prod_i \mathrm{Cat}(A_i;\, \eta_i) \qquad (2)$
where $\mu_j$ and $\eta_i$ are the variational parameters of the Gaussian and categorical distributions, respectively, and Cat denotes a categorical distribution.
Minimizing the KL divergence between the factorized distribution of formula (2) and the joint distribution of formula (1), the parameters of the marginal distribution Q(·) follow the fixed-point update of formula (3):
$\eta_{ij}^{(m)} \propto \exp\!\big(-\|b_i - \mu_j^{(m)}\|^2 / 2\sigma^2\big), \qquad \hat{\mu}_j^{(m+1)} = \frac{\sum_i \eta_{ij}^{(m)} b_i}{\sum_i \eta_{ij}^{(m)}} \qquad (3)$
where m is the iteration index; the assignments are re-parameterized to give $\eta_{ij}$, and iteration starts from an initial value $\mu^{0}$ until formula (3) converges. In practical experiments, estimation of $\hat{B}$ starts from the initialized $\mu^{0}$ and considers only positions whose segmentation probability satisfies $P_i > \rho$, where ρ is a fixed threshold.
To stabilize the variational parameters $\mu_j$ of the Gaussian distribution over the iterations m, a smoothed update is performed using formula (4):
$\mu_j^{(m+1)} = \lambda\, \mu_j^{(m)} + (1 - \lambda)\, \hat{\mu}_j^{(m+1)} \qquad (4)$
where λ is a damping parameter.
The embodiment of the invention uses a simple iteration scheme similar to that used in Hough forests to identify the hypotheses: first the hypothesis to which the most coordinates are assigned is found, then it is decided which positions to remove, and the procedure iterates until all coordinates are assigned. The number of assigned coordinates is used as a confidence score, ROI coordinates whose confidence score exceeds a preset threshold are retained as the final ROI set, and N groups of reliable detections are obtained, with bounding boxes encoded as $b^1, \ldots, b^N$; N depends on the number of athletes and will generally be less than the number of athletes.
Step S40: performing single-person behavior recognition and group behavior recognition simultaneously based on the final ROI set of the target object; for single-person behavior recognition, performing temporal reasoning on the ROI time series of the target object and obtaining the single-person behavior prediction through two fully connected layers and a Softmax layer; for group behavior recognition, performing temporal modeling of the group behavior with the ROI-matching recurrent convolutional network to generate the group behavior prediction.
For single-person behavior recognition, temporal reasoning is performed on the ROI time series of the target object. A main region r is set, containing the ROI of the target object to be identified, and two secondary regions S are set as context clues for reasoning about the behavior of the main region r.
Based on the ROI set of each image in the video, the scores of the ROI set are computed through two fully connected layers and max pooling, and a Softmax layer finally predicts the single-person action label.
Behavior recognition is performed for the target object in the video image I at time t: the ROI containing the target object at time t serves as the main region r, and the ROIs containing the target object at times t-1 and t-2 serve as the secondary regions S for reasoning about the behavior of the target object in frame t. $I_t$ is the image at the current time t, r is the main region of $I_t$ containing the target object, and the score of action α of the target object is defined as formula (5):
$\mathrm{score}(\alpha; I_t) = w_\alpha^{\top} \Phi(r; I_t) + \max\!\big(w_\alpha'^{\top} \Phi(s_1; I_{t-1}),\; w_\alpha'^{\top} \Phi(s_2; I_{t-2})\big) \qquad (5)$
where $\Phi(r; I_t)$ is the feature vector extracted from the main region r of $I_t$, $\Phi(s_1; I_{t-1})$ and $\Phi(s_2; I_{t-2})$ are the feature vectors extracted from the secondary regions of $I_{t-1}$ and $I_{t-2}$, $s_1$, $s_2$ denote the ROIs of the target object in frames $I_{t-1}$ and $I_{t-2}$, $w_\alpha$ and $w_\alpha'$ respectively denote the weights for the target-object ROI belonging to action α at the current time t and at times t-1, t-2, and max takes the maximum value; the feature extractor Φ(·) and the weights $w_\alpha$, $w_\alpha'$ are obtained by stochastic gradient descent training.
Max pooling selects the largest score among the secondary regions S at times t-1 and t-2, which is added to the score of the main region to obtain the final score; the final score is converted into a posterior probability at the Softmax layer, and the single-person action label is predicted to obtain the single-person prediction result.
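The score of formula (5) and its conversion to a posterior is small enough to sketch directly; the feature dimension, the number of actions, and the sharing of one secondary weight across t-1 and t-2 are assumptions made for this example.

    import torch

    def single_person_posterior(phi_r, phi_s1, phi_s2, w, w_sec):
        """Sketch of formula (5). phi_r: (D,) feature of the main region r at time t;
        phi_s1, phi_s2: (D,) features of the secondary regions at t-1 and t-2;
        w, w_sec: (A, D) per-action weights for the primary and secondary terms."""
        primary = w @ phi_r                                        # w_alpha . Phi(r; I_t)
        secondary = torch.maximum(w_sec @ phi_s1, w_sec @ phi_s2)  # max over secondary regions
        return torch.softmax(primary + secondary, dim=0)           # posterior over action labels

    D, A = 512, 9  # assumed feature size and number of single-person actions
    posterior = single_person_posterior(torch.randn(D), torch.randn(D), torch.randn(D),
                                        torch.randn(A, D), torch.randn(A, D))
    label = posterior.argmax()           # predicted single-person action label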
Regarding group behavior recognition, timing information is a crucial feature. Group behavior recognition is therefore based on an ROI-matching recurrent neural network (ROI-match RNN), which can fuse and propagate the information of individual ROIs in the time domain; its network structure is shown in fig. 6. The specific identification process is as follows:
For frame t of the video, given the N detected ROIs $b_t^k$, fixed-size feature representations $f_t^k$ are smoothly extracted from the dense feature map $F_t$ by bilinear interpolation.
Each feature representation $f_t^k$ is passed through a fully connected layer to obtain a more compact embedding $e_t^k \in \mathbb{R}^{D_e}$ as the input of the ROI-matching recurrent network, where $D_e$ is the number of features in the hidden state.
The embodiment of the invention uses a gated recurrent unit (Gated Recurrent Unit, GRU for short) for each ROI in the time series, whose hidden state is denoted $h_t^k$. During training and testing no track assignments are available, so $h_{t-1}^k$ and $e_t^k$ do not necessarily relate to the same person. To solve this problem, the Euclidean distance between individual ROI coordinates at video times t and t-1 is computed, and given the ROI coordinates $b_t$, $b_{t-1}$ the closest match is found as in formula (6):
$\pi(k) = \arg\min_j \|b_t^k - b_{t-1}^j\|_2 \qquad (6)$
and the hidden state of the ROI-matching recurrent convolutional network is updated as in formula (7):
$h_t^k = \mathrm{GRU}\big(e_t^k,\; h_{t-1}^{\pi(k)}\big) \qquad (7)$
Using $e_t$ instead of the bounding-box coordinates $b_t$ makes the model more robust to lost or misassigned detections, so an exact nearest neighbour is not required for the hidden-state update. To obtain the final group-behavior prediction, max pooling is first performed over the hidden representations $h_t^k$, and a SoftMax classifier then yields the group-behavior label.
The video behavior recognition method provided by the embodiment of the invention considers both the consistency of timing information within the group and the differences in individual timing information. Single-person behavior recognition based on ROI temporal reasoning helps extract more discriminative single-person behavior features and improves recognition accuracy, and the ROI-matching recurrent convolutional network can fuse and propagate single-person ROI information in the time domain, providing an effective method for volleyball video behavior recognition.
Example 2
An embodiment of the present invention provides a video behavior recognition system, as shown in fig. 7, including:
the feature extraction module 10 is used for carrying out multi-level feature extraction on the video to be identified. This module performs the method described in step S10 in embodiment 1, and will not be described here.
The ROI initial detection module 20 is configured to initially detect the ROI of the target object by using the depth full convolution network. This module performs the method described in step S20 in embodiment 1, and will not be described here.
An ROI fine-tuning module 30, configured to fine-tune the ROI using a Markov random field to obtain the final ROI set of the target object. This module performs the method described in step S30 in embodiment 1, and will not be described here.
A behavior recognition module 40, configured to perform single-person behavior recognition and group behavior recognition simultaneously based on the final ROI set of the target object; for single-person behavior recognition, performing temporal reasoning on the ROI time series of the target object and obtaining the single-person behavior prediction through two fully connected layers and a Softmax layer; for group behavior recognition, performing temporal modeling of the group behavior with the ROI-matching recurrent convolutional network to generate the group behavior prediction. This module performs the method described in step S40 in embodiment 1, and will not be described here.
The video behavior recognition system provided by the embodiment of the invention considers both the consistency of timing information within the group and the differences in individual timing information. Single-person behavior recognition based on ROI temporal reasoning helps extract more discriminative single-person behavior features and improves recognition accuracy, and the ROI-matching recurrent convolutional network can fuse and propagate single-person ROI information in the time domain, providing an effective method for volleyball video behavior recognition.
Example 3
Embodiments of the present invention provide a computer device, as shown in fig. 8, which may include a processor 51 and a memory 52, where the processor 51 and the memory 52 may be connected by a bus or otherwise, fig. 8 being an example of a connection via a bus.
The processor 51 may be a central processing unit (Central Processing Unit, CPU). The processor 51 may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or combinations thereof.
The memory 52 serves as a non-transitory computer readable storage medium that may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as corresponding program instructions/modules in embodiments of the present invention. The processor 51 executes various functional applications of the processor and data processing by running non-transitory software programs, instructions, and modules stored in the memory 52, that is, implements the video behavior recognition method in the above-described method embodiment 1.
Memory 52 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created by the processor 51, etc. In addition, memory 52 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 52 may optionally include memory located remotely from processor 51, which may be connected to processor 51 via a network. Examples of such networks include, but are not limited to, the internet, intranets, mobile communication networks, and combinations thereof.
One or more modules are stored in the memory 52 that, when executed by the processor 51, perform the video behavior recognition method of embodiment 1.
The details of the above computer device may be correspondingly understood by referring to the corresponding related descriptions and effects in embodiment 1, and will not be repeated here.
It will be appreciated by those skilled in the art that all or part of the flow of the above embodiment method may be implemented by a computer program instructing related hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the flows of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk (HDD), or a solid-state drive (SSD); the storage medium may also comprise a combination of the above kinds of memories.
It is apparent that the above examples are given by way of illustration only and are not limiting of the embodiments. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. And obvious variations or modifications thereof are contemplated as falling within the scope of the present invention.

Claims (6)

1. A video behavior recognition method, comprising the steps of:
performing multi-stage feature extraction on the video to be identified;
initially detecting the ROI of the target object using a deep fully convolutional network;
fine-tuning the ROI using a Markov random field to obtain the final ROI set of the target object;
performing single-person behavior recognition and group behavior recognition simultaneously based on the final ROI set of the target object; for single-person behavior recognition, performing temporal reasoning on the ROI time series of the target object and obtaining the single-person behavior prediction through two fully connected layers and a Softmax layer; for group behavior recognition, performing temporal modeling of the group behavior with the ROI-matching recurrent convolutional network to generate the group behavior prediction;
the process for identifying the single person behavior, carrying out time sequence reasoning on the ROI time sequence of the target object, and obtaining a prediction result of the single person behavior by accessing two full-connection layers and a Softmax layer comprises the following steps:
setting a main areaWherein the ROI area containing the target object to be identified is provided with two secondary areas simultaneouslyAs contextual clues for reasoning about main region +.>Behavior of (2);
based on the ROI set of each image in the video, calculating the score of the ROI set through two full-connection layers and the maximum pooling, and finally predicting a single action label through a Softmax layer;
for timePerforming behavior recognition on target objects in video image I, < >>ROI comprising the target object as main region +.>Will->The ROI of the target object is included as a secondary region +.>For reasoning->Behavior of frame target object,/->Is the current time +.>Image (S)/(S)>Is->Comprises a main area of the target object, and the action of the target object is->The score of (2) is defined as equation (5):
(5)
wherein ,is from->Middle main area->Extracted feature vector, < > and-> and />Is from->,Middle minor region->Extracted feature vector, < > and->Represents->,/>ROI, & gt of frame target object> and />Respectively represent the current time +.> and />The ROI area of the target object belonging to the action->Weight of->Representing the feature vector +.>Weight-> and />Is obtained by using random gradient descent training;
time pooling by maximaIs->The score of the largest region is selected and added with the score of the main region to obtain the final score, and the final score is converted into posterior probability at a Softmax layerAnd predicting a single action label to obtain a single prediction result.
2. The video behavior recognition method according to claim 1, wherein the process of performing multi-stage feature extraction on the video to be identified comprises:
concatenating a plurality of intermediate feature maps of the video to be identified using a multi-stage fully convolutional network to generate dense features;
scaling the dense features by bilinear interpolation to a fixed size $H \times W$, where H is the pixel height and W is the pixel width.
3. The video behavior recognition method according to claim 2, wherein the process of initially detecting the ROI of the target object using the deep fully convolutional network comprises:
performing target-object detection on the video to be identified with the deep fully convolutional network, taking each target-object area as an ROI, and generating a set of ROI coordinates with corresponding confidence scores;
on the premise that the dense features F are output by the multi-stage feature extraction stage, generating a dense feature map B and a dense feature map P for the single-person target region, wherein the dense feature map B encodes the ROI coordinates of each person in the scene relative to positions in the image, and the dense feature map P encodes the probability that the part of the image containing the ROI is a target object.
4. A video behavior recognition system, comprising:
a feature extraction module, configured to perform multi-stage feature extraction on the video to be identified;
an ROI initial detection module, configured to initially detect the ROI of the target object using a deep fully convolutional network;
an ROI fine-tuning module, configured to fine-tune the ROI using a Markov random field to obtain the final ROI set of the target object;
a behavior recognition module, configured to perform single-person behavior recognition and group behavior recognition simultaneously based on the final ROI set of the target object; for single-person behavior recognition, performing temporal reasoning on the ROI time series of the target object and obtaining the single-person behavior prediction through two fully connected layers and a Softmax layer; for group behavior recognition, performing temporal modeling of the group behavior with the ROI-matching recurrent convolutional network to generate the group behavior prediction;
wherein the process of performing temporal reasoning on the ROI time series of the target object for single-person behavior recognition and obtaining the single-person behavior prediction through two fully connected layers and a Softmax layer comprises the following steps:
setting a main region r containing the ROI of the target object to be identified, and simultaneously setting two secondary regions S as context clues for reasoning about the behavior of the main region r;
based on the ROI set of each image in the video, computing the scores of the ROI set through two fully connected layers and max pooling, and finally predicting the single-person action label through a Softmax layer;
performing behavior recognition for the target object in the video image I at time t, with the ROI containing the target object at time t as the main region r and the ROIs containing the target object at times t-1 and t-2 as the secondary regions S for reasoning about the behavior of the target object in frame t, where $I_t$ is the image at the current time t and r is the main region of $I_t$ containing the target object, the score of action α of the target object being defined as formula (5):
$\mathrm{score}(\alpha; I_t) = w_\alpha^{\top} \Phi(r; I_t) + \max\!\big(w_\alpha'^{\top} \Phi(s_1; I_{t-1}),\; w_\alpha'^{\top} \Phi(s_2; I_{t-2})\big) \qquad (5)$
wherein $\Phi(r; I_t)$ is the feature vector extracted from the main region r of $I_t$, $\Phi(s_1; I_{t-1})$ and $\Phi(s_2; I_{t-2})$ are the feature vectors extracted from the secondary regions of $I_{t-1}$ and $I_{t-2}$, $s_1$, $s_2$ denote the ROIs of the target object in frames $I_{t-1}$ and $I_{t-2}$, $w_\alpha$ and $w_\alpha'$ respectively denote the weights for the target-object ROI belonging to action α at the current time t and at times t-1, t-2, max takes the maximum value, and the feature extractor Φ(·) and the weights $w_\alpha$, $w_\alpha'$ are obtained by stochastic gradient descent training;
selecting, through max pooling, the largest score among the secondary regions S at times t-1 and t-2, adding it to the score of the main region to obtain the final score, converting the final score into a posterior probability at the Softmax layer, and predicting the single-person action label to obtain the single-person prediction result.
5. A computer-readable storage medium storing computer instructions for causing a computer to perform the video behavior recognition method of any one of claims 1-3.
6. A computer device, comprising: a memory and a processor, the memory and the processor being communicatively coupled to each other, the memory storing computer instructions, the processor executing the computer instructions to perform the video behavior recognition method of any one of claims 1-3.
CN202010845486.4A 2020-08-20 2020-08-20 Video behavior recognition method and system Active CN112131944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010845486.4A CN112131944B (en) 2020-08-20 2020-08-20 Video behavior recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010845486.4A CN112131944B (en) 2020-08-20 2020-08-20 Video behavior recognition method and system

Publications (2)

Publication Number Publication Date
CN112131944A (en) 2020-12-25
CN112131944B (en) 2023-10-17

Family

ID=73850455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010845486.4A Active CN112131944B (en) 2020-08-20 2020-08-20 Video behavior recognition method and system

Country Status (1)

Country Link
CN (1) CN112131944B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111838A (en) * 2021-04-25 2021-07-13 上海商汤智能科技有限公司 Behavior recognition method and device, equipment and storage medium
CN114298183B (en) * 2021-12-20 2024-04-05 江西洪都航空工业集团有限责任公司 Intelligent recognition method for flight actions
CN116469155A (en) * 2022-01-11 2023-07-21 北京大学 Complex action recognition method and device based on learnable Markov logic network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017015947A1 (en) * 2015-07-30 2017-02-02 Xiaogang Wang A system and a method for object tracking
CN106529467A (en) * 2016-11-07 2017-03-22 南京邮电大学 Group behavior identification method based on multi-feature fusion
CN110796081A (en) * 2019-10-29 2020-02-14 深圳龙岗智能视听研究院 Group behavior identification method based on relational graph analysis
CN111401174A (en) * 2020-03-07 2020-07-10 北京工业大学 Volleyball group behavior identification method based on multi-mode information fusion

Also Published As

Publication number Publication date
CN112131944A (en) 2020-12-25

Similar Documents

Publication Publication Date Title
WO2020221298A1 (en) Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
Wang et al. Detect globally, refine locally: A novel approach to saliency detection
CN109977262B (en) Method and device for acquiring candidate segments from video and processing equipment
CN110321813B (en) Cross-domain pedestrian re-identification method based on pedestrian segmentation
CN112131944B (en) Video behavior recognition method and system
CN111476302B (en) fast-RCNN target object detection method based on deep reinforcement learning
CN109426805B (en) Method, apparatus and computer program product for object detection
CN111161311A (en) Visual multi-target tracking method and device based on deep learning
EP1934941B1 (en) Bi-directional tracking using trajectory segment analysis
Zhao et al. Closely coupled object detection and segmentation
Esmaeili et al. Fast-at: Fast automatic thumbnail generation using deep neural networks
CN111783576A (en) Pedestrian re-identification method based on improved YOLOv3 network and feature fusion
CN111027493A (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN111222487B (en) Video target behavior identification method and electronic equipment
Gu et al. Multiple stream deep learning model for human action recognition
Zhu et al. A novel recursive Bayesian learning-based method for the efficient and accurate segmentation of video with dynamic background
CN106157330B (en) Visual tracking method based on target joint appearance model
Dai et al. Tan: Temporal aggregation network for dense multi-label action recognition
Vainstein et al. Modeling video activity with dynamic phrases and its application to action recognition in tennis videos
Liu et al. 3d-queryis: A query-based framework for 3d instance segmentation
CN111291785A (en) Target detection method, device, equipment and storage medium
CN108257148B (en) Target suggestion window generation method of specific object and application of target suggestion window generation method in target tracking
Şah et al. Review and evaluation of player detection methods in field sports: Comparing conventional and deep learning based methods
CN113762041A (en) Video classification method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant