CN111898421A - Regularization method for video behavior recognition - Google Patents

Regularization method for video behavior recognition Download PDF

Info

Publication number
CN111898421A
CN111898421A (application CN202010560716.2A)
Authority
CN
China
Prior art keywords
feature map
channel
regularization
mask
behavior recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010560716.2A
Other languages
Chinese (zh)
Other versions
CN111898421B (en)
Inventor
Zhang Yu
Mi Siya
Chen Zhengjie
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202010560716.2A
Publication of CN111898421A
Application granted
Publication of CN111898421B
Legal status: Active (granted)

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Abstract

The invention discloses a regularization method for video behavior recognition. First, global average pooling is used to evaluate the saliency of the feature map at each time step, and the gESD (generalized extreme studentized deviate) test is used to determine whether a feature map containing the most salient spatial features exists. Within the selected feature map, with the channel as the minimum unit, the drop probability of each channel is computed from each channel's share of the activation values and the drop operation is executed (the activation values of the dropped channels are set to zero). Finally, because the regularization module takes effect only in the training stage, a compensation coefficient is computed and multiplied into the output feature map of the training stage to keep the output activation magnitudes of the training and inference stages consistent. The method can effectively improve the validation-set accuracy of video recognition networks without adding any extra inference-time computation, can be added to any existing neural network architecture, and effectively alleviates the problem that networks overfit spatial features while neglecting temporal features in video recognition tasks.

Description

Regularization method for video behavior recognition
Technical Field
The application relates to the field of regularization, and in particular to a regularization method for video behavior recognition.
Background
Deep neural networks perform well on many complex machine learning tasks. However, because deep architectures require large amounts of diverse data and contain large numbers of parameters, a deep neural network may in some cases overfit limited data, so that the trained network performs poorly on validation samples outside the training set. The resulting loss of generalization and stability has been a ubiquitous challenge for machine learning algorithms. Overfitting usually occurs when training a network with relatively excessive parameters: the trained network fits the training data almost perfectly and the loss may approach 0, yet it fails to generalize to new data samples and therefore predicts them poorly. To address these limitations, many regularization techniques have been proposed, which can greatly improve the generalization and convergence of a model.
Regularization is one of the important components of machine learning, and of deep learning in particular; it is often used to prevent a network with relatively many parameters from overfitting limited data during training. Regularization aims to reduce test-set error rather than training-set error, enhancing a model's generalization by discouraging coefficients that perfectly fit the training samples. In general, increasing the number of training samples is an effective means of preventing overfitting; data augmentation, L1 regularization, L2 regularization, Dropout, DropConnect, and early stopping are also commonly used.
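For reference, the core of standard (inverted) Dropout can be sketched in a few lines of PyTorch. This is an illustrative sketch only, not part of the claimed method; the rescaling by 1/(1 − p) at training time is what keeps training- and inference-time activation magnitudes consistent, the same consideration addressed by the compensation coefficient introduced later in this disclosure:

```python
import torch

x = torch.randn(8, 64)                   # a batch of activations
p = 0.5                                  # drop probability
mask = (torch.rand_like(x) > p).float()  # Bernoulli keep/drop mask
y_train = x * mask / (1.0 - p)           # training: drop, then rescale
y_eval = x                               # inference: dropout is disabled
```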
However, existing commonly used regularization techniques do not exploit the characteristics of video data for targeted optimization (for example, the time dimension that distinguishes video data from image data), so their regularization effect on video tasks is limited. In practice, tasks over video data are plentiful, and neural network models for video tasks have larger numbers of parameters and therefore overfit more easily, so a regularization method for video behavior recognition is urgently needed.
Disclosure of Invention
Purpose of the invention: to solve the problems in the prior art, to apply appropriate regularization to deep space-time neural networks used for video behavior recognition tasks, and to effectively improve model generalization and stability, the invention provides a regularization method for video behavior recognition.
The technical scheme is as follows: a regularization method for video behavior recognition, comprising the steps of:
Step one: after feature extraction through a space-time convolutional neural network, obtain from the last layer a feature map of size H × W × C at each of N time steps, where H × W is the spatial size and C is the number of channels; record the feature map of the i-th time step as v_i, where i = 1, …, N;
Step two: with the time step as the unit, obtain the significance score s_i of the i-th feature map by 3D global average pooling:

s_i = (1/(H·W·C)) Σ_{h=1}^{H} Σ_{w=1}^{W} Σ_{c=1}^{C} v_i(h, w, c)
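A minimal PyTorch sketch of this step, assuming the N feature maps are stored as a single tensor of shape (N, C, H, W) (the layout is an assumption for illustration):

```python
import torch

def saliency_scores(v: torch.Tensor) -> torch.Tensor:
    # v: (N, C, H, W), one C x H x W feature map per time step
    # 3D global average pooling: mean over channel and both spatial dims
    return v.mean(dim=(1, 2, 3))  # shape (N,), one score s_i per time step
```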
Step three: after the significance scores of the N feature maps are obtained, perform outlier detection with the gESD test. First, compute the test statistic R:

R = max_{1≤i≤N} |s_i − s̄| / σ

where s̄ = (1/N) Σ_{i=1}^{N} s_i is the mean of the N significance scores and σ is their sample standard deviation.
Step four: the threshold λ is then calculated as follows:
Figure BDA0002546195740000025
wherein t isp,N-2Is a 100p quantile from the N-2 degree of freedom t distribution. And p is derived from the significance level α:
Figure BDA0002546195740000026
The test statistic R is then compared with the critical value λ: if R > λ, a salient spatial feature map exists among the batch of N feature maps and step five is executed; otherwise no such feature map exists and step six is executed.
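Steps three and four together amount to a single round of the gESD (generalized extreme studentized deviate) test. A sketch using SciPy's t-distribution quantile follows; the mean and standard-deviation conventions are the standard ones for this test and are assumed here (N ≥ 3 is required for the degrees of freedom):

```python
import numpy as np
from scipy import stats

def has_salient_map(s: np.ndarray, alpha: float = 0.05) -> bool:
    n = len(s)
    r = np.max(np.abs(s - s.mean())) / s.std(ddof=1)   # test statistic R
    p = 1.0 - alpha / (2.0 * n)                        # from significance level
    t = stats.t.ppf(p, df=n - 2)                       # 100p quantile, N-2 dof
    lam = (n - 1) * t / np.sqrt(n * (n - 2 + t ** 2))  # critical value lambda
    return r > lam                                     # True -> step five
```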
Step five: after the significant spatial feature map is selected, in order to efficiently discard the significant spatial features, 2D global average pooling is performed by taking the channel as a minimum unit to obtain a significant score of a corresponding channel, and a corresponding discard probability is set for each channel according to the significant score, wherein the discard probability of the c channel at the ith time step is calculated as:
Figure BDA0002546195740000027
wherein, PsalIs a function to ensure that the expected drop probability for all channels of the profile is close to PsalIs preset.
The drop probability of all channels in the feature maps of the remaining time steps is set to P_rest (a value less than P_sal).
Step six: if no significant spatial feature map exists, the feature map discarding probability of all time steps is set as PrestSo as to ensure that the regularization effect takes effect to a certain extent in the whole training process.
Step seven: and randomly generating a tensor mask with a value range of [0,1] consistent with the time dimension of the input feature map, comparing with the discarding probability of all channels, reserving elements with tensor values larger than the corresponding discarding probability, and otherwise discarding (namely, setting zero). Thereby producing a 0-1 mask of the same size.
Step eight: and calculating a compensation coefficient to multiply the 0-1 mask in the step seven to increase the amplitude of the output activation value in order to keep the consistency of the amplitude of the output activation value of the training stage and the inference stage. For efficiency, a global compensation factor β (β ≧ 1) is calculated as follows:
Figure BDA0002546195740000031
The compensation coefficient is multiplied into the 0-1 mask (forming a 0-β mask), and the mask is multiplied element-wise with the upper layer's output feature map, finally yielding the regularized input feature map for the next layer.
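Steps seven and eight can be sketched as follows; computing β as total elements over retained elements is an inverted-dropout-style compensation and is an assumption consistent with the stated β ≥ 1:

```python
import torch

def drop_and_compensate(v: torch.Tensor, probs: torch.Tensor) -> torch.Tensor:
    # v: (N, C, H, W); probs: (N, C) per-channel drop probabilities
    u = torch.rand_like(probs)                       # random tensor in [0, 1]
    keep = (u > probs).float()                       # step seven: 0-1 mask
    beta = keep.numel() / keep.sum().clamp(min=1.0)  # step eight: beta >= 1
    return v * (beta * keep)[:, :, None, None]       # apply the 0-beta mask
```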
Further, the technique for obtaining the significance score with the time step as the minimum unit in steps one and two may be 3D global average pooling or another related algorithm.
Further, in step three, outlier detection is performed on the N significance scores obtained in step two to determine whether a salient spatial feature map exists; the outlier detection algorithm may be the gESD test or another related algorithm.
Further, in step five, within the selected salient spatial feature map, the algorithm for obtaining the significance score with the channel as the minimum unit may be 2D global average pooling or another related algorithm.
Further, in step five, the expected drop probability of any channel lies in [0, 1], and the overall expectation is close to the preset hyper-parameter.
Further, in step six, when no salient spatial feature map exists, a uniform drop probability is assigned to the feature maps at all time steps to ensure that the regularization takes effect throughout training.
Further, in step seven, a tensor mask with values in [0, 1] and shape consistent with the time dimension of the input feature map is generated completely at random and compared with the drop probabilities of all channels, producing a 0-1 mask of the same size.
Further, in step eight, a compensation coefficient is computed and multiplied into the output mask to maintain consistency of the output activation magnitudes between the training and inference stages.
Beneficial effects:
compared with the prior art, the regularization method for video behavior recognition is more suitable for regularizing video recognition tasks. The regularization method only takes effect in the model training stage, can be easily added into any conventional convolutional network model and only generates little extra computational consumption, and does not take effect in the inference stage, namely does not bring any extra computational consumption of the inference stage. The method can effectively improve the generalization and stability of the space-time neural network model for video identification, and effectively solves the problem of overfitting when the space-time neural network model with huge parameter quantity is applied to a video identification task.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the positions in a residual network at which the regularization module of the present invention can be inserted.
Detailed Description
The invention will be described in further detail with reference to the following detailed description and accompanying drawings:
This embodiment provides a regularization method for video behavior recognition. The method can be added to any existing neural network architecture without increasing any model parameters, introduces only a small amount of computation in the training stage, and can effectively improve the generalization and stability of space-time neural network models for video recognition.
The flow of the method is shown in figure 1:
Step one: after feature extraction through a space-time convolutional neural network, obtain from the last layer a feature map of size H × W × C at each of N time steps, where H × W is the spatial size and C is the number of channels; record the feature map of the i-th time step as v_i, where i = 1, …, N;
As shown in FIG. 2, a schematic diagram of the positions at which the proposed regularization module can be inserted into an existing residual network: since the input and output sizes of the module are unchanged, it can in theory be added at any position of an existing network architecture. Table 1 below shows the validation-set accuracy of each insertion scheme of FIG. 2 on the Something-Something dataset.
Scheme        Validation set accuracy
Baseline      47.8%
Scheme (a)    48.2%
Scheme (b)    49.2%
Scheme (c)    49.8%
Scheme (d)    49.0%

TABLE 1
As can be seen from Table 1, scheme (c) achieves the highest accuracy, but all schemes are more accurate than the baseline, so a well-chosen insertion scheme further improves the benefit of the method. The present invention includes, but is not limited to, the four insertion schemes of FIG. 2; in theory the module can be inserted anywhere in an existing network architecture, for example by wrapping an existing block as sketched below.
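Because the module preserves tensor shape, insertion amounts to composing it with an existing block. A hypothetical illustration follows (the class and parameter names are invented; a full sketch of the regularizer itself is given after the detailed steps below):

```python
import torch.nn as nn

class RegularizedBlock(nn.Module):
    """Wraps any shape-preserving block with a regularization module."""
    def __init__(self, block: nn.Module, reg: nn.Module):
        super().__init__()
        self.block, self.reg = block, reg

    def forward(self, x):
        return self.reg(self.block(x))  # regularize the block's output
```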
Step two: then, by taking the time step as a unit, obtaining the significance score s of the ith feature map by using a 3D global average pooling technologyiObtained as follows:
Figure BDA0002546195740000051
Figure BDA0002546195740000052
Step three: after the significance scores of the N feature maps are obtained, perform outlier detection with the gESD test. First, compute the test statistic R:

R = max_{1≤i≤N} |s_i − s̄| / σ

where s̄ = (1/N) Σ_{i=1}^{N} s_i is the mean of the N significance scores and σ is their sample standard deviation.
Step four: the threshold λ is then calculated as follows:
Figure BDA0002546195740000055
wherein t isp,N-2Is a 100p quantile from the N-2 degree of freedom t distribution. And p is derived from the significance level α:
Figure BDA0002546195740000056
The test statistic R is then compared with the critical value λ: if R > λ, a salient spatial feature map exists among the batch of N feature maps and step five is executed; otherwise no such feature map exists and step six is executed.
Step five: after the significant spatial feature map is selected, in order to efficiently discard the significant spatial features, 2D global average pooling is performed by taking the channel as a minimum unit to obtain a significant score of a corresponding channel, and a corresponding discard probability is set for each channel according to the significant score, wherein the discard probability of the c channel at the ith time step is calculated as:
Figure BDA0002546195740000057
wherein, PsalIs a function to ensure that the expected drop probability for all channels of the profile is close to PsalIs preset.
The drop probability of all channels in the feature maps of the remaining time steps is set to P_rest (a value less than P_sal).
Step six: if no significant spatial feature map exists, the feature map discarding probability of all time steps is set as PrestSo as to ensure that the regularization effect takes effect to a certain extent in the whole training process.
Step seven: and randomly generating a tensor mask with a value range of [0,1] consistent with the time dimension of the input feature map, comparing with the discarding probability of all channels, reserving elements with tensor values larger than the corresponding discarding probability, and otherwise discarding (namely, setting zero). Thereby producing a 0-1 mask of the same size.
Step eight: and calculating a compensation coefficient to multiply the 0-1 mask in the step seven to increase the amplitude of the output activation value in order to keep the consistency of the amplitude of the output activation value of the training stage and the inference stage. For efficiency, a global compensation factor β (β ≧ 1) is calculated as follows:
Figure BDA0002546195740000061
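Pulling steps one through eight together, a consolidated sketch of the module as a PyTorch layer follows. The class name, hyper-parameter defaults, tensor layout (N, C, H, W), and the exact drop-probability and compensation formulas are assumptions for illustration; the layer is the identity in evaluation mode, matching the claim that no inference-time cost is added:

```python
import numpy as np
import torch
import torch.nn as nn
from scipy import stats

class SaliencyDrop(nn.Module):
    # Hypothetical implementation of the proposed regularization module.
    def __init__(self, p_sal: float = 0.3, p_rest: float = 0.1,
                 alpha: float = 0.05):
        super().__init__()
        self.p_sal, self.p_rest, self.alpha = p_sal, p_rest, alpha

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        if not self.training:                 # identity at inference time
            return v
        n, c = v.shape[0], v.shape[1]         # v: (N, C, H, W), N >= 3 assumed
        s = v.mean(dim=(1, 2, 3))             # step two: saliency per time step
        probs = torch.full((n, c), self.p_rest, device=v.device)
        idx = self._gesd_outlier(s.detach().cpu().numpy())
        if idx is not None:                   # steps three/four: salient map found
            s_c = v[idx].mean(dim=(1, 2))     # step five: 2D GAP per channel
            p = self.p_sal * c * s_c / s_c.sum()
            probs[idx] = p.clamp(0.0, 1.0)
        keep = (torch.rand_like(probs) > probs).float()   # step seven: 0-1 mask
        beta = keep.numel() / keep.sum().clamp(min=1.0)   # step eight: global beta
        return v * (beta * keep)[:, :, None, None]

    def _gesd_outlier(self, s: np.ndarray):
        # One round of the gESD test; returns the salient map's index or None.
        n = len(s)
        sd = s.std(ddof=1)
        if sd == 0.0:
            return None
        r = np.abs(s - s.mean()) / sd
        t = stats.t.ppf(1.0 - self.alpha / (2.0 * n), df=n - 2)
        lam = (n - 1) * t / np.sqrt(n * (n - 2 + t ** 2))
        return int(r.argmax()) if r.max() > lam else None
```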
Table 2 below compares the validation-set accuracy of the proposed method with other regularization methods on the Something-Something dataset.
Method                          Validation set accuracy
Baseline                        47.8%
Dropout                         48.3%
SpatialDropout                  48.6%
StochasticDropPath              48.2%
DropBlock                       48.4%
WCD                             48.7%
Proposed regularization method  49.8%

TABLE 2
As can be seen from Table 2, the proposed regularization method outperforms the other existing regularization methods, showing that a method designed around the characteristics of video data achieves a better regularization effect on video recognition tasks.
In this embodiment, the proposed regularization method was added to the conventional space-time convolutional neural network I3D to perform a behavior recognition study on the public video dataset Something-Something. After feature extraction and classification, behavior recognition performance was evaluated by validation-set classification accuracy; the recognition performance is shown in Tables 1 and 2. With the regularization method added, the space-time convolutional neural network effectively improves validation-set accuracy relative to the baseline model, especially on a video dataset centered on motion rather than on static appearance. This shows that the proposed regularization method has a good regularization effect on the behavior recognition task of existing space-time convolutional neural networks, and that it improves the network's ability to extract features along the time dimension. Since behavior categories in real life are identified mainly from dynamic transition information rather than static appearance information, the method has strong practical application value.

Claims (8)

1. A regularization method for video behavior recognition, comprising the steps of:
Step one: after feature extraction through a space-time convolutional neural network, obtain from the last layer a feature map of size H × W × C at each of N time steps, where H × W is the spatial size and C is the number of channels, and record the feature map of the i-th time step as v_i, where i = 1, …, N;
Step two: with the time step as the unit, obtain the significance score s_i of the i-th feature map by 3D global average pooling:

s_i = (1/(H·W·C)) Σ_{h=1}^{H} Σ_{w=1}^{W} Σ_{c=1}^{C} v_i(h, w, c);
Step three: after the significance scores of the N feature maps are obtained, perform outlier detection with the gESD test; first, compute the test statistic R:

R = max_{1≤i≤N} |s_i − s̄| / σ

where s̄ = (1/N) Σ_{i=1}^{N} s_i is the mean of the N significance scores and σ is their sample standard deviation;
Step four: compute the critical value λ:

λ = ((N−1) · t_{p,N−2}) / sqrt(N · (N−2 + t_{p,N−2}²))

where t_{p,N−2} is the 100p percentile of the t distribution with N−2 degrees of freedom, and p is derived from the significance level α as p = 1 − α/(2N);
then compare the test statistic R with the critical value λ: if R > λ, a salient spatial feature map exists among the batch of N feature maps and step five is executed; otherwise no such feature map exists and step six is executed;
Step five: after the salient spatial feature map is selected, in order to discard the salient spatial features efficiently, perform 2D global average pooling with the channel as the minimum unit to obtain a saliency score s_{i,c} for each channel, and set a drop probability for each channel according to that score; the drop probability of the c-th channel at the i-th time step is computed as:

P_drop(i, c) = P_sal · C · s_{i,c} / Σ_{c'=1}^{C} s_{i,c'}

where P_sal is a preset hyper-parameter ensuring that the expected drop probability over all channels of the feature map is close to P_sal,
and set the drop probability of all channels in the feature maps of the remaining time steps to P_rest, a value less than P_sal;
Step six: if no significant spatial feature map exists, the feature map discarding probability of all time steps is set as PrestTo ensure that the regularization effect takes effect to a certain extent in the whole training process;
Step seven: randomly generate a tensor mask with values in [0, 1] whose shape is consistent with the time dimension of the input feature map, compare it with the drop probabilities of all channels, retain the elements whose values are greater than the corresponding drop probability, and discard the rest, thereby generating a 0-1 mask of the same size;
Step eight: compute a compensation coefficient and multiply it into the 0-1 mask of step seven to raise the output activation magnitude, computing a global compensation coefficient β (β ≥ 1) as:

β = (total number of elements in the mask) / (number of retained nonzero elements in the mask);
multiply the compensation coefficient into the 0-1 mask to form a 0-β mask, and multiply the mask element-wise with the upper layer's output feature map, finally obtaining the regularized input feature map for the next layer.
2. The regularization method for video behavior recognition according to claim 1, wherein the technique for obtaining the significance score with the time step as the minimum unit in steps one and two adopts a 3D global average pooling algorithm.
3. The regularization method for video behavior recognition according to claim 1, wherein in step three, outlier detection is performed on the N significance scores obtained in step two to determine whether a salient spatial feature map exists, the outlier detection algorithm being the gESD test.
4. The regularization method for video behavior recognition according to claim 1, wherein in step five, within the selected salient spatial feature map, the algorithm for obtaining the significance score with the channel as the minimum unit is a 2D global average pooling algorithm.
5. The regularization method for video behavior recognition according to claim 1, wherein in step five, the expected drop probability of any channel lies in [0, 1], and the overall expectation is close to the preset hyper-parameter.
6. The regularization method for video behavior recognition according to claim 1, wherein in step six, when it is determined that no salient spatial feature map exists, a uniform drop probability is assigned to the feature maps at all time steps to ensure that the regularization takes effect throughout training.
7. The regularization method for video behavior recognition according to claim 1, wherein in step seven, a tensor mask with values in [0, 1] and shape consistent with the time dimension of the input feature map is generated completely at random and compared with the drop probabilities of all channels to produce a 0-1 mask of the same size.
8. The regularization method for video behavior recognition according to claim 1, wherein in step eight, a compensation coefficient is computed and multiplied into the output mask to maintain consistency of the output activation magnitudes between the training and inference stages.
CN202010560716.2A 2020-06-18 2020-06-18 Regularization method for video behavior recognition Active CN111898421B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010560716.2A CN111898421B (en) 2020-06-18 2020-06-18 Regularization method for video behavior recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010560716.2A CN111898421B (en) 2020-06-18 2020-06-18 Regularization method for video behavior recognition

Publications (2)

Publication Number Publication Date
CN111898421A 2020-11-06
CN111898421B CN111898421B (en) 2022-11-11

Family

ID=73206876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010560716.2A Active CN111898421B (en) 2020-06-18 2020-06-18 Regularization method for video behavior recognition

Country Status (1)

Country Link
CN (1) CN111898421B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871136A (en) * 2017-03-22 2018-04-03 Sun Yat-sen University Image recognition method for convolutional neural networks based on sparse random pooling
CN107895170A (en) * 2017-10-31 2018-04-10 Tianjin University A Dropout regularization method based on activation value sensitivity
CN108596258A (en) * 2018-04-27 2018-09-28 Nanjing University of Posts and Telecommunications An image classification method based on convolutional neural network random pooling
CN110163302A (en) * 2019-06-02 2019-08-23 Northeast Petroleum University Indicator diagram recognition method based on regularized attention convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhou Anzhong et al., "A sparsity Dropout regularization method for convolutional neural networks," Journal of Chinese Computer Systems *
Cheng Junhua et al., "Improved convolutional neural network model averaging method based on Dropout," Journal of Computer Applications *

Also Published As

Publication number Publication date
CN111898421B (en) 2022-11-11

Similar Documents

Publication Publication Date Title
Dai et al. Compressing neural networks using the variational information bottleneck
He et al. Soft filter pruning for accelerating deep convolutional neural networks
Liu et al. Wasserstein GAN with quadratic transport cost
Chin et al. Incremental kernel principal component analysis
CN109271958B (en) Face age identification method and device
CN110175168B (en) Time sequence data filling method and system based on generation of countermeasure network
CN109726195B (en) Data enhancement method and device
CN113240111B (en) Pruning method based on discrete cosine transform channel importance score
CN111127387A (en) Method for evaluating quality of non-reference image
CN111291810B (en) Information processing model generation method based on target attribute decoupling and related equipment
CN112257738A (en) Training method and device of machine learning model and classification method and device of image
WO2023116632A1 (en) Video instance segmentation method and apparatus based on spatio-temporal memory information
CN111783997A (en) Data processing method, device and equipment
CN111898421B (en) Regularization method for video behavior recognition
Liu et al. Convergence rates of a partition based Bayesian multivariate density estimation method
CN110941542A (en) Sequence integration high-dimensional data anomaly detection system and method based on elastic network
CN107563287B (en) Face recognition method and device
Wang et al. Computing multiple image reconstructions with a single hypernetwork
CN112738724B (en) Method, device, equipment and medium for accurately identifying regional target crowd
CN112420135A (en) Virtual sample generation method based on sample method and quantile regression
CN113255927A (en) Logistic regression model training method and device, computer equipment and storage medium
Lim et al. Analyzing deep neural networks with noisy labels
CN111882441A (en) User prediction interpretation Treeshap method based on financial product recommendation scene
CN113011163A (en) Compound text multi-classification method and system based on deep learning model
CN110084303B (en) CNN and RF based balance ability feature selection method for old people

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant