CN111325140A - Infrared video sequence behavior identification method and device


Info

Publication number
CN111325140A
CN111325140A (application CN202010099461.4A)
Authority
CN
China
Prior art keywords
video
infrared
training
network
visible light
Prior art date
Legal status
Pending
Application number
CN202010099461.4A
Other languages
Chinese (zh)
Inventor
丁萌
吴晓舟
曹云峰
杨汝名
Current Assignee
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202010099461.4A
Publication of CN111325140A
Legal status: Pending

Classifications

    • G06V20/41 — Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 — Higher-level, semantic clustering, classification or understanding of video scenes of sport video content
    • G06V20/49 — Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G06F18/253 — Fusion techniques of extracted features
    • G06N3/045 — Combinations of networks
    • G06N3/08 — Learning methods


Abstract

The embodiment of the invention discloses an infrared video sequence behavior identification method and device in the technical field of image processing. The infrared video sequence behavior identification method comprises the following steps: acquiring spatial motion information of a visible light motion video data set and time motion information corresponding to its optical flow features within a specified time, and acquiring, by migration training, spatial motion information of an infrared motion video and time motion information corresponding to its optical flow features within the specified time; and integrating all the spatial motion information and time motion information to classify the actions in the original infrared motion video clips. The invention can identify video sequence behaviors under infrared conditions and can obtain a better infrared video sequence behavior classification effect.

Description

Infrared video sequence behavior identification method and device
Technical Field
The embodiment of the invention relates to the technical field of image processing, in particular to an infrared video sequence behavior identification method and device.
Background
Human behavior recognition has wide applications in science, technology and daily life, such as video surveillance, human-computer interaction, virtual reality and video retrieval. With video data now growing exponentially, intelligent behavior recognition technology has high research value and broad application prospects.
Infrared video sequences have special application environments and scenes, being suitable for low-illumination conditions such as dusk, night and overcast days when light is scarce. Owing to equipment limitations and differences in the imaging principle, however, research on behavior recognition in infrared video is still insufficient.
Disclosure of Invention
The embodiment of the invention provides an infrared video sequence behavior identification method and device, which are used for solving the technical problems mentioned in the technical background.
The embodiment of the invention provides an infrared video sequence behavior identification method. In one possible solution, the following steps are included:
acquiring spatial motion information of a visible light motion video data set and time motion information corresponding to its optical flow features within a specified time, and acquiring, by migration training, spatial motion information of an infrared motion video and time motion information corresponding to its optical flow features within the specified time; and integrating all the spatial motion information and time motion information to classify the actions in the original infrared motion video clips.
The embodiment of the invention also provides an infrared video sequence behavior identification method. In a possible scheme, the step of acquiring the time motion information corresponding to the optical flow features of the visible light motion video data within the specified time is as follows:
extracting optical flow features between front and back frames of a single video sequence in a visible light data set to obtain a visible light optical flow frame sequence, dividing the visible light optical flow frame sequence into short-time continuous optical flow video frames serving as the input for training the three-dimensional convolutional neural network parameters, and testing the classification effect of the corresponding network on the visible light motion video optical flow sequence.
The embodiment of the invention also provides an infrared video sequence behavior identification method. In one possible scheme, the step of obtaining the spatial motion information of the visible light motion video data within the specified time includes:
dividing a single video sequence in the visible light data set into short-time continuous video frames serving as the input for training the three-dimensional convolutional neural network parameters, and testing the classification effect of the corresponding network on the visible light motion video.
The embodiment of the invention also provides an infrared video sequence behavior identification method. In a possible scheme, the acquisition by migration training of the time motion information corresponding to the optical flow features of the infrared motion video within the specified time comprises the following steps:
extracting optical flow features between front and back frames of a single video sequence in an infrared data set to obtain an infrared optical flow frame sequence, dividing the infrared optical flow frame sequence into short-time continuous optical flow video frames as the input, performing transfer learning according to the visible light network training parameters to obtain the infrared three-dimensional convolutional neural network parameters, and finally training an svm classifier with the features learned by the network to test the classification effect on the infrared motion video optical flow sequence.
The embodiment of the invention also provides an infrared video sequence behavior identification method. In one possible scheme, the acquisition by migration training of the spatial motion information of the infrared motion video within the specified time includes:
dividing a single sequence in the infrared data set into short-time continuous video frames as the input, performing transfer learning according to the visible light network training parameters to obtain the infrared three-dimensional convolutional neural network parameters, and finally training an svm classifier with the features learned by the network to test the classification effect on the infrared motion video.
The embodiment of the invention also provides an infrared video sequence behavior identification method. In one possible scheme, the step of classifying the actions in the original infrared motion video clips comprises:
fusing the features learned by the spatial information network and the time information network that take the infrared video as input; and training an svm classifier with the fused features to perform behavior recognition on the infrared video.
The embodiment of the invention also provides an infrared video sequence behavior identification device. In a feasible scheme, the device comprises an infrared video acquisition module, a basic network parameter training module, an infrared network parameter migration training module and a feature extraction and fusion module;
the infrared video acquisition module is used for acquiring an infrared video motion sequence;
the basic network parameter training module is used for training initial parameters of a time information network and a space information network by means of a visible light data set;
the infrared network parameter migration training module is used for training the infrared time information network and spatial information network parameters on the basis of the visible light video pre-training parameters obtained by the basic network parameter training module;
and the feature extraction and fusion module is used for fusing features extracted by the infrared time and space information network so as to train the svm classifier to realize classification of behaviors in the infrared motion sequence.
The embodiment of the invention also provides an infrared video sequence behavior identification device. In a feasible scheme, the basic network parameter training module comprises a visible light video data preprocessing submodule, a time information network parameter pre-training submodule and a spatial information network parameter pre-training submodule;
the visible light video data preprocessing submodule is used for preprocessing the visible light video data according to the characteristics of the three-dimensional convolution network, and is used for selecting the video frame size and the continuous input frame number and obtaining the optical flow so as to obtain the input data suitable for network training;
the time information network parameter pre-training submodule is used for obtaining the initial parameters of the time information network by training a three-dimensional convolutional neural network with the visible light video optical flow features as input;
and the spatial information network parameter pre-training submodule is used for inputting and training a three-dimensional convolutional neural network by using a visible light original video to obtain initial parameters of the spatial information network.
The embodiment of the invention also provides an infrared video sequence behavior identification device. In a feasible scheme, the infrared network parameter migration training module comprises an infrared video data preprocessing submodule, a time information network parameter pre-training submodule and a spatial information network parameter pre-training submodule;
the infrared video data preprocessing submodule is used for preprocessing infrared video data according to the characteristics of the three-dimensional convolution network, and is used for selecting the video frame size, the continuous input frame number and obtaining the optical flow so as to obtain input data suitable for network training;
the time information network parameter pre-training submodule is used for inputting infrared video optical flow characteristics, training a three-dimensional convolution neural network on the basis of the parameters obtained by the basic network parameter training module, and obtaining the parameters of the time information network;
and the spatial information network parameter pre-training submodule is used for inputting an infrared original video, training a three-dimensional convolutional neural network on the basis of the parameters obtained by the basic network parameter training module, and obtaining the parameters of the spatial information network.
The embodiment of the invention also provides an infrared video sequence behavior identification device. In one possible scheme, the feature extraction and fusion module comprises a feature extraction submodule and an svm classification submodule;
the characteristic extraction submodule is used for extracting the characteristics of the infrared video to be classified, which are obtained by the infrared time information network and the space information network;
and the svm classification submodule is used for training an svm classifier by means of the obtained features to realize the classification of behaviors in the infrared video sequence.
Based on the above scheme, the invention trains the network parameters with a publicly available visible light video data set, using optical flow frames as input to acquire the time information of the action and the original video frames as input to acquire the spatial information of the action. The infrared video networks are then trained on the basis of these network training parameters, which addresses the problems that infrared video data samples are scarce and that network training is difficult and prone to overfitting; likewise, optical flow frames are used as input to acquire the time information of the infrared video action, and original video frames are used as input to acquire its spatial information. The features obtained by the two networks are fused, and an svm classifier is trained on them to realize classification, so that the time information and the spatial information of the action are considered simultaneously and the classification effect is improved. In conclusion, the invention can identify video sequence actions under infrared conditions and can obtain a better infrared video sequence behavior classification effect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without inventive labor.
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a block diagram of an electronic device according to an embodiment of the invention;
FIG. 3 is a flow chart of temporal motion information embodied by optical flow characteristics of a set of visible light motion video data over a period of time in accordance with the present invention;
FIG. 4 is a block diagram of a three-dimensional convolutional neural network of the present invention;
FIG. 5 is a flowchart of acquiring spatial motion information of a visible light motion video data set within a certain time period according to the present invention;
FIG. 6 is a block diagram of another three-dimensional convolutional neural network of the present invention;
FIG. 7 is a flowchart of acquiring time movement information embodied by optical flow characteristics of an infrared motion video within a certain time through migration training according to the present invention;
FIG. 8 is a flowchart of acquiring spatial motion information of an infrared motion video within a certain time period through migration training according to the present invention;
FIG. 9 is a flowchart of the present invention for integrating spatial and temporal motion information to classify the motion of the original infrared motion video clip;
FIG. 10 is a system architecture diagram of the apparatus of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "axial," "radial," "circumferential," and the like are used in the orientations and positional relationships indicated in the drawings for convenience in describing the invention and for simplicity in description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed and operated in a particular orientation, and are not to be construed as limiting the invention.
In the present invention, unless otherwise specifically stated or limited, the terms "mounted," "connected," "fixed," and the like are to be construed broadly and may, for example, be fixedly connected, detachably connected, or integrally formed; the connection can be mechanical connection, electric connection or communication connection; either directly or indirectly through intervening media, either internally or in any other suitable relationship, unless expressly stated otherwise. The specific meaning of the above terms in the present invention can be understood according to specific situations by those of ordinary skill in the art.
The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 1 shows a method for identifying infrared video sequence behaviors according to a first embodiment of the present invention. As shown in fig. 1, the method for identifying infrared video sequence behaviors according to the present embodiment includes the following steps:
acquiring spatial motion information of a visible light motion video data set and time motion information corresponding to its optical flow features within a specified time, and acquiring, by migration training, spatial motion information of an infrared motion video and time motion information corresponding to its optical flow features within the specified time; and integrating all the spatial motion information and time motion information to classify the actions in the original infrared motion video clips.
From the above it is not difficult to see that the network parameters are trained with a public visible light video data set, with optical flow frames used as input to obtain the time information of the action and the original video frames used as input to obtain the spatial information of the action. The infrared video networks are trained on the basis of these network training parameters, which addresses the problems that infrared video data samples are scarce and that network training is difficult and prone to overfitting. Optical flow frames are likewise used as input to obtain the time information of the infrared video action, and original video frames are used as input to obtain its spatial information. The features obtained by the two networks are fused, and an svm classifier is trained to realize classification, so that the time information and the spatial information of the action are considered simultaneously and the classification effect is improved.
Optionally, as shown in fig. 3, in this embodiment, the step of acquiring the time motion information corresponding to the optical flow features of the visible light motion video data within the specified time is as follows:
extracting the optical flow features between front and back frames of a single video sequence in the visible light data set to obtain a visible light optical flow frame sequence, dividing the visible light optical flow frame sequence into short-time continuous optical flow video frames that serve as the input for training the three-dimensional convolutional neural network parameters, and testing the classification effect of the corresponding network on the visible light motion video optical flow sequence. It should be noted that, in this embodiment: 1. when the optical flow features between each pair of consecutive frames of a single video sequence in the visible light data set are extracted to obtain the visible light optical flow frame sequence, the optical flow results in the horizontal and vertical directions are extracted from the visible light original video, recording the moving direction and speed of the moving target pixels; a new optical flow motion video is then synthesized by alternately combining one frame of horizontal optical flow with one frame of vertical optical flow;
2. when the visible light optical flow frame sequence is divided into short-time continuous optical flow video frames, only short video frames can be used as input because of the forgetting problem of the convolutional neural network. A suitable input video frame length t is selected, and the visible light video is split into units of t frames each, with all units from the same video sharing the same label. In this embodiment t is taken to be 8, so every 8 frames form one input unit. Convolutional neural networks, and in particular the fully-connected layers, require input data of uniform size; in this embodiment the input frame size is 128 × 171, 8 consecutive frames form one input unit, and the optical flow video is divided into such 8-frame units.
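The following is a minimal preprocessing sketch of the steps just described. It assumes OpenCV's Farneback dense optical flow as the flow estimator (the patent does not name one) and treats the 128 × 171 frame size and the 8-frame unit length as given; the function name and parameter values are illustrative only.

```python
import cv2
import numpy as np

def build_flow_units(video_path, unit_len=8, frame_size=(171, 128)):
    """Extract horizontal/vertical dense optical flow between consecutive frames,
    interleave the two components as alternating frames, and cut the resulting
    optical-flow video into consecutive 8-frame input units of size 128 x 171."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    flow_frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # one frame of horizontal flow followed by one frame of vertical flow
        flow_frames.append(cv2.resize(flow[..., 0], frame_size))
        flow_frames.append(cv2.resize(flow[..., 1], frame_size))
        prev_gray = gray
    cap.release()
    # split the synthesized optical-flow video into units of 8 consecutive frames
    return [np.stack(flow_frames[i:i + unit_len])
            for i in range(0, len(flow_frames) - unit_len + 1, unit_len)]
```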
3. when training the three-dimensional convolutional neural network parameters, the network adopts a three-dimensional convolutional structure, and the network model is designed according to the input data type as shown in fig. 4. All input data are sized 128 × 171, which is approximately half the size of the video frames in the data set used; 128 × 171 regions are randomly cropped from the original video frames. The network contains 7 convolutional layers, 5 max-pooling layers and 2 fully-connected layers. The convolution kernels of the 7 convolutional layers are all 3 × 3 with sliding stride 1 × 1. The numbers of kernels of the convolutional layers are set to 64, 128, 256 and 512 in this order. Each neuron is responsible for sensing the information within its corresponding region: the information of each pixel in the region is weighted and summed, and the bias of the neuron is added, giving the output of the neuron:
$$y = \sum_i w_i x_i + b$$

where $w_i$ is the weight parameter and $b$ is the bias. In this example the model trained on SPORT-1M by Du Tran et al. is used as the initial weights. Because the SPORT-1M videos are long, 5 short clips of 2 seconds are randomly drawn from each video and resized to the appropriate size, which helps keep our network from overfitting during training. The output of the neuron is then passed through an activation function to produce a non-linear mapping:

$$a_j = f\Big(\sum_i w_{ij} x_i + b_j\Big)$$
where $j$ indexes the neuron. In this embodiment, the ReLU function is selected as the activation function in the network:

$$f(x) = \max(0, x)$$

which reduces the probability of gradient vanishing and lowers the amount of computation.
Specifically, each neuron with a receptive field of size n × n traverses the m × m input picture through a sliding window, finally yielding an output feature map of size (m − n + 1) × (m − n + 1). There are two main reasons for adopting the uniform 3 × 3 convolution kernel size: first, according to related studies, the smaller receptive field of a 3 × 3 neuron is more advantageous for perceiving detail and gives better results in classification and recognition tasks; second, because the data used for training are limited to 8 frames, a large number of small-receptive-field neurons can easily be distributed over every part of the input cube, so that rich feature information is learned.
Because the optical flow video covers almost all pixels whose brightness changes in space, and retains many noise points produced by camera shake and background interference in the video, we want to extract more accurate time-domain information and improve the saliency of the features, so all pooling layers in the network use max pooling. The kernels of pool1 and pool5 have size 1 × 2 × 2, where 1 is the kernel depth on the time axis, meaning that no pooling is performed in time, and their pooling stride is 1 × 2 × 2; the remaining pooling layers have kernel size 2 × 2 × 2 and stride 2 × 2 × 2. This avoids the depth of the data being rapidly reduced to 1 frame after the earlier pooling layers, which would prevent the subsequent convolutional layers from convolving in three dimensions. Finally, two fully-connected layers are attached, each with 4096 output units.
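For illustration, the PyTorch sketch below builds a network consistent with the structure described above: 7 convolutional layers, 5 max-pooling layers of which pool1 and pool5 do not pool in time, and two 4096-unit fully-connected layers. The 3 × 3 kernels are taken to be 3 × 3 × 3 in the temporal dimension, the per-layer channel assignment, the single-channel optical-flow input, and the use of `nn.LazyLinear` are all assumptions and are not taken from the patent.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, n_convs, pool):
    """n_convs 3x3x3 convolutions followed by one max-pooling layer."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv3d(cin if i == 0 else cout, cout,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool3d(kernel_size=pool, stride=pool))
    return layers

class TemporalFlowNet(nn.Module):
    """7 conv / 5 pool / 2 fc temporal-stream network (channel split assumed)."""
    def __init__(self, num_classes, in_channels=1):
        super().__init__()
        self.features = nn.Sequential(
            *conv_block(in_channels, 64, 1, (1, 2, 2)),   # pool1: no temporal pooling
            *conv_block(64, 128, 1, (2, 2, 2)),
            *conv_block(128, 256, 2, (2, 2, 2)),
            *conv_block(256, 512, 2, (2, 2, 2)),
            *conv_block(512, 512, 1, (1, 2, 2)),          # pool5: no temporal pooling
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):              # x: (batch, 1, 8, 128, 171)
        return self.classifier(self.features(x))
```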
The input to the three-dimensional convolution operation is a data cube of size $c \times l \times h \times w$, where $c$ is the number of channels, $l$ is the depth (number of frames), and $h$ and $w$ are the height and width of a frame, respectively. The corresponding three-dimensional convolution kernel has size $l_n \times h_n \times w_n$, where $l_n$ is the kernel length, $h_n$ is the kernel height and $w_n$ is the kernel width. Similar to the two-dimensional convolution operation, each three-dimensional convolution kernel traverses the whole input cube to obtain one convolved feature sequence, so after one three-dimensional convolutional layer the feature volume finally obtained has size $(l - l_n + 1) \times (h - h_n + 1) \times (w - w_n + 1) \times n$, where $n$ is the number of three-dimensional convolution kernels.
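As a quick numeric check of the output-size formula above (for an unpadded convolution), the short sketch below evaluates it for an 8 × 128 × 171 input and 64 kernels of size 3 × 3 × 3; these counts are example values only.

```python
def conv3d_output_shape(l, h, w, ln, hn, wn, n):
    """Output volume of one valid (unpadded) 3D convolutional layer."""
    return (l - ln + 1, h - hn + 1, w - wn + 1, n)

print(conv3d_output_shape(8, 128, 171, 3, 3, 3, 64))  # -> (6, 126, 169, 64)
```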
In this embodiment, an SGD training strategy is used: 30 segments of video form one group, the initial learning rate is 0.001, the learning rate is reduced by a factor of ten after 10000 iterations, and training is stopped after 50000 iterations. The training strategy is stochastic gradient descent (SGD):
$$\theta \leftarrow \theta + \alpha\left(y^{(j)} - h_\theta\big(x^{(j)}\big)\right)x^{(j)}$$

where $\alpha$ is the learning rate, $h_\theta$ is the prediction function, and only one sample $j$ is selected to compute the gradient and adjust the parameter $\theta$.
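A minimal numerical illustration of this single-sample update is sketched below, assuming a linear prediction function h_theta(x) = theta·x (the patent does not specify the form of h).

```python
import numpy as np

def sgd_step(theta, x_j, y_j, alpha=0.001):
    """One stochastic gradient descent update using a single sample (x_j, y_j)."""
    residual = y_j - theta @ x_j          # y_j - h_theta(x_j)
    return theta + alpha * residual * x_j

theta = sgd_step(np.zeros(3), np.array([1.0, 2.0, 3.0]), y_j=1.0)
```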
The loss function is chosen to be the softmax loss, a commonly used loss function that in fact combines the multinomial logistic loss with softmax. Given a training set $(x_i, y_i)$, where $x_i$ is the $i$-th training sample and $y_i$ is the corresponding label, softmax is used to classify $x_i$, and the prediction of belonging to the $j$-th class is

$$p_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$$

where $z_j$ is the corresponding output after the activation layer. Softmax converts the predicted values into non-negative values and normalizes them to obtain the class probabilities, which are then used to compute the multinomial logistic loss, namely the softmax loss:

$$L = -\log p_{y_i}$$
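The short sketch below evaluates these two formulas numerically; the logits are arbitrary example values.

```python
import numpy as np

def softmax_loss(z, y):
    """Softmax class probabilities p_j and the multinomial logistic (softmax) loss."""
    z = z - z.max()                      # subtract the max for numerical stability
    p = np.exp(z) / np.exp(z).sum()      # p_j = e^{z_j} / sum_k e^{z_k}
    return p, -np.log(p[y])              # loss = -log p_y

probs, loss = softmax_loss(np.array([2.0, 1.0, 0.1]), y=0)
```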
a regularization method is adopted in the full-connection layer to prevent overfitting, and a dropout method is adopted in the embodiment. Specifically, the output after dropout is:
y=r*a(WTx)
where x is [ x ]1,x2,...,xn]Is the input of the full connection layer W ∈ Rn×dIs a weight matrix and r is a d-dimensional binary vector whose elements are the bernoulli distribution obeying p. Dropout causes a portion of the neurons to stop by subsequently deleting the activation values of a portion of the neurons. By doing so, the output of the network can be prevented from being excessively dependent on any one or a group of neurons, because the randomness of dropouts can lead each neuron not to be activated in the network of one dropout every time, and weight updates which are dominant by depending on interaction of a part of neurons are eliminated, so that the network can be forced to be more accurate and more generalized in the absence of certain information.
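A minimal numpy illustration of this dropout formulation follows, assuming a ReLU activation for a(·); shapes and values are placeholders.

```python
import numpy as np

def fc_with_dropout(x, W, p=0.5, rng=np.random.default_rng(0)):
    """y = r * a(W^T x): an elementwise Bernoulli(p) mask on the layer's activations."""
    a = np.maximum(0.0, W.T @ x)               # a(W^T x) with ReLU as the activation
    r = rng.binomial(1, p, size=a.shape)       # r_i ~ Bernoulli(p)
    return r * a

y = fc_with_dropout(np.ones(8), np.ones((8, 4)))
```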
4. Testing the classification effect of the network on the visible light motion video optical flow sequence: the result of the trained network is tested with visible light video optical flow sequences that are different from, but identically distributed with, the training set, and the network is adjusted to obtain the pre-trained network parameters.
Further, as shown in fig. 5, the step of acquiring the spatial motion information of the visible light motion video data within the specified time includes:
dividing a single video sequence in the visible light data set into short-time continuous video frames as the input for training the three-dimensional convolutional neural network parameters, and testing the classification effect of the corresponding network on visible light motion videos. In this embodiment: 1. a single video sequence in the visible light data set is divided into short-time continuous video frames. Specifically, with the visible light video as the pre-training input, only short video frames can be used as input because of the forgetting problem of convolutional neural networks. A suitable input video frame length t is selected, and the visible light video is split into units of t frames each, with all units from the same video sharing the same label. In this embodiment t is taken to be 8, so every 8 frames form one input unit.
2. The short video frames from the previous step are taken as input, and the three-dimensional convolutional neural network parameters are trained, as shown in fig. 6. Specifically, the network uses a three-dimensional convolutional structure and resizes all input data to 128 × 171, which is about half the video frame size of the data set used; 128 × 171 regions are randomly cropped from the original video frames. The network contains 8 convolutional layers, 6 max-pooling layers and 2 fully-connected layers. It differs from the optical flow network model structure because the original video contains a large amount of spatial information together with a large amount of noise, so a convolutional layer containing 1024 neurons is added for learning, filtering out the noise interference to obtain higher-level semantic features; and, so that too much information is not filtered away by the pooling layers and more information is retained up to the last convolutional layer, the fifth and sixth pooling layers pool only in space. The convolution kernels of the 8 convolutional layers are all 3 × 3 with sliding stride 1 × 1. The numbers of kernels of the convolutional layers are set to 64, 128, 256, 512 and 1024 in sequence.
In this embodiment, the activation function in the network selects a RELU function:
f(x)=max(0,x)
specifically, the unified convolution kernel size is also 3 × 3, a large amount of spatial information exists in an original video, and meanwhile, a large amount of noise is added, so that a deeper network structure is adopted, the last layer of convolution layer is a convolution layer containing 1024 neurons for learning, noise interference is filtered, and higher semantic features are obtained.
In this embodiment the network also adopts three-dimensional convolution and uses the SGD training strategy, with 30 segments of video as one group, an initial learning rate of 0.001 that is multiplied by 0.1 after 10000 iterations, and training stopped after 50000 iterations. The curve of the loss decreasing as the iterations increase during training is shown in fig. 10.
The loss function is still the softmax loss function; a regularization method is adopted in the fully-connected layers to prevent overfitting, and the dropout method is still adopted in this embodiment;
3. and testing the classification effect of the network on the visible light motion video. And testing the result of the training network by adopting a visible light video behavior which is different from the training set and is distributed with the training set, and adjusting the network to obtain the pre-training network parameters.
More specifically, as shown in fig. 7, the acquisition of the temporal motion information corresponding to the optical flow features of the infrared motion video in the specified time by the migration training includes the following steps:
extracting the optical flow features between front and back frames of a single video sequence in the infrared data set to obtain an infrared optical flow frame sequence, dividing the infrared optical flow frame sequence into short-time continuous optical flow video frames as the input, performing transfer learning according to the visible light network training parameters to obtain the infrared three-dimensional convolutional neural network parameters, and finally training an svm classifier with the features learned by the network to test the classification effect on the infrared motion video optical flow sequence. In this embodiment: 1. the optical flow results in the horizontal direction and the vertical direction are respectively extracted from the infrared original video, recording the moving direction and speed of the moving target pixels; then a new optical flow motion video is synthesized by alternately combining one frame of horizontal optical flow with one frame of vertical optical flow.
2. The infrared optical flow frame sequence is divided into short-time continuous optical flow video frames. Here too, only short video frames can be used as input because of the forgetting problem of convolutional neural networks. A suitable input video frame length t is selected, and the video is split into units of t frames each, with all units from the same video sharing the same label. In this embodiment t is taken to be 8, so every 8 frames form one input unit. Convolutional neural networks, and in particular the fully-connected layers, require input data of uniform size; in this embodiment the input frame size is 128 × 171, 8 consecutive frames form one input unit, and the optical flow video is divided into 8-frame input units.
3. The short optical flow video frames from the previous step are taken as input, transfer learning is carried out on the basis of the visible light network training parameters, and the infrared three-dimensional convolutional neural network parameters are trained. Specifically, a neural network depends on data and requires a large amount of training data; when the training data set is small, the curse of dimensionality easily arises, neurons die, and the neural network falls into overfitting. Therefore, starting from the parameters pre-trained by the visible light video time network, a transfer learning strategy is adopted to solve the problem that a small-scale data set easily overfits when trained from scratch. The content similarity and the scale difference of the two cross-media data sets fit the conditions and ideas of transfer learning, so a strategy of pre-training on the visible light data set and fine-tuning on the infrared data set can be adopted. The network structure and the related functions are unchanged; on the basis of the visible light optical flow network training, the network is retrained with the optical flow feature data of the infrared video, taking 20 segments of video as one group, with an initial learning rate of 0.001 that is reduced by a factor of ten after 10000 iterations, and training is stopped after 50000 iterations.
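A hedged fine-tuning sketch of this step is shown below, reusing the TemporalFlowNet class from the earlier sketch: the visible-light pre-trained weights are loaded and the network is retrained on infrared optical-flow units in groups of 20, with the 0.001 initial learning rate, the tenfold decay after 10000 iterations, and the 50000-iteration stop described above. The checkpoint file name, the class count, and the random stand-in data are placeholders, not details from the patent.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# stand-in data: 40 random flow units of shape (1, 8, 128, 171) with 12 classes
infrared_flow_loader = DataLoader(
    TensorDataset(torch.randn(40, 1, 8, 128, 171), torch.randint(0, 12, (40,))),
    batch_size=20)

model = TemporalFlowNet(num_classes=12)                     # class count is a placeholder
model.load_state_dict(torch.load("visible_flow_net.pth"))   # assumed visible-light checkpoint
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.1)
criterion = nn.CrossEntropyLoss()

step = 0
while step < 50000:
    for clips, labels in infrared_flow_loader:              # groups of 20 clips
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()                                    # lr x 0.1 every 10000 steps
        step += 1
        if step >= 50000:
            break
```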
4. Testing the classification effect of the network: an svm classifier is trained with the features learned by the network, and the classification effect on the infrared motion video optical flow sequence is tested. Specifically, infrared optical flow video behaviors that are different from, but identically distributed with, the training set are used to test the result of the trained network, and the network is adjusted to obtain the pre-trained network parameters. The features learned by the network are extracted to train the svm classifier.
Further, as shown in fig. 8, the acquisition of the spatial motion information of the infrared motion video in the designated time by the migration training includes:
dividing a single sequence in the infrared data set into short-time continuous video frames as the input, performing transfer learning according to the visible light network training parameters to obtain the infrared three-dimensional convolutional neural network parameters, and finally training an svm classifier with the features learned by the network to test the classification effect on the infrared motion video. It should be explained that, in this embodiment: 1. a single video sequence in the infrared data set is divided into short-time continuous video frames. Specifically, the infrared video is used as input, and because of the forgetting problem of the convolutional neural network, only short video frames can be used as input. A suitable input video frame length t is selected, and the video is split into units of t frames each, with all units from the same video sharing the same label. In this embodiment t is taken to be 8, so every 8 frames form one input unit;
2. The short video frames from the previous step are taken as input, transfer learning is carried out on the basis of the visible light network training parameters, and the infrared three-dimensional convolutional neural network parameters are trained. Specifically, to avoid falling into overfitting, a transfer learning strategy is again adopted on the basis of the parameters pre-trained by the visible light video spatial network, solving the problem that a small-scale data set easily overfits when trained from scratch. The network structure and the related functions are unchanged; on the basis of the previous visible light network training, the network is retrained with the feature data of the infrared video, taking 20 segments of video as one group, with an initial learning rate of 0.001 that is reduced by a factor of ten after 10000 iterations, and training is stopped after 50000 iterations.
3. Testing the classification effect of the network: an svm classifier is trained with the features learned by the network, and the classification effect on the infrared motion video is tested. Specifically, infrared video behaviors that are different from, but identically distributed with, the training set are used to test the result of the trained network, and the network is adjusted to obtain the pre-trained network parameters. The features learned by the network are extracted to train the svm classifier.
Further, as shown in fig. 9, the step of classifying the actions in the original infrared motion video clips comprises:
fusing the features learned by the spatial information network and the time information network that take the infrared video as input, and training an svm classifier with the fused features to perform behavior recognition on the infrared video. It should be explained that, in this embodiment: 1. the features learned by the spatial information network and the time information network with the infrared video as input are fused. Specifically, the output features of the last convolutional layers of the two networks are added directly to obtain the spatio-temporal information representing the behaviors in the infrared video sequence;
2. An svm classifier is trained with the fused features, and behavior recognition is performed on the infrared video. Specifically, the features obtained in the previous step are used to train the svm classifier, realizing the recognition and classification of the behavior categories in the infrared video sequence.
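The sketch below illustrates this fusion-and-classification step: per-clip features from the last convolutional layers of the two streams are added elementwise and then used to train an svm classifier. scikit-learn's LinearSVC is an assumption (the patent only says an svm classifier is trained), and the feature arrays are random placeholders standing in for features extracted from the two networks.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fuse(spatial_feats, temporal_feats):
    """Elementwise addition of last-conv-layer features from the two streams."""
    return spatial_feats + temporal_feats

# placeholder feature arrays: (num_clips, feature_dim) per stream, plus labels
rng = np.random.default_rng(0)
train_spatial = rng.normal(size=(100, 512))
train_temporal = rng.normal(size=(100, 512))
train_labels = rng.integers(0, 12, size=100)

clf = LinearSVC()
clf.fit(fuse(train_spatial, train_temporal), train_labels)
pred = clf.predict(fuse(train_spatial, train_temporal))   # predicted behavior categories
```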
Fig. 10 shows an infrared video sequence behavior recognition apparatus according to a second embodiment of the present invention, which is an improvement on the first embodiment. As shown in fig. 10, the infrared video sequence behavior recognition apparatus of this embodiment includes an infrared video acquisition module, a basic network parameter training module, an infrared network parameter migration training module, and a feature extraction and fusion module;
the infrared video acquisition module is used for acquiring an infrared video motion sequence;
the basic network parameter training module is used for training initial parameters of a time information network and a space information network by means of a visible light data set;
the infrared network parameter migration training module is used for training the infrared time information network and spatial information network parameters on the basis of the visible light video pre-training parameters obtained by the basic network parameter training module;
and the feature extraction and fusion module is used for fusing the features extracted by the infrared time and spatial information networks, so as to train the svm classifier and realize the classification of behaviors in the infrared motion sequence.
Further, the basic network parameter training module comprises a visible light video data preprocessing sub-module, a time information network parameter pre-training sub-module and a spatial information network parameter pre-training sub-module;
the visible light video data preprocessing submodule is used for preprocessing the visible light video data according to the characteristics of the three-dimensional convolution network, and is used for selecting the video frame size and the continuous input frame number and obtaining the optical flow so as to obtain the input data suitable for network training;
the time information network parameter pre-training submodule is used for obtaining the initial parameters of the time information network by training a three-dimensional convolutional neural network with the visible light video optical flow features as input;
and the spatial information network parameter pre-training submodule is used for inputting and training a three-dimensional convolutional neural network by using a visible light original video to obtain initial parameters of the spatial information network.
Further, the infrared network parameter migration training module comprises an infrared video data preprocessing submodule, a time information network parameter pre-training submodule and a space information network parameter pre-training submodule;
the infrared video data preprocessing submodule is used for preprocessing infrared video data according to the characteristics of the three-dimensional convolution network, and is used for selecting the video frame size, the continuous input frame number and obtaining the optical flow so as to obtain input data suitable for network training;
the time information network parameter pre-training submodule is used for inputting infrared video optical flow characteristics, training a three-dimensional convolution neural network on the basis of the parameters obtained by the basic network parameter training module, and obtaining the parameters of the time information network;
and the spatial information network parameter pre-training submodule is used for inputting an infrared original video, training a three-dimensional convolutional neural network on the basis of the parameters obtained by the basic network parameter training module, and obtaining the parameters of the spatial information network.
Furthermore, the feature extraction and fusion module comprises a feature extraction submodule and an svm classification submodule;
the characteristic extraction submodule is used for extracting the characteristics of the infrared video to be classified, which are obtained by the infrared time information network and the space information network;
and the svm classification submodule is used for training an svm classifier by means of the obtained features to realize the classification of behaviors in the infrared video sequence.
To sum up, the infrared video sequence behavior recognition method provided by the embodiments of the invention first trains the network parameters with a publicly available visible light video data set, taking optical flow frames as input to acquire the time information of the action and original video frames as input to acquire the spatial information of the action. The infrared video networks are then trained on the basis of these network training parameters, which addresses the problems that infrared video data samples are scarce and that network training is difficult and prone to overfitting; likewise, optical flow frames are taken as input to acquire the time information of the infrared video action, and original video frames are taken as input to acquire its spatial information. The features obtained by the two networks are fused, and an svm classifier is trained to realize classification, so that the time information and the spatial information of the action are considered simultaneously and the classification effect is improved. In conclusion, the invention can recognize video sequence behaviors under infrared conditions and can obtain a better infrared video sequence behavior classification effect.
In order to implement the infrared video sequence behavior identification method based on three-dimensional convolution and a dual-stream network, some electronic devices are used in cooperation. As shown in fig. 2, the electronic device comprises an infrared video acquisition device, a memory, a memory controller, a processor, a peripheral interface, an input/output unit and a display unit.
The memory, the memory controller, the processor, the peripheral interface, the input-output unit, and the display unit are electrically connected to each other directly or indirectly to enable data transmission or interaction. The units may be electrically connected to each other by means of one or more communication buses or signal lines, for example. The means for infrared video acquisition includes a module that can interact with a memory. The processor is adapted to execute an executable module stored in the memory, such as a software function module or a computer program comprised by the behavior recognizing apparatus.
The memory may be, but is not limited to, a random access memory, a read only memory, a programmable read only memory, an erasable read only memory, and an electrically erasable read only memory. The memory is used for storing a program, and the processor executes the program after receiving an execution instruction, and the method executed by the server defined by the flow process disclosed in any of the foregoing embodiments of the present invention may be applied to or implemented by the processor.
The processor may be an integrated circuit chip having signal processing capabilities. The processor can be a general processor, including a central processing unit, a network processor, etc.; but may also be a digital signal processor, an application specific integrated circuit, an off-the-shelf programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components. The various methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The peripheral interface couples various input/output devices to the processor as well as to the memory. In some embodiments, the peripheral interface, the processor, and the memory controller may be implemented in a single chip. In other examples, they may each be implemented as separate chips.
The input and output unit is used for providing input data for a user to realize the interaction between the user and the electronic equipment. The input/output unit may be, but is not limited to, a mouse, a keyboard, and the like.
The display unit provides an interactive interface (e.g., a user interface) between the electronic device and a user, or is used to display image data for the user's reference. The display unit may be a liquid crystal display or a touch display. In the case of a touch display, it can be a capacitive or resistive touch screen supporting single-point and multi-point touch operations. Supporting single-point and multi-point touch operations means that the touch display can sense touch operations generated simultaneously at one or more positions on the touch display, and the sensed touch operations are sent to the processor for calculation and processing.
It will be appreciated that the configuration shown in fig. 2 is merely illustrative and that the electronic device may include more or fewer components than shown in fig. 2 or may have a different configuration than shown in fig. 2. The components shown in fig. 2 may be implemented in hardware, software, or a combination thereof. In this embodiment, the electronic device may be a computer, a server, or other device with image processing capabilities.
The embodiment of the invention is used for carrying out behavior identification through the infrared video with continuous multi-frame images containing behavior actions or other multi-frame images with chronological order, and the video or other multi-frame images with chronological order are obtained by shooting the target behaviors. The photographing may be performed by an image acquisition device such as a camera or a video camera having an infrared function.
In the present invention, unless otherwise explicitly specified or limited, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact through an intermediate medium.
Also, a first feature "on", "over" or "above" a second feature may mean that the first feature is directly or obliquely above the second feature, or simply that the first feature is at a higher level than the second feature. A first feature "under", "below" or "beneath" a second feature may mean that the first feature is directly or obliquely below the second feature, or simply that the first feature is at a lower level than the second feature.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example" or "some examples," or the like, means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. An infrared video sequence behavior identification method is characterized by comprising the following steps:
acquiring spatial motion information of a visible light motion video data set and time motion information corresponding to optical flow characteristics in specified time, and acquiring spatial motion information of an infrared motion video and time motion information corresponding to optical flow characteristics in specified time by migration training; and integrating all the spatial motion information and the time motion information, and classifying the motion of the original infrared motion video clip.
2. The method for recognizing behaviors of infrared video sequences according to claim 1, wherein the step of acquiring the time motion information corresponding to the optical flow features of the visible light motion video data within the specified time is as follows:
extracting optical flow characteristics between front and back frames of a single video sequence in a visible light data set to capture a visible light optical flow frame sequence, dividing the visible light optical flow frame sequence into short-time continuous optical flow video frames serving as input ends of training three-dimensional convolution neural network parameters, and testing the classification effect of a corresponding network on the visible light motion video optical flow sequence.
3. The method for recognizing the behaviors of the infrared video sequence according to claim 1, wherein the step of acquiring the spatial motion information of the visible light motion video data within a specified time comprises the steps of:
and dividing a single video sequence concentrated by visible light into short-time continuous video frames as input ends for training three-dimensional convolutional neural network parameters, and testing the classification effect of the corresponding network on the visible light motion video.
4. The infrared video sequence behavior identification method according to claim 1, wherein acquiring, by transfer training, the temporal motion information corresponding to the optical flow features of the infrared motion video within the specified time comprises:
extracting optical flow features between adjacent frames of each video sequence in the infrared data set to obtain an infrared optical flow frame sequence, dividing the infrared optical flow frame sequence into short, temporally continuous optical flow video frames that serve as the input, performing transfer learning from the visible light network training parameters to obtain the parameters of the infrared three-dimensional convolutional neural network, and finally training an SVM classifier with the features learned by the network to test the classification performance on the infrared motion video optical flow sequences.
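The transfer training and SVM stage of claim 4 may look roughly as follows. The sketch reuses the hypothetical Simple3DConvNet from the sketch after claim 1, substitutes dummy tensors for the infrared optical flow clips and labels, and assumes a linear-kernel SVM; none of these choices are dictated by the claim.

# Sketch: initialise the infrared flow network from visible-light weights, fine-tune, then fit an SVM.
import torch
from sklearn.svm import SVC

ir_flow_clips = torch.randn(8, 2, 16, 112, 112)                 # stand-ins for infrared optical flow clips
ir_labels = torch.tensor([0, 1, 2, 3, 0, 1, 2, 3])              # stand-ins for action labels

visible_model = Simple3DConvNet(num_classes=10, in_channels=2)  # plays the role of the pre-trained visible-light flow network
ir_model = Simple3DConvNet(num_classes=10, in_channels=2)
ir_model.load_state_dict(visible_model.state_dict())            # transfer: start from the visible-light parameters

optimizer = torch.optim.SGD(ir_model.parameters(), lr=1e-3, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()
for _ in range(2):                                               # brief fine-tuning on the infrared flow clips
    logits, _ = ir_model(ir_flow_clips)
    loss = criterion(logits, ir_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The fine-tuned network is then used only as a feature extractor for the SVM classifier.
ir_model.eval()
with torch.no_grad():
    _, feats = ir_model(ir_flow_clips)
svm = SVC(kernel="linear").fit(feats.numpy(), ir_labels.numpy())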
5. The infrared video sequence behavior identification method according to claim 1, wherein acquiring, by transfer training, the spatial motion information of the infrared motion video within the specified time comprises:
dividing each video sequence in the infrared data set into short, temporally continuous video frames that serve as the input, performing transfer learning from the visible light network training parameters to obtain the parameters of the infrared three-dimensional convolutional neural network, and finally training an SVM classifier with the features learned by the network to test the classification performance on the infrared motion videos.
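Claim 5 mirrors claim 4 with raw infrared clips in place of optical flow. A common refinement during such transfer training, assumed here and not stated in the claim, is to freeze the earliest convolutional layer so that only the later layers adapt to the infrared data.

# Sketch: freeze the first convolutional block of the transferred spatial network.
import torch

spatial_model = Simple3DConvNet(num_classes=10, in_channels=3)  # assumed to hold the transferred visible-light weights
for name, param in spatial_model.features.named_parameters():
    if name.startswith("0."):                                    # index 0 is the first Conv3d in the Sequential
        param.requires_grad = False
trainable = [p for p in spatial_model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)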
6. The infrared video sequence behavior identification method according to claim 1, wherein classifying the actions in the original infrared motion video clips comprises:
fusing the features learned by the spatial information network and the temporal information network with the infrared video as input; and training an SVM classifier with the fused features to perform behavior recognition on the infrared video.
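The fusion step of claim 6 amounts to concatenating the clip features of the two streams and training a classifier on the result. The 128-dimensional per-stream features, the random stand-in data, and the linear-kernel SVM below are all assumptions.

# Sketch: concatenate spatial and temporal clip features, then train the behaviour SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
spatial_feats = rng.normal(size=(40, 128))     # stand-ins for features from the spatial information network
temporal_feats = rng.normal(size=(40, 128))    # stand-ins for features from the optical flow (temporal) network
labels = np.repeat(np.arange(4), 10)           # four dummy behaviour classes

fused = np.concatenate([spatial_feats, temporal_feats], axis=1)   # simple concatenation fusion
svm = SVC(kernel="linear").fit(fused, labels)
print(svm.predict(fused[:5]))                  # behaviour labels for the first five infrared clips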
7. An infrared video sequence behavior identification device is characterized by comprising an infrared video acquisition module, a basic network parameter training module, an infrared network parameter transfer training module, and a feature extraction and fusion module;
the infrared video acquisition module is used for acquiring an infrared video motion sequence;
the basic network parameter training module is used for training the initial parameters of a temporal information network and a spatial information network by means of a visible light data set;
the infrared network parameter transfer training module is used for training the infrared temporal information network parameters and spatial information network parameters based on the visible light video pre-training parameters obtained by the basic network parameter training module;
and the feature extraction and fusion module is used for fusing the features extracted by the infrared temporal and spatial information networks, so as to train an SVM classifier that classifies the behaviors in the infrared motion sequence.
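As a purely illustrative reading of claim 7, the four modules can be pictured as cooperating stages of one pipeline. The class and method names below are hypothetical and only indicate how data would flow between the modules; they are not an interface defined by the patent.

# Hypothetical module layout for the device of claim 7.
class InfraredVideoAcquisitionModule:
    def acquire(self, source):
        """Return an infrared video motion sequence (e.g. an array of frames)."""

class BasicNetworkParameterTrainingModule:
    def pretrain_on_visible(self, visible_dataset):
        """Return initial parameters for the temporal and spatial information networks."""

class InfraredNetworkParameterTransferTrainingModule:
    def transfer_train(self, pretrained_params, infrared_dataset):
        """Fine-tune both streams on infrared data, starting from the visible-light parameters."""

class FeatureExtractionFusionModule:
    def fuse_and_classify(self, temporal_net, spatial_net, infrared_clips):
        """Fuse features from both networks and apply the trained SVM classifier."""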
8. The infrared video sequence behavior identification device according to claim 7, wherein the basic network parameter training module comprises a visible light video data preprocessing submodule, a temporal information network parameter pre-training submodule, and a spatial information network parameter pre-training submodule;
the visible light video data preprocessing submodule is used for preprocessing the visible light video data according to the characteristics of the three-dimensional convolutional network, namely selecting the video frame size and the number of continuous input frames and computing the optical flow, so as to obtain input data suitable for network training;
the temporal information network parameter pre-training submodule is used for training a three-dimensional convolutional neural network with the visible light video optical flow features as input, to obtain the initial parameters of the temporal information network;
and the spatial information network parameter pre-training submodule is used for training a three-dimensional convolutional neural network with the original visible light video as input, to obtain the initial parameters of the spatial information network.
9. The infrared video sequence behavior identification device according to claim 7, wherein the infrared network parameter transfer training module comprises an infrared video data preprocessing submodule, a temporal information network parameter training submodule, and a spatial information network parameter training submodule;
the infrared video data preprocessing submodule is used for preprocessing the infrared video data according to the characteristics of the three-dimensional convolutional network, namely selecting the video frame size and the number of continuous input frames and computing the optical flow, so as to obtain input data suitable for network training;
the temporal information network parameter training submodule is used for training a three-dimensional convolutional neural network with the infrared video optical flow features as input, on the basis of the parameters obtained by the basic network parameter training module, to obtain the parameters of the temporal information network;
and the spatial information network parameter training submodule is used for training a three-dimensional convolutional neural network with the original infrared video as input, on the basis of the parameters obtained by the basic network parameter training module, to obtain the parameters of the spatial information network.
10. The infrared video sequence behavior identification device according to claim 7, wherein the feature extraction and fusion module comprises a feature extraction submodule and an SVM classification submodule;
the feature extraction submodule is used for extracting the features of the infrared video to be classified obtained by the infrared temporal information network and the spatial information network;
and the SVM classification submodule is used for training an SVM classifier with the obtained features, so as to classify the behaviors in the infrared video sequence.
CN202010099461.4A 2020-02-18 2020-02-18 Infrared video sequence behavior identification method and device Pending CN111325140A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010099461.4A CN111325140A (en) 2020-02-18 2020-02-18 Infrared video sequence behavior identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010099461.4A CN111325140A (en) 2020-02-18 2020-02-18 Infrared video sequence behavior identification method and device

Publications (1)

Publication Number Publication Date
CN111325140A true CN111325140A (en) 2020-06-23

Family

ID=71172951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010099461.4A Pending CN111325140A (en) 2020-02-18 2020-02-18 Infrared video sequence behavior identification method and device

Country Status (1)

Country Link
CN (1) CN111325140A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109829443A (en) * 2019-02-23 2019-05-31 重庆邮电大学 Video behavior recognition methods based on image enhancement Yu 3D convolutional neural networks
CN110175596A (en) * 2019-06-04 2019-08-27 重庆邮电大学 The micro- Expression Recognition of collaborative virtual learning environment and exchange method based on double-current convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨汝名: "Research on behavior recognition methods for human moving targets in apron scenes under low-light conditions", China Master's Theses Full-text Database, Engineering Science and Technology II *

Similar Documents

Publication Publication Date Title
CN108229338B (en) Video behavior identification method based on deep convolution characteristics
CN112396002B SE-YOLOv3-based lightweight remote sensing target detection method
CN109446923B (en) Deep supervision convolutional neural network behavior recognition method based on training feature fusion
CN109697434B (en) Behavior recognition method and device and storage medium
CN112507898B (en) Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN
CN111126258B (en) Image recognition method and related device
US20180114071A1 (en) Method for analysing media content
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
US10410428B1 (en) Providing technical support in an augmented reality environment
CN112380921A (en) Road detection method based on Internet of vehicles
AU2021354030B2 (en) Processing images using self-attention based neural networks
CN110781980B (en) Training method of target detection model, target detection method and device
CN110555420B (en) Fusion model network and method based on pedestrian regional feature extraction and re-identification
CN111488932B (en) Self-supervision video time-space characterization learning method based on frame rate perception
CN112580458A (en) Facial expression recognition method, device, equipment and storage medium
CN114332911A (en) Head posture detection method and device and computer equipment
CN111985333A (en) Behavior detection method based on graph structure information interaction enhancement and electronic device
Venkatesvara Rao et al. Real-time video object detection and classification using hybrid texture feature extraction
CN117315752A (en) Training method, device, equipment and medium for face emotion recognition network model
CN115082840B (en) Action video classification method and device based on data combination and channel correlation
CN111160124A (en) Depth model customization method based on knowledge reorganization
CN111325140A (en) Infrared video sequence behavior identification method and device
CN114387489A (en) Power equipment identification method and device and terminal equipment
CN108804981B (en) Moving object detection method based on long-time video sequence background modeling frame
Alberry ABNORMAL BEHAVIOR DETECTION IN SURVEILLANCE SYSTEMS USING RESNET50: A TRANSFER LEARNING APPROACH

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination