CN110427796B

CN110427796B - Method for obtaining dynamic texture description model and video abnormal behavior retrieval method

Info

Publication number: CN110427796B
Application number: CN201910379016.0A
Authority: CN
Inventors: 胡兴; 段倩倩; 黄影平; 张亮; 杨海马
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2019-05-08
Filing date: 2019-05-08
Publication date: 2023-06-30
Anticipated expiration: 2039-05-08
Also published as: CN110427796A

Abstract

The invention provides a method for obtaining a dynamic texture description model, which comprises the steps of firstly, giving a pixel point in a video, and defining the pixel point as a dynamic texture description model; the dynamic texture description model comprises an orthogonal vector group model in three directions; the orthogonal vector group model comprises a center vector and a plurality of adjacent vectors surrounding the center vector; the intersection point of the three orthogonal vector group models is the pixel; calculating an included angle and a binarization included angle between the center vector and the adjacent vector, and finally solving the model to obtain a mode value of each orthogonal vector group model in the dynamic texture description model; the pattern values obtained in three orthogonal directions are combined into a three-dimensional vector. According to the invention, each video segment or space-time block is extracted into a TOSCLBP histogram feature, and the TOSCLBP histogram feature can reflect the change information of dynamic textures in the video space-time block in time and space, and is robust to noise, illumination change and other interference in the video. The invention also provides a video abnormal behavior retrieval method.

Description

Method for obtaining dynamic texture description model and video abnormal behavior retrieval method

Technical Field

The invention belongs to the field of video signal feature extraction, and particularly relates to a method for acquiring a dynamic texture description model and a video abnormal behavior retrieval method.

Background

Currently, there are many methods for feature extraction in video sequences, mainly including: artificial and learning features: specifically, 1) learning features refers to obtaining features by optimizing a specific objective function through a machine learning algorithm, and typically has deep learning features. In the deep learning, the features obtained by utilizing the convolutional neural network or the neural network such as the deep self-encoder can extract important information in data to obtain concise features, and the deep learning method has strong generalization capability and strong universality and does not depend on priori knowledge. However, the learning features usually depend on a large number of training samples, so that the calculation amount is large, and the method with higher requirements on real-time performance is not facilitated; 2) Artificial features refer to descriptors that are empirically designed to extract certain specific feature information. Such as motion trajectories, spatiotemporal point of interest features, optical flow histograms, spatiotemporal gradient histograms, tri-orthogonal planar local binary patterns, hybrid dynamic textures, and the like. The artificial features can extract specific information such as track, motion features, appearance features, dynamic texture features and the like quickly and efficiently, and the parameter design does not need training; the artificial features benefit from the mastering of priori knowledge by people, complicated training process is not needed, the calculated amount is low, the real-time performance is good, and specific useful information can be extracted; in the process of extracting the artificial features, the data are firstly converted into specific artificial features to be input into a deep neural network, so that a deep learning model can be simplified to a certain extent, important features can be learned easily, and therefore the artificial features still have important roles in video analysis. However, in the existing video artificial feature extraction, common features such as an optical flow histogram, a gradient histogram, a space-time local binary pattern and the like mostly reflect feature statistical information of a local area, but cannot jointly model dynamic texture information in time and space. Local Binary Pattern (LBP) is a powerful texture feature that is widely used in image texture feature extraction. Two-dimensional LBP can extract texture features of two-dimensional graphics, but can only model static textures, and cannot acquire time feature information in video. The improved three-dimensional LBP, while capable of modeling both spatial and temporal dynamic texture information, has the disadvantage of being sensitive to noise.

A Squirrel cage local binary pattern (Squirrel Cage Local Binary Pattern, SCLBP) is proposed in the document Squirrel-cage Local Binary Pattern and Its Application in Video Anomaly Detection. SCLBP is a brand new structural variant of LBP, can effectively encode motion information in video into an SCLBP mode, and has good robustness and discriminant. SCLBP is applied in video anomaly detection, where the performance tested on multiple common datasets approaches that of current advanced methods. However, SCLBP is only focused on motion information in video, and does not relate to appearance information in video.

Disclosure of Invention

The invention aims to provide a method for acquiring a dynamic texture description model and a video abnormal behavior retrieval method, wherein each video segment or space-time block is extracted with a TOSCLBP histogram feature, the TOSCLBP histogram feature can reflect the change information of dynamic textures in the segment or space-time block in time and space, and has the advantages of robust interference on noise, illumination change and the like in video, simple principle, novel design and quick calculation, and is a powerful feature suitable for video analysis.

In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:

the invention provides a method for acquiring a dynamic texture description model, which comprises the following steps:

step S1: defining the pixel point as a dynamic texture description model; the dynamic texture description model comprises an orthogonal vector group model in three directions; the intersection point of the orthogonal vector group model is the pixel point; the orthogonal vector set model includes a center vector and a plurality of adjacent vectors circumferentially surrounding the center vector; the adjacent vectors are concentric and parallel to each other;

step S2: calculating an included angle between the center vector and the adjacent vector through an inverse cosine function model;

step S3: binarizing the included angle in the step S2 under the action of a threshold value;

step S4: solving the model to obtain a mode value of each orthogonal vector group model in the dynamic texture description model; the pattern values obtained in the three directions are combined into a three-dimensional vector.

Preferably, in step S2, the inverse cosine function model is:

preferably, in step S3, the binarization model is:

preferably, in step S4, the solution model is:

preferably, in step S2, the inverse cosine function model is:

the invention also provides a video abnormal behavior retrieval method, which is characterized by comprising the following steps of:

step 1: dividing the video into video segments or spatio-temporal blocks;

step 2: solving a mode value of the dynamic texture description model corresponding to a pixel point in the video segment or the space-time block by adopting a method for acquiring the dynamic texture description model;

step 3: repeating the step 2 until the mode values of the dynamic texture description model corresponding to all pixel points in the video sequence or the space-time block are obtained, and solving a histogram formed by all the mode values;

step 4: training a dictionary based on the dynamic texture description model by adopting an unsupervised machine learning method;

step 5: and marking the abnormal video data by adopting an online sparse reconstruction algorithm.

Preferably, in step 3, the method for solving the histogram is as follows:

preferably, in step 1, it is first determined whether or not there is an abnormality in each frame of the two-dimensional image; if the two-dimensional image of part of the frame is abnormal, the position of the abnormal behavior is positioned first, and then the video sequence or the space-time block is divided into local blocks.

Compared with the prior art, the invention has the advantages that: by temporally and spatially describing the behavior in a video sequence, a dynamic texture description model (tosbbp mode) is extracted for each pixel in the video, which can contain important features of varying similarity of three orthogonal vectors intersecting the pixel point in the video sequence and their surrounding vectors. TOSCLBP is robust to noise, illumination change and other interference in a scene, is simple to realize, is fast in calculation, and can describe target behaviors in a video efficiently.

Drawings

FIG. 1 is a flow chart of a method for obtaining a dynamic texture description model according to an embodiment of the present invention;

FIGS. 2 (a) - (d) are schematic diagrams of the TOSCLBP model of FIG. 1 and its SCLBP models in three directions;

FIGS. 3 (b) - (d) are diagrams illustrating the TOSCLBP model of FIG. 2 and its SCLBP model in three directions;

fig. 4 (a) - (b) are schematic diagrams of video anomaly retrieval methods corresponding to global anomaly detection and local anomaly detection, respectively;

fig. 5 (a) to (b) are graphs of detection effects corresponding to global abnormality detection and local abnormality detection.

Detailed Description

The method for obtaining a dynamic texture description model and the video anomaly behavior retrieval method of the present invention will be described in more detail with reference to the accompanying schematic drawings, in which preferred embodiments of the present invention are shown, and it should be understood that those skilled in the art can modify the present invention described herein while still achieving the advantageous effects of the present invention. Accordingly, the following description is to be construed as broadly known to those skilled in the art and not as limiting the invention.

As shown in fig. 1, the present embodiment proposes a method for obtaining a dynamic texture description model, which includes the following steps S1 to S4, specifically as follows:

step S1: defining a pixel point in a given video as a dynamic texture description model; the dynamic texture description model comprises an orthogonal vector group model in three directions; the intersection point of the orthogonal vector group model is the pixel point; the orthogonal vector group model comprises a center vector and a plurality of adjacent vectors which surround the center vector in a circular shape; the adjacent vectors are concentric and parallel to each other; specifically, as shown in fig. 2 (a) to (d) and fig. 3 (b) to (d), in the video three-dimensional space, for each pixel, toskbp describes a dynamic texture based on three orthogonal directions (XYT, XTY, and YTX), respectively. SCLBP is constructed to mimic the cage components in a cage motor. Wherein, the basic structure center vector v of each orthogonal vector group SCLBP model of TOSCLBP _c And P parallel adjacent vectors [ v ] around the periphery thereof ₁ ，v ₂ ，...，v _P ]The method comprises the steps of carrying out a first treatment on the surface of the The SCLBPs in the three directions are orthogonal to each other. Specifically, SCLBP in XYT direction extracts motion information in video; the change information of vertical textures in the video is extracted in the XTY direction; the change information of the horizontal texture in the video is extracted in the YTX direction.

Step S2: in the orthometric vector set model SCLBP in each direction, the center vector v can be calculated by an inverse cosine function due to the angle between the two vectors _c And the surrounding vector v _p An included angle between the two; the inverse cosine function model is as follows:

in this embodiment, in order to reduce the influence of illumination, the cosine similarity may be replaced by pearson correlation, and the center vector v may be calculated by an inverse cosine function _c And the surrounding vector v _p The included angle between the two is further optimized as the inverse cosine function model:

wherein, the liquid crystal display device comprises a liquid crystal display device,

is the mean of the vectors. The pearson correlation is robust to the effects of translation in vectors due to interference such as illumination.

Step S3: binarizing the included angle in the step S2 under the action of a threshold value; i.e. between each adjacent vector and the center vector at a threshold ρ is binarized as:

step S4: solving a model to obtain a mode value of each orthogonal vector group model SCLBP in a dynamic texture description model TOSCLBP, and forming a three-dimensional vector by the mode values obtained in three orthogonal directions; the vector may be used to describe the dynamic texture of the local surround centered around the pixel. Wherein, the solving model is:

wherein P is the number of surrounding neighboring vectors around the center vector;

the method for acquiring the dynamic texture description model can use various video sequence analysis works, such as video individual action recognition, group activity recognition, video abnormal behavior detection and the like.

As shown in fig. 4 (a) - (b), the present embodiment further provides a video abnormal behavior retrieval method, which adopts the method for obtaining the dynamic texture description model, that is, based on toscbp, and includes the following steps 1-5, specifically as follows:

step 1: if the images of a given video sequence are colored, the color images are first converted to grayscale images. Because the abnormal behavior in the video is divided into global abnormal behavior and local abnormal behavior, when global abnormality is detected, the video is divided into segments, whether each video segment contains abnormality is judged, when local abnormality is detected, the video is required to be divided into local space-time blocks, and whether the video segment contains abnormality is judged; the size of each block is set according to the target size in the video. If the difference of the target sizes in the scene is large due to the reasons of a lens and the like, the video scene can be divided into blocks with different scales, and finally the detection results under the blocks with different scales are integrated. After dividing the video frame into local blocks, each local block is flattened into a vector.

Step 2: solving a mode value of a TOSCLBP of the dynamic texture description model corresponding to pixel points (x, y, t) in the video by adopting a method for acquiring the dynamic texture description model; the mode value of the dynamic texture description model TOSCLBP comprises three mode values of each orthogonal vector group model SCLBP; i.e. the effect of this step is to calculate a toskbp model for each pixel;

step 3: repeating the step 2 until the mode values of a dynamic texture description model TOSCLBP corresponding to all pixel points in the video sequence or the space-time block are obtained, and solving a histogram formed by all the mode values; the histogram forms TOSCLBP characteristics of the whole video sequence or the space-time block, and the histogram is the dynamic texture description of the video sequence or the space-time block; the method for solving the histogram comprises the following steps:

step 4: after extracting the toscbp features of the vector, an unsupervised machine learning method using an online dictionary learning algorithm is used to train the dictionary based on the dynamic texture description model.

Step 5: and marking the abnormal video data by adopting an online sparse reconstruction algorithm. At each moment, the sparse reconstruction cost of the current TOSCLBP feature under the current learned dictionary is calculated. And marking video data corresponding to the TOSCLBP characteristics with the sparse reconstruction cost larger than a specific threshold value as abnormal. The former part of data in the video is typically employed for dictionary learning, and the latter part of data is used to detect abnormal behavior while updating the learned dictionary. The abnormality detection effect is shown in fig. 5 (a) to (b).

In summary, in the method for taking the dynamic texture description model and the video abnormal behavior retrieval method provided by the embodiments of the present invention, based on the toscbp, by performing temporal and spatial description on the dynamic texture in the video sequence, a toscbp histogram feature is extracted for each pixel, where the histogram feature can include joint description of appearance or motion features in the video sequence, and is robust to noise, illumination variation, and other interferences in a scene, the implementation is simple, the computation is fast, and the target behavior in the video can be described efficiently. Further, the dynamic texture description model toskbp provided in this embodiment can effectively process multidimensional or multichannel data, and is fast in calculation and stronger in robustness.

The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Any person skilled in the art will make any equivalent substitution or modification to the technical solution and technical content disclosed in the invention without departing from the scope of the technical solution of the invention, and the technical solution of the invention is not departing from the scope of the invention.

Claims

1. A video abnormal behavior retrieval method adopts a method for acquiring a dynamic texture description model and is characterized by comprising the following steps:

step 1: dividing a video sequence into video segments or spatio-temporal blocks;

step 5: marking abnormal video data by adopting an online sparse reconstruction algorithm;

the method for acquiring the dynamic texture description model comprises the following steps:

step S4: solving the model to obtain a mode value of each orthogonal vector group model in the dynamic texture description model; the mode values obtained in the three directions are formed into a three-dimensional vector;

in step S2, the inverse cosine function model is:

in step S3, the binarization model is:

in step S4, the solution model is:

in step S2, the inverse cosine function model is:

in step 3, the histogram solving method is as follows:

2. the video anomaly retrieval method according to claim 1, wherein in step 1, it is first determined whether there is anomaly in each frame of two-dimensional image; if the two-dimensional image of part of the frame is abnormal, the position of the abnormal behavior is positioned first, and then the video sequence or the space-time block is divided into local blocks.