CN116030390A - Intelligent detection method, device, equipment and storage medium for abnormal behavior in video - Google Patents


Info

Publication number
CN116030390A
Authority
CN
China
Prior art keywords
feature
video
image
abnormal behavior
query
Prior art date
Legal status
Pending
Application number
CN202310003213.9A
Other languages
Chinese (zh)
Inventor
陈昊天
胡兴
高昊江
Current Assignee
Northking Information Technology Co ltd
Original Assignee
Northking Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Northking Information Technology Co ltd filed Critical Northking Information Technology Co ltd
Priority to CN202310003213.9A
Publication of CN116030390A

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an intelligent detection method, device, equipment and storage medium for abnormal behavior in video, and also relates to a training method for an abnormal behavior detection model, wherein a training image set is obtained by carrying out first preprocessing and image frame interception on a first video; inputting the training image set into an initial abnormal behavior detection model; wherein the initial abnormal behavior detection model comprises: the system comprises a first feature extraction network, a second feature extraction network, a feature fusion network, a memory network and a decoder. Performing feature extraction on the training image through a first feature extraction network to obtain first image features; performing target detection and feature extraction on the training image through a second feature extraction network to obtain first object features; the first image features and the first object features are input into a feature fusion network to obtain a first fusion feature map, so that the object features and the image features can be fused, the representation capability of the features is enhanced, and the detection precision is improved.

Description

Intelligent detection method, device, equipment and storage medium for abnormal behavior in video
Technical Field
The present invention relates to the field of computer vision processing technologies, and in particular, to an intelligent detection method, apparatus, device, and storage medium for abnormal behavior in video.
Background
Video monitoring has been widely used in many public places such as military, business, banking and campus as a necessary means for guaranteeing citizen security and preventing crimes. Since abnormal behavior is often associated with suspicious behavior, abnormal behavior detection (Abnormal Action Detection, AAD) is critical to intelligent video surveillance systems.
As a particular class of problems in human motion recognition (Human Action Recognition) in the field of computer vision, abnormal behavior detection presents a number of challenges: 1) The definition of abnormal behavior has ambiguity and is highly dependent on the scene; 2) The occurrence of abnormal behavior is rare, and the method is various and non-enumeratable. The current widely-used abnormal behavior detection method is generally an abnormal behavior detection method for performing unsupervised modeling based on a whole frame of image. However, on one hand, the abnormal behavior patterns of different objects are different, and on the other hand, some abnormal behavior patterns are included in interactions of different objects, and the abnormal behavior detection network adopting the characteristics of the whole frame of image is not high in abnormal behavior detection precision due to the lack of the characteristics of different objects.
Disclosure of Invention
The invention provides an intelligent detection method, device, equipment and storage medium for abnormal behavior in video, which fuse object features with image features to enhance the representation capability of the features, thereby solving the problem of low detection accuracy caused by the inability of image features alone to capture abnormal behaviors contained in interactions between different objects.
According to an aspect of the present invention, there is provided a training method of an abnormal behavior detection model, including:
acquiring a first video, performing first preprocessing and image frame interception on the first video to obtain a training image set;
inputting the training image set into an initial abnormal behavior detection model; wherein the initial abnormal behavior detection model comprises: a first feature extraction network and a second feature extraction network, a feature fusion network, a memory network and a decoder;
performing feature extraction on the training image through the first feature extraction network to obtain first image features; performing target detection and feature extraction on the training image through the second feature extraction network to obtain a first object feature;
inputting the first image features and the first object features into the feature fusion network to obtain a first fusion feature map output by the feature fusion network, and splitting pixel points of the first fusion feature map to obtain a first query feature set;
inputting a first query feature in the first query feature set into the memory network to obtain a first mode feature corresponding to the first query feature; splicing each first query feature and the corresponding first mode feature to obtain a first memory feature map;
Inputting the first memory feature map into the decoder, and decoding the first memory feature map through the decoder to obtain a first predicted image corresponding to the training image;
and calculating a loss function value based on the first predicted image and the training image, and iteratively adjusting network parameters in the initial abnormal behavior detection model based on the loss function value to obtain a target abnormal behavior detection model.
According to another aspect of the present invention, there is provided an intelligent detection method for abnormal behavior in video, including:
acquiring a video to be detected, and performing second preprocessing on the video to be detected to obtain an image set to be detected;
inputting the image set to be detected into a target abnormal behavior detection model obtained by training by adopting the training method of the abnormal behavior detection model in the first embodiment or the second embodiment;
and acquiring an abnormal value of the image set to be detected, which is output by the target abnormal behavior detection model, and determining an abnormal behavior detection result of the video to be detected based on the abnormal value.
According to another aspect of the present invention, there is provided a training apparatus of an abnormal behavior detection model, comprising:
The training image acquisition module is used for acquiring a first video, performing first preprocessing on the first video and intercepting an image frame to obtain a training image set;
the training image input module is used for inputting the training image into an initial abnormal behavior detection model; wherein the initial abnormal behavior detection model comprises: a first feature extraction network and a second feature extraction network, a feature fusion network, a memory network and a decoder;
the first feature extraction module is used for carrying out feature extraction on the training image through the first feature extraction network to obtain first image features; performing target detection and feature extraction on the training image through the second feature extraction network to obtain a first object feature;
the first feature fusion module is used for inputting the first image features and the first object features into the feature fusion network to obtain a first fusion feature map output by the feature fusion network, and splitting pixel points of the fusion features to obtain a first query feature set;
the first memory query module is used for inputting first query features in the first query feature set into the memory network to obtain first mode features corresponding to the first query features; splicing each first query feature and the corresponding first mode feature to obtain a first memory feature map;
The first decoding module is used for inputting the first memory feature map into the decoder, and decoding the first memory feature map through the decoder to obtain a first predicted image corresponding to the training image;
and the parameter adjustment module is used for calculating a loss function value based on the first predicted image and the training image, and carrying out iterative adjustment on network parameters in the initial abnormal behavior detection model based on the loss function value to obtain a target abnormal behavior detection model.
According to another aspect of the present invention, there is provided an intelligent detection apparatus for abnormal behavior in video, including:
the to-be-detected video acquisition module is used for acquiring a third video, and performing second preprocessing on the third video to obtain a to-be-detected video;
the video input module to be detected is used for inputting the video to be detected into the target abnormal behavior detection model obtained by training the abnormal behavior detection model by adopting the training method of the first embodiment or the second embodiment;
the result determining module is used for obtaining the abnormal value of the video to be detected, which is output by the target abnormal behavior detection model, and determining an abnormal behavior detection result of the video to be detected based on the abnormal value.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the training method of the abnormal behavior detection model according to any of the embodiments of the present invention, or the intelligent detection method of abnormal behavior in video.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement the training method of the abnormal behavior detection model according to any one of the embodiments of the present invention or the intelligent detection method of abnormal behavior in video when executed.
According to the technical scheme, a first video is obtained, first preprocessing and image frame interception are carried out on the first video, and a training image set is obtained; inputting the training image set into an initial abnormal behavior detection model; wherein the initial abnormal behavior detection model comprises: the system comprises a first feature extraction network, a second feature extraction network, a feature fusion network, a memory network and a decoder. The training image is subjected to feature extraction through a first feature extraction network to obtain first image features; performing target detection and feature extraction on the training image through a second feature extraction network to obtain first object features; the first image feature and the first object feature are input into the feature fusion network to obtain a first fusion feature map, the object feature and the image feature can be fused, the representation capability of the feature is enhanced, the problem that abnormal behaviors contained in interactions of different kinds of objects cannot be detected by using the image feature singly, and therefore detection accuracy is low is solved, and detection accuracy is improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a training method of an abnormal behavior detection model according to a first embodiment of the present invention;
FIG. 2A is a flowchart of a training method of an abnormal behavior detection model according to a second embodiment of the present invention;
FIG. 2B is a schematic diagram of an abnormal behavior detection model according to a second embodiment of the present invention;
FIG. 3 is a flowchart of a method for intelligently detecting abnormal behavior in video according to a third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a training device for an abnormal behavior detection model according to a fourth embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an intelligent detection device for abnormal behavior in video according to a fifth embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device implementing a training method of an abnormal behavior detection model or an intelligent detection method of abnormal behavior in video according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," "third," "initial," and "target" in the description and claims of the present invention and the above-described drawings are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a training method for an abnormal behavior detection model according to an embodiment of the present invention, where the method may be performed by a training device for an abnormal behavior detection model, the training device for an abnormal behavior detection model may be implemented in hardware and/or software, and the training device for an abnormal behavior detection model may be configured in an electronic device. As shown in fig. 1, the method includes:
s101, acquiring a first video, performing first preprocessing on the first video, and intercepting an image frame to obtain a training image set.
The first video is a video for training an abnormal behavior detection model. The training image set is a set of training images for training the initial abnormal behavior detection model.
Specifically, the obtained first video is subjected to first preprocessing to obtain at least one training video, and each training video is subjected to image frame interception to obtain at least one training image set, wherein each training image set comprises at least one training image.
For example, the first video may be obtained by downloading the monitoring video v as the first video through monitoring system software carried by a camera of an indoor environment (such as an office, an underground parking garage, a hospital, a bank, etc.). It can be appreciated that the monitoring video of other scenes (such as an open parking lot and a campus) can be selected as the first video according to actual requirements.
By way of example, the first preprocessing may include video slicing, denoising, and data cleaning, to which embodiments of the present invention are not limited. In addition, the image frame may be taken in a sampling manner to obtain a training image, and the sampling manner of the image frame according to the embodiment of the present invention is not limited, and may include random length sampling, random interval sampling, or sampling based on a sliding window, for example.
S102, inputting a training image set into an initial abnormal behavior detection model; wherein the initial abnormal behavior detection model comprises: the system comprises a first feature extraction network, a second feature extraction network, a feature fusion network, a memory network and a decoder.
In this embodiment, the initial abnormal behavior detection model refers to an abnormal behavior detection model that is untrained or not yet fully trained. The initial abnormal behavior detection model is used for detecting the behaviors of persons in the training images. The initial abnormal behavior detection model includes: a first feature extraction network, a second feature extraction network, a feature fusion network, a memory network and a decoder.
The first feature extraction network is used for extracting image features of the training images based on the whole frame of images; the second feature extraction network is used for extracting object features of the training images based on different objects; the feature fusion network is used for carrying out feature fusion on the image features and the object features. The memory network is used for providing memory capacity to store and record items of normal behavior patterns for matching the characteristics extracted from the training images; the decoder is used for decoding the feature images obtained through the feature extraction network, the feature fusion network and the memory network features.
In addition, in the embodiment of the invention, the training image input into the initial abnormal behavior detection model does not need to be labeled with label data. An initial abnormal behavior detection model is trained by an unsupervised method, additional data annotation information is not needed, the additional data annotation requirement is reduced, and the labor cost is lower; in addition, unstable detection noise is not introduced, and the method can be applied to any monitoring scene, so that quick deployment is realized, and the universality is stronger.
S103, carrying out feature extraction on the training image through a first feature extraction network to obtain first image features; and performing target detection and feature extraction on the training image through a second feature extraction network to obtain the first object feature.
The first image features are features of an image level extracted by the training image through a first feature extraction network, and the first object features are features of an object level extracted by the training image through a second feature extraction network; the first image feature and the first object feature may each be represented in the form of a feature map.
Specifically, the training images are respectively input into the first feature extraction network and the second feature extraction network; the first feature extraction network performs whole-frame feature extraction on the training image to obtain the first image features, and the second feature extraction network performs feature extraction based on the objects contained in the image to obtain the first object features. It is understood that the feature extraction processes of the first feature extraction network and the second feature extraction network are independent of each other and may be performed simultaneously, so as to save training time.
In the present embodiment, the first feature extraction network used to extract the first image features may be an existing feature extraction network, for example a three-dimensional convolutional neural network (3D CNN), as long as it can extract first image features carrying semantic information; the embodiment of the present invention is not limited in this respect. Similarly, the feature extraction module in the second feature extraction network may also use an existing feature extraction network, which may be the same as or different from the first feature extraction network. In addition, since the second feature extraction network needs to extract features of the objects in the training image, it must at least include a target detector, placed before its feature extraction module, to detect the objects in the training image.
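For illustration only, a minimal 3D-CNN encoder of the kind that could play the role of the first feature extraction network is sketched below in PyTorch; the layer sizes and the temporal pooling are assumptions, as the embodiment does not fix a concrete architecture.

```python
import torch
import torch.nn as nn

class Encoder3D(nn.Module):
    """Toy 3D-CNN encoder: maps n stacked RGB frames to a spatial feature map."""
    def __init__(self, in_ch=3, feat_ch=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(64, feat_ch, kernel_size=3, stride=(2, 2, 2), padding=1),
        )

    def forward(self, x):            # x: (B, 3, n, H, W), n frames of one image set
        f = self.net(x)              # (B, feat_ch, n', H/8, W/8)
        return f.mean(dim=2)         # pool the temporal axis -> (B, feat_ch, H/8, W/8)

# usage sketch: 4 input frames of size 256x256
# F_RGB = Encoder3D()(torch.randn(1, 3, 4, 256, 256))  # -> (1, 128, 32, 32)
```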
S104, inputting the first image features and the first object features into a feature fusion network to obtain a first fusion feature map output by the feature fusion network, and splitting pixels of the first fusion feature map to obtain a first query feature set.
The first fusion feature map is a feature obtained by fusing the first image feature and the first object feature. The first query feature set is a set of features of each pixel in the fused feature as one first query feature.
Specifically, the first image feature and the first object feature are input into the feature fusion network, and the feature fusion network performs feature fusion on the first image feature and the first object feature to obtain the first fusion feature map F_1. Each pixel point of F_1 is split into a first query feature q_j, forming the first query feature set Q_1 = {q_j}, j=1,…,J, where J represents the number of pixels in the first fusion feature map F_1, i.e. the total number of first query features in the first query feature set.
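The pixel-point split can be illustrated with a short sketch; the (B, C, H, W) tensor layout and the function name are assumptions, not part of the original.

```python
import torch

def split_into_queries(fused):
    """Split each pixel of the fusion feature map F_1 into one query feature q_j.

    fused: (B, C, H, W) fusion feature map.
    Returns (B, J, C) with J = H * W query features, each a C-dimensional vector.
    """
    B, C, H, W = fused.shape
    return fused.permute(0, 2, 3, 1).reshape(B, H * W, C)
```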
S105, inputting first query features in the first query feature set into a memory network to obtain first mode features corresponding to the first query features; and splicing each first query feature and the corresponding first mode feature to obtain a first memory feature map.
The first pattern feature is a pattern feature read from the memory network based on the first query feature, the pattern feature is a feature obtained by weighting and summing the items based on the matching probability between each item in the memory network and the first query feature, and the pattern feature can be used for reflecting the matching degree between the first query feature and the item which records the normal behavior pattern and is stored in the memory network.
Specifically, each first query feature in the first query feature set is sequentially input into the memory network, all items stored in the memory network are read, and a first mode feature corresponding to the first query feature is determined according to the matching probability of the first query feature and each item. And respectively splicing each first query feature in the first query feature set with the corresponding first mode feature read in the memory network to obtain a first memory feature map.
Illustratively, each first query feature q_j is concatenated with the corresponding read first pattern feature p′_j and used as a pixel point to obtain the first memory feature map F′_1.
S106, inputting the first memory feature map into a decoder, and decoding the first memory feature map through the decoder to obtain a first predicted image corresponding to the training image.
Specifically, the first memory feature map is input into a decoder, and the first memory feature map can be restored into an image through decoding operation of the decoder, so that a first predicted image corresponding to the training image is obtained.
For example, a 3DCNN-based Decoder performs deconvolution operations on the first memory feature map F′_1 to obtain the predicted image I′ corresponding to the training image I.
And S107, calculating a loss function value based on the first predicted image and the training image, and iteratively adjusting network parameters in the initial abnormal behavior detection model based on the loss function value to obtain a target abnormal behavior detection model.
Wherein the loss function value is used to measure the feature difference between the predicted image and the training image, and may include a reconstruction prediction loss value, a feature compactness loss value and a feature separation loss value.
Specifically, a reconstruction prediction loss value, a feature compactness loss value and a feature separation loss value are respectively determined according to the first predicted image and the training image, and are summed to obtain the loss function value; the network parameters in the initial abnormal behavior detection model are then adjusted based on the loss function value.
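The embodiment does not spell out the three loss terms at this point. The sketch below shows one common formulation of reconstruction prediction, feature compactness and feature separation losses, as used for example in memory-augmented normality models; the exact formulas, weights and margin are assumptions offered purely for illustration.

```python
import torch
import torch.nn.functional as F

def total_loss(pred, target, queries, items, margin=1.0,
               w_compact=0.1, w_separate=0.1):
    """Sketch of the combined loss (formulation assumed; requires T >= 2 items).

    pred, target : (B, 3, H, W) first predicted image and target training image
    queries      : (B, J, C) first query features
    items        : (T, C) memory items
    """
    # reconstruction prediction loss: pixel-wise L2 between prediction and target
    l_rec = F.mse_loss(pred, target)

    # distances from every query to every item: (B, J, T)
    d = torch.cdist(queries, items.unsqueeze(0).expand(queries.size(0), -1, -1))
    dists, _ = d.topk(2, dim=-1, largest=False)        # two nearest items

    # compactness: pull each query toward its nearest item
    l_compact = dists[..., 0].pow(2).mean()

    # separation: push the second-nearest item at least `margin` farther away
    l_separate = F.relu(dists[..., 0] - dists[..., 1] + margin).mean()

    return l_rec + w_compact * l_compact + w_separate * l_separate
```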
According to the technical scheme of this embodiment, a first video is acquired, and first preprocessing and image frame interception are performed on the first video to obtain a training image set; the training image set is input into an initial abnormal behavior detection model, wherein the initial abnormal behavior detection model comprises: a first feature extraction network, a second feature extraction network, a feature fusion network, a memory network and a decoder; feature extraction is performed on the training image through the first feature extraction network to obtain first image features; target detection and feature extraction are performed on the training image through the second feature extraction network to obtain first object features; and the first image features and the first object features are input into the feature fusion network to obtain the first fusion feature map output by the feature fusion network. Object features and image features can thus be fused and the representation capability of the features is enhanced; the problem of low detection accuracy caused by the inability of image features alone to detect abnormal behaviors contained in interactions of different kinds of objects is solved, and the detection accuracy is improved.
Example two
Fig. 2A is a flowchart of a training method of an abnormal behavior detection model according to a second embodiment of the present invention, where the steps of "performing a first preprocessing on a first video and capturing an image frame to obtain a training image set" in the foregoing embodiment are further refined. As shown in fig. 2A, the method includes:
s201, acquiring a first video, and performing first preprocessing on the first video to obtain a training video set, wherein the first preprocessing comprises: video slicing, data cleansing and random sampling of video segments.
Specifically, the acquired first video is segmented into M video segments; data cleansing is performed on each video segment to form an effective video set; and each effective video in the set is sampled with a random length to obtain the training video set.
An exemplary criterion for data cleansing is the removal of large video segments outside office hours with little personnel activity. Other criteria may be applied to data cleansing according to actual requirements, and the embodiment of the present invention is not limited in this respect. The video segments may be randomly sampled as follows: from each effective video, a video segment v with sampling interval t and random duration τ is taken as a training video, forming the training video set {v_b}, b=1,…,N_train, where N_train is the training data set size, τ is a random number and τ ∈ [τ_min, τ_max].
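As an illustration of the random-length sampling, a minimal sketch follows; the function name, parameter values and handling of short clips are assumptions, not part of the original.

```python
import random

def sample_training_videos(effective_videos, tau_min=64, tau_max=256):
    """Sample one clip of random length tau from each effective video (sketch).

    effective_videos: list of frame sequences (each a list of frames).
    tau_min, tau_max: assumed bounds for the random clip length tau.
    """
    training_set = []
    for frames in effective_videos:
        tau = random.randint(tau_min, tau_max)            # tau in [tau_min, tau_max]
        if len(frames) <= tau:
            training_set.append(list(frames))             # shorter than tau: keep whole video
        else:
            start = random.randint(0, len(frames) - tau)  # random start point
            training_set.append(frames[start:start + tau])
    return training_set
```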
S202, respectively carrying out image frame sliding interception on each training video in the training video set based on the sliding window, and forming a training image set according to each frame image contained in the sliding window after each sliding.
Specifically, each training video v_b = {f_i}, i=1,…,m, in the training video set, where f_i is the i-th frame image and m is the total number of frames, is intercepted with a sliding window W whose width is n+1 frames; the n+1 frame images contained in the sliding window W after each slide form one training image set {I_i}, i=1,…,n+1, n≥1.
For example, with the step size of the sliding window set to 1: if (n+1) < m, the sliding window forms (m−n) training image sets; if (n+1) ≥ m, some frames of the training video v_b are multiplexed to form a single training image set.
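A minimal sketch of the sliding-window interception with step size 1, including the frame-multiplexing case described in the example above; the names are illustrative.

```python
def make_training_image_sets(training_video, n=4):
    """Slide a window of width n+1 over the frames of one training video.

    Returns a list of training image sets, each containing n+1 frames;
    if the video has fewer than n+1 frames, frames are multiplexed
    (repeated) to form a single set, as in the example above.
    """
    m = len(training_video)
    if n + 1 < m:
        # step size 1: (m - n) sets of n+1 consecutive frames
        return [training_video[i:i + n + 1] for i in range(m - n)]
    # (n+1) >= m: repeat frames cyclically to fill one set of n+1 frames
    return [[training_video[i % m] for i in range(n + 1)]]
```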
In addition, in order to ensure that the image sizes input into the initial abnormal behavior detection model are the same, the image sizes in the training image set can be normalized.
S203, inputting a training image set into an initial abnormal behavior detection model; wherein the initial abnormal behavior detection model comprises: the system comprises a first feature extraction network, a second feature extraction network, a feature fusion network, a memory network and a decoder.
S204, performing feature extraction on the training image through a first feature extraction network to obtain first image features.
Illustratively, in the first feature extraction network, a 3DCNN-based Encoder performs feature extraction on the training image I_i to obtain the first image feature F_RGB:
F_RGB = Enc(I_i), i=1,…,n
where F_RGB represents the first image feature, I_i represents the i-th training image in the training image set, n is the number of training images used from the training image set, and Enc represents the encoder in the first feature extraction network.
Illustratively, in the second feature extraction network, a pre-trained object detector performs target detection on the training images to obtain initial mask matrices, from which the final mask matrix M′_i is calculated; feature extraction is then performed on this fused weighted mask matrix by another 3DCNN-based Encoder to obtain the first object feature F_obj:
F_obj = Emb_Enc(M′_i), i=1,…,n
where F_obj represents the first object feature, M′_i represents the final mask matrix calculated from the initial mask matrices output by the object detector for the i-th training image, and Emb_Enc represents the encoder in the second feature extraction network.
S205, performing target detection and feature extraction on the training image through a second feature extraction network to obtain a first object feature.
Optionally, as shown in fig. 2B, the second feature extraction network includes: a pre-trained target detector, a mask matrix fusion layer and an embedded coding layer;
step S205 includes:
s2051, performing target detection on the first n frames of training images in the training image set through the pre-trained target detector to obtain at least one target object and a mask matrix corresponding to the target object; wherein the training image set includes (n+1) frames of training images.
The pre-trained target detector is used for performing target detection on the training image to detect a target object in the training image, and the target object may include: a person or other object. The embodiment of the invention does not limit the target detector as long as the target object in the training image can be identified.
Specifically, the (n+1)-th frame training image in the training image set is taken as the target training image. A pre-trained target detector performs target detection on the first n frames of training images in the set, yielding the target objects determined by the detector in each training image and the mask matrix corresponding to each target object. Each target object on each training image corresponds to a mask matrix M_i^k, k=1,…,K; i=1,…,n; the size of the mask matrix is consistent with that of the training image. The non-zero elements of M_i^k correspond to the spatio-temporal positions on the i-th frame training image where the k-th target object is present, and its zero elements correspond to the positions where the k-th target object is not present.
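For illustration, one way the per-object mask matrices could be built from a detector's bounding-box output is sketched below; the box format and the box-to-mask conversion are assumptions, since the embodiment does not fix a specific detector.

```python
import torch

def boxes_to_masks(boxes, H, W):
    """Build one binary mask matrix per detected object (sketch; box format assumed).

    boxes: (K, 4) tensor of [x1, y1, x2, y2] pixel coordinates from the detector.
    Returns masks of shape (K, H, W): 1 inside the object's box, 0 elsewhere.
    """
    masks = torch.zeros(len(boxes), H, W)
    for k, (x1, y1, x2, y2) in enumerate(boxes.round().long().tolist()):
        masks[k, max(y1, 0):min(y2, H), max(x1, 0):min(x2, W)] = 1.0
    return masks
```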
S2052, calculating the confidence coefficient and the label value of each target object detected on the training image according to the mask matrix corresponding to each target object through the mask matrix fusion layer, and carrying out weighted fusion on the mask matrix of each target object on the training image according to the confidence coefficient and the label value to obtain a fusion weighted mask matrix of the training image.
Specifically, the mask matrix fusion layer calculates, for each mask matrix M_i^k, a confidence c_i^k representing the likelihood that the k-th target object is detected on the i-th frame training image, and a label value l_i^k indicating the label of the k-th target object detected on the i-th frame training image. Taking the confidence c_i^k and the label value l_i^k as weights, the weighted mask matrix corresponding to M_i^k is calculated as:
W_i^k = c_i^k · l_i^k · M_i^k
For the i-th frame training image, the weighted mask matrices corresponding to the K target objects are fused to obtain the fused weighted mask matrix M′_i:
M′_i = Σ_{k=1}^{K} W_i^k
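A sketch of this weighted fusion, under the assumption (consistent with the reconstruction above) that the weighted mask is the confidence- and label-weighted mask matrix and that fusion sums over the K objects:

```python
import torch

def fuse_masks(masks, conf, labels):
    """Fuse the per-object mask matrices of one frame into M'_i (sketch).

    masks  : (K, H, W) binary mask matrix of each of the K target objects
    conf   : (K,) confidence c_i^k of each detection
    labels : (K,) label value l_i^k of each detection
    """
    # weighted mask of object k: c_i^k * l_i^k * M_i^k  (assumed form)
    weighted = conf.view(-1, 1, 1) * labels.view(-1, 1, 1) * masks
    # fused weighted mask matrix M'_i: sum over the K objects (assumed fusion)
    return weighted.sum(dim=0)
```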
s2053, performing three-dimensional convolution operation on the fusion weighted mask matrix through the embedded coding layer to obtain a first object feature.
Specifically, the embedded coding layer performs feature extraction on the fused weighted mask matrix using another 3DCNN-based Encoder to obtain the first object feature F_obj:
F_obj = Emb_Enc(M′_i), i=1,…,n
where F_obj represents the first object feature and Emb_Enc represents the encoder in the embedded coding layer.
S206, inputting the first image features and the first object features into a feature fusion network to obtain a first fusion feature map output by the feature fusion network, and splitting pixels of the first fusion feature map to obtain a first query feature set.
In the feature fusion network, the first image features and the object-level features are fused by a fusion function to obtain the first fusion feature map, for example:
F = Agg(F_RGB, F_OBJ) = F_RGB ⊙ F_OBJ
where ⊙ denotes Hadamard (element-wise) matrix multiplication, Agg represents the fusion function, F_RGB represents the first image feature, and F_OBJ represents the first object feature.
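The fusion function itself reduces to an element-wise product, as the short sketch below shows; the tensor shapes are assumed.

```python
import torch

def agg(f_rgb, f_obj):
    """Agg(F_RGB, F_OBJ) = F_RGB ⊙ F_OBJ, the element-wise (Hadamard) product.

    Both feature maps must share the same shape, e.g. (B, C, H, W).
    """
    return f_rgb * f_obj  # '*' on equal-shaped tensors is the Hadamard product
```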
S207, inputting first query features in a first query feature set into a memory network to obtain first mode features corresponding to the first query features; and splicing each first query feature and the corresponding first mode feature to obtain a first memory feature map.
S208, inputting the first memory feature map into a decoder, and decoding the first memory feature map through the decoder to obtain a first predicted image corresponding to a target training image in the training image set; the target training image is the (n+1) th frame training image in the training image set.
Specifically, a first memory feature map formed according to the first n frames of training images in the training image set is input into a decoder, and decoding operation is performed on the first memory feature map through the decoder, so that the (n+1) th frame of training image, namely, a first prediction image corresponding to the target training image, can be obtained through reconstruction.
S209, calculating a loss function value based on the first predicted image and the target training image, and performing iterative adjustment on network parameters in the initial abnormal behavior detection model based on the loss function value to obtain a target abnormal behavior detection model.
According to the technical scheme, a first video is obtained, first preprocessing and image frame interception are carried out on the first video, and a training image set is obtained; inputting the training image set into an initial abnormal behavior detection model; wherein the initial abnormal behavior detection model comprises: the system comprises a first feature extraction network, a second feature extraction network, a feature fusion network, a memory network and a decoder. The training image is subjected to feature extraction through a first feature extraction network to obtain first image features; performing target detection and feature extraction on the training image through a second feature extraction network to obtain first object features; the first image feature and the first object feature are input into the feature fusion network to obtain a first fusion feature map, the object feature and the image feature can be fused, the representation capability of the feature is enhanced, the problem that abnormal behaviors contained in interactions of different kinds of objects cannot be detected by using the image feature singly, and therefore detection accuracy is low is solved, and detection accuracy is improved.
Optionally, inputting the first query feature in the first query feature set into the memory network to obtain a first mode feature corresponding to the first query feature includes:
reading all items stored in the memory network, wherein the items are used for recording normal behavior modes;
determining a two-dimensional correlation matrix according to cosine similarity between each first query feature in the first query feature set and each item;
for each first query feature, determining a first matching probability of the first query feature and each item according to the two-dimensional correlation matrix; and carrying out weighted summation on each item based on the first matching probability to obtain a first mode feature corresponding to the first query feature.
Specifically, as shown in FIG. 2B, the first query feature q_j is input into the memory module of the deep learning model to read items in the memory module or to update them, so that the items record normal behavior patterns. The memory module comprises T items p_t, t=1,…,T, for recording normal behavior patterns. When reading the items in the memory module, a two-dimensional correlation matrix is determined from the cosine similarity between each first query feature q_j and each item p_t in the memory module, for example:
r_jt = (q_j · p_t) / (‖q_j‖ ‖p_t‖)
where R represents the two-dimensional correlation matrix and r_jt represents the cosine similarity between the j-th first query feature q_j and the t-th item p_t in the memory module. In the two-dimensional correlation matrix, the rows correspond to the first query features q_j in the first query feature set and the columns to the items p_t in the memory module.
For each first query feature q_j, the Softmax function value of each column element r_jt of the two-dimensional correlation matrix R is calculated in column units to obtain the first matching probability w_jt between the first query feature q_j and each item p_t, i.e. w_jt = Softmax_t(r_jt), t=1,…,T, j=1,…,J.
For each first query feature q_j, the items p_t are weighted and summed based on the first matching probabilities w_jt to obtain the first pattern feature p′_j corresponding to q_j, i.e.:
p′_j = Σ_{t=1}^{T} w_jt · p_t
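The read operation can be summarized in a few lines; the tensor shapes and function name in this sketch are assumptions.

```python
import torch
import torch.nn.functional as F

def read_memory(queries, items):
    """Memory read (sketch): queries (J, C), items (T, C) -> memory features (J, 2C)."""
    # two-dimensional correlation matrix R: r_jt = cosine similarity(q_j, p_t)
    r = F.normalize(queries, dim=1) @ F.normalize(items, dim=1).t()  # (J, T)
    # first matching probability w_jt: Softmax over items t, per query j
    w = F.softmax(r, dim=1)                                          # (J, T)
    # first pattern feature p'_j = sum_t w_jt * p_t
    patterns = w @ items                                             # (J, C)
    # first memory feature map pixel: concatenate q_j with p'_j
    return torch.cat([queries, patterns], dim=1)                     # (J, 2C)
```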
Optionally, after determining the two-dimensional correlation matrix according to the cosine similarity between each first query feature in the first query feature set and each item, the method further includes:
and updating the items in the memory network according to the two-dimensional correlation matrix.
Specifically, as shown in fig. 2B, the memory module may perform an operation of updating the item after determining the two-dimensional correlation matrix in addition to the operation of reading the item.
Optionally, updating the item in the memory network according to the two-dimensional correlation matrix includes:
determining a second matching probability of each item in the memory network and each first query feature according to the two-dimensional correlation matrix;
determining a set of best matching query features based on the best matching first query features for each item in the memory network; the best matching first query feature of an item is the first query feature whose first matching probability for that item is the largest;
for each item in the memory network, determining a normalized second matching probability according to the second matching probability of the first query feature in the set of best matching query features;
for each item in the memory network, carrying out weighted summation on first query features in the best matching query feature set corresponding to the item based on the normalized second matching probability to obtain an updated value;
and for each item in the memory network, updating the item according to the item and the corresponding two-norm normalized value of the updated value.
Specifically, when updating the items, the memory module calculates the Softmax function value of each row element r_jt of the two-dimensional correlation matrix in row units to obtain the second matching probability u_jt between item p_t and each first query feature q_j, i.e. u_jt = Softmax_j(r_jt), t=1,…,T, j=1,…,J. The first matching probability may be regarded as a forward matching probability and the second matching probability as a reverse matching probability.
For each item p_t in the memory module, all first query features that best match p_t are determined, i.e. those whose first matching probability w_jt attains its maximum at item t, and they form the best matching query feature set U_pt of item p_t. For each item p_t in the memory network, a normalized second matching probability u′_t,j is determined from the second matching probabilities u_jt of the first query features in U_pt. Based on the normalized second matching probabilities u′_t,j, the first query features in the best matching query feature set U_pt corresponding to item p_t are weighted and summed to obtain the update value:
p_t,update = Σ_{q_j ∈ U_pt} u′_t,j · q_j
where p_t,update is the update value corresponding to item p_t and q_j is a first query feature in the best matching query feature set U_pt of item p_t.
Based on the two-norm normalization function, item p_t is updated with its update value p_t,update as:
p_t ← L2(p_t + p_t,update)
i.e. each item stored in the memory network is replaced by the two-norm normalized value of the sum of the item and its corresponding update value.
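The update operation, following the steps above, can be sketched as below; the argmax-based construction of the best matching sets and the loop form are illustrative choices.

```python
import torch
import torch.nn.functional as F

def update_memory(queries, items):
    """Memory update (sketch): queries (J, C), items (T, C) -> updated items (T, C)."""
    r = F.normalize(queries, dim=1) @ F.normalize(items, dim=1).t()  # (J, T) correlation
    w = F.softmax(r, dim=1)   # first (forward) matching probability, over items
    u = F.softmax(r, dim=0)   # second (reverse) matching probability, over queries

    nearest = w.argmax(dim=1)                 # best-matching item index per query
    new_items = items.clone()
    for t in range(items.size(0)):
        matched = (nearest == t).nonzero(as_tuple=True)[0]  # U_pt: queries matching item t
        if matched.numel() == 0:
            continue
        u_norm = u[matched, t] / u[matched, t].sum()        # normalized second matching prob.
        update = (u_norm.unsqueeze(1) * queries[matched]).sum(dim=0)
        new_items[t] = F.normalize(items[t] + update, dim=0)  # p_t <- L2(p_t + p_t,update)
    return new_items
```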
In order to verify the performance of the target abnormal behavior detection model obtained through training, a prediction result of the target abnormal behavior detection model can be tested by adopting a test video, and the prediction result is compared with label data of the test video to verify the performance of the target abnormal behavior detection model.
Optionally, after step S211, the method further includes:
s212, acquiring a second video containing known abnormal behaviors, and performing first preprocessing and behavior tag data labeling on the second video to obtain a test video set with behavior tag data; the second video and the first video are acquired by adopting cameras with the same visual angle.
Wherein the second video may be understood as a video for verifying the performance of the target abnormal behavior detection model. The second video is acquired by adopting a camera with the same visual angle as the first video, and in order to verify whether the target abnormal behavior detection model can effectively detect abnormal behaviors, the second video needs to comprise known abnormal behaviors. Abnormal behavior may be suspicious or unusual behavior within the acquired scene.
Specifically, the second video is preprocessed in the same first preprocessing manner as the first video to obtain a test video set {v_b}, b=1,…,N_test, where N_test is the number of test videos contained in the test video set; the first preprocessing is not described again in this embodiment. Each test video in the preprocessed test video set is labeled with behavior label data: a test video v_b containing abnormal behavior is labeled as abnormal, and the remaining test videos are labeled as normal, forming a test video set with behavior label data, where 0 represents "normal" and 1 represents "abnormal".
S213, inputting the test videos in the test video set into the target abnormal behavior detection model to obtain an abnormal behavior test result output by the target abnormal behavior detection model.
The abnormal behavior detection result comprises: normal or abnormal.
Specifically, the test videos in the test video set are sequentially input into a trained target abnormal behavior detection model, and the target abnormal behavior detection model is adopted to detect abnormal behaviors of the test video set so as to obtain an abnormal behavior detection result of each test video.
S214, verifying the target abnormal behavior detection model according to the abnormal behavior test result and the abnormal behavior label data.
Specifically, the performance of the abnormal behavior detection model can be verified from the abnormal behavior label data of each test video and the detected abnormal behavior test results. The performance of the abnormal behavior detection model may be measured with indicators such as accuracy and recall; the calculation of such model performance indicators from the abnormal behavior test results and the abnormal behavior label data may follow existing methods, which the embodiment of the present invention does not limit.
Optionally, step S213, inputting the test video of the test video set into the target abnormal behavior detection model, and obtaining the abnormal behavior test result output by the target abnormal behavior detection model includes:
s2131, carrying out feature extraction on a test image contained in the test video through the first feature extraction network to obtain a second image feature; and carrying out target detection and feature extraction on the test image through the second feature extraction network to obtain a second object feature.
The second image features are features of the image level extracted by the test image through the first feature extraction network, and the second object features are features of the object level extracted by the test image through the second feature extraction network; both the second image feature and the second object feature may be represented in the form of feature maps.
Specifically, a test image contained in a test video is respectively input into a first feature extraction network and a second feature extraction network, so that a second image feature output by the first feature extraction network and a second object feature output by the second feature extraction network are obtained. It can be appreciated that the feature extraction process of the test image and the feature extraction process of the training image are the same in principle.
S2132, inputting the second image feature and the second object feature into the feature fusion network to obtain a second fusion feature output by the feature fusion network, and splitting pixels of the second fusion feature to obtain a second query feature set.
The second fusion feature map is a feature obtained by fusing the second image feature and the second object feature. The second query feature set is a set of features of each pixel in the fused feature as a second query feature.
Specifically, the second image feature and the second object feature are input into the feature fusion network, and the feature fusion network performs feature fusion on them to obtain the second fusion feature map F_2. Each pixel point of F_2 is split into a second query feature q_g, forming the second query feature set Q_2 = {q_g}, g=1,…,G, where G represents the number of pixels in the second fusion feature map F_2, i.e. the total number of second query features in the second query feature set.
S2133, inputting second query features in the second query feature set into the memory network to obtain second matching items and second mode features matched with the second query features, and splicing the second query features and the corresponding second mode features to obtain a second memory feature map.
Specifically, each second query feature in the second query feature set is sequentially input into the memory network, all items stored in the memory network are read, and the second mode feature corresponding to the second query feature is determined according to the matching probability of the second query feature and each item. And respectively splicing each second query feature in the second query feature set with the corresponding second mode feature read in the memory network to obtain a second memory feature map.
Illustratively, each second query feature q_g is concatenated with the corresponding read second pattern feature p′_g and used as a pixel point to obtain the second memory feature map F′_2.
S2134, inputting the second memory feature map into the decoder, and performing decoding operation on the second memory feature map by the decoder to obtain a second predicted image corresponding to the test image.
Specifically, the second memory feature map is input into a decoder, and the second memory feature map can be restored into an image through the decoding operation of the decoder, so that a second predicted image corresponding to the test image is obtained.
S2135, determining an outlier according to the two-norm distance between the second query feature and the corresponding best matching item and the normalized peak signal-to-noise ratio between the test image and the second predicted image.
Specifically, the two-norm distance D between each second query feature q_g and its best matching item p in the memory module is calculated, and D is transformed to a value D′ in the interval [0,1] using a min-max normalization operation. The peak signal-to-noise ratio (Peak Signal to Noise Ratio, PSNR) between the test image and the second predicted image is calculated and likewise transformed to a value PSNR′ in the interval [0,1] using a min-max normalization operation. The abnormal value S is determined as a weighted average of the value D′ and the value PSNR′.
S2136, determining an abnormal behavior test result of the test video based on the abnormal value.
Specifically, an outlier is set as S, and a threshold value is set as lambda; when S > lambda, judging that abnormal behavior occurs in the test video; and when S is less than or equal to lambda, judging that no abnormal behavior occurs in the test video.
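A sketch of the abnormal-value computation and thresholding follows; the equal weighting and the inversion of PSNR′ (so that poor predictions raise the score) are assumptions not stated explicitly in the embodiment.

```python
import torch

def minmax(x):
    """Min-max normalize a batch of values to [0, 1] (assumes more than one value)."""
    return (x - x.min()) / (x.max() - x.min() + 1e-8)

def anomaly_scores(d, psnr, alpha=0.5):
    """S = weighted average of normalized distance D' and normalized PSNR' (sketch).

    d    : (N,) two-norm distance between each query feature and its best matching item
    psnr : (N,) PSNR between each test image and its predicted image
    alpha: weighting between the two terms (value assumed)
    """
    d_norm = minmax(d)
    # low PSNR (poor prediction) should raise the score, hence 1 - PSNR' (assumed)
    psnr_norm = 1.0 - minmax(psnr)
    return alpha * d_norm + (1 - alpha) * psnr_norm

# decision sketch: abnormal if S > lambda (threshold value assumed)
# is_abnormal = anomaly_scores(d, psnr) > 0.7
```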
Example III
Fig. 3 is a flowchart of an intelligent detection method for abnormal behavior in a video, which is provided in a third embodiment of the present invention, where the method may be performed by an intelligent detection device for abnormal behavior in a video based on a trained and complete abnormal behavior detection model, the intelligent detection device for abnormal behavior in a video may be implemented in hardware and/or software, and the intelligent detection device for abnormal behavior in a video may be configured in an electronic device. As shown in fig. 3, the method includes:
s301, acquiring a third video, and performing second preprocessing on the third video to obtain a video to be detected.
The third video is a video on which abnormal behavior detection needs to be performed, and may be a video shot in any scene, for example, a monitoring video of an office scene. The second preprocessing may include: data cleaning, denoising, video slicing and the like.
By way of example, large sections of video with little personnel activity outside office hours are removed through data cleaning, and the cleaned video is segmented into several sections of video to be detected.
S302, inputting the video to be detected into a target abnormal behavior detection model obtained by training by using the training method of the abnormal behavior detection model.
The target abnormal behavior detection model adopts the training method of the abnormal behavior detection model of any one of the first embodiment to the second embodiment of the invention to train a complete abnormal behavior detection model.
Specifically, the images to be detected in the video to be detected are input into the target abnormal behavior detection model for behavior detection to obtain an abnormal value. The target abnormal behavior detection model comprises a first feature extraction network, a second feature extraction network, a feature fusion network, a memory network and a decoder, and detection proceeds as follows: feature extraction is performed on the images to be detected in the image set to be detected through the first feature extraction network to obtain third image features; target detection and feature extraction are performed on the images to be detected through the second feature extraction network to obtain third object features; the third image features and the third object features are input into the feature fusion network to obtain a third fusion feature map output by the feature fusion network, whose pixel points are split to obtain a third query feature set; the third query features in the third query feature set are input into the memory network to obtain the third matching items and third pattern features matched with the third query features, and each third query feature is concatenated with the corresponding third pattern feature to obtain a third memory feature map; the third memory feature map is input into the decoder and decoded to obtain a third predicted image corresponding to the image to be detected; and the abnormal value is determined from the two-norm distance between the third query features and the third matching items and the normalized peak signal-to-noise ratio between the image to be detected and the third predicted image.
S303, acquiring the abnormal value of the image set to be detected output by the target abnormal behavior detection model, and determining an abnormal behavior test result of the video to be detected based on the abnormal value.
Specifically, let the abnormal value be S and the threshold be λ; when S > λ, it is determined that abnormal behavior occurs in the video to be detected; when S ≤ λ, it is determined that no abnormal behavior occurs in the video to be detected.
According to the technical scheme of this embodiment, the third video is acquired and subjected to the second preprocessing to obtain the video to be detected; the video to be detected is input into the target abnormal behavior detection model obtained by training with the training method of the abnormal behavior detection model; the abnormal value output by the target abnormal behavior detection model is acquired, and the abnormal behavior test result of the video to be detected is determined based on the abnormal value. By fusing object features and image features, the representation capability of the features is enhanced; this solves the problem that image features alone cannot detect abnormal behaviors contained in interactions between different types of objects, which leads to low detection accuracy, and thereby improves detection precision.
Example IV
Fig. 4 is a schematic structural diagram of a training device for an abnormal behavior detection model according to a fourth embodiment of the present invention. As shown in fig. 4, the apparatus includes:
The training image acquisition module 410 is configured to acquire a first video, perform first preprocessing on the first video, and intercept an image frame to obtain a training image set;
a training image input module 420 for inputting the training image set into an initial abnormal behavior detection model; wherein the initial abnormal behavior detection model comprises: a first feature extraction network and a second feature extraction network, a feature fusion network, a memory network and a decoder;
a feature extraction module 430, configured to perform feature extraction on the training image through the first feature extraction network to obtain a first image feature; performing target detection and feature extraction on the training image through the second feature extraction network to obtain a first object feature;
the feature fusion module 440 is configured to input the first image feature and the first object feature into the feature fusion network to obtain a first fusion feature map output by the feature fusion network, and split the pixel points of the first fusion feature map to obtain a first query feature set;
a memory query module 450, configured to input a first query feature in the first query feature set into the memory network to obtain a first mode feature corresponding to the first query feature; splicing each first query feature and the corresponding first mode feature to obtain a first memory feature map;
The decoding module 460 is configured to input the first memory feature map to the decoder, and perform a decoding operation on the first memory feature map by the decoder to obtain a first predicted image corresponding to the training image;
and the parameter adjustment module 470 is configured to calculate a loss function value based on the first predicted image and the training image, and iteratively adjust network parameters in the initial abnormal behavior detection model based on the loss function value, so as to obtain a target abnormal behavior detection model.
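As a sketch of the parameter adjustment performed by module 470, one training iteration might look as below, assuming a simple L2 intensity loss between the first predicted image and the target training image; the loss terms and the optimizer are assumptions, since the embodiment only states that a loss function value is computed and network parameters are iteratively adjusted:

```python
import torch

def train_step(model, optimizer, frames, target_frame):
    # One iteration of loss computation and network-parameter adjustment.
    optimizer.zero_grad()
    pred = model(frames)                           # first predicted image
    loss = torch.mean((pred - target_frame) ** 2)  # assumed L2 prediction loss
    loss.backward()
    optimizer.step()                               # iterative parameter update
    return loss.item()
```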
Optionally, the second feature extraction network includes: a pre-trained target detector, a mask matrix fusion layer and an embedded coding layer;
performing target detection on the first n frames of training images in the training image set through the pre-trained target detector to obtain at least one target object and a mask matrix corresponding to the target object; wherein the training image set comprises (n+1) frames of training images;
calculating the confidence coefficient and the label value of each target object detected on a training image according to the mask matrix corresponding to each target object through the mask matrix fusion layer, and carrying out weighted fusion on the mask matrix of each target object on the training image according to the confidence coefficient and the label value to obtain a fusion weighted mask matrix of the training image;
and carrying out three-dimensional convolution operation on the fusion weighted mask matrix through the embedded coding layer to obtain a first object feature.
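A minimal sketch of the mask matrix fusion layer and the embedded coding layer follows; since the text does not give the exact weighting formula, weighting each object's mask by the product of its confidence and label value is an assumption, as are the layer sizes:

```python
import torch
import torch.nn as nn

def fuse_masks(masks, confidences, labels):
    # masks: (num_obj, H, W) binary mask matrices; confidences, labels: (num_obj,)
    weights = confidences * labels                       # assumed weighting rule
    return (weights.view(-1, 1, 1) * masks).sum(dim=0)   # fusion weighted mask matrix

# Embedded coding layer: a three-dimensional convolution over the n fused masks.
embed = nn.Conv3d(in_channels=1, out_channels=64, kernel_size=3, padding=1)

n = 4  # illustrative number of input frames
fused = torch.stack([
    fuse_masks((torch.rand(5, 64, 64) > 0.5).float(),    # 5 detected objects
               torch.rand(5), torch.arange(1.0, 6.0))
    for _ in range(n)
])                                                       # (n, H, W)
obj_feat = embed(fused.unsqueeze(0).unsqueeze(0))        # first object feature, (1, 64, n, H, W)
```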
Optionally, the decoding module 460 includes:
the decoding unit is used for decoding the first memory feature map through the decoder to obtain a first predicted image corresponding to a target training image in the training image set; the target training image is an (n+1) -th frame training image in the training image set.
Optionally, the parameter adjustment module 470 includes:
and a loss value calculation unit configured to calculate a loss function value based on the first predicted image and the target training image.
Optionally, the memory query module 450 includes:
the first reading unit is used for reading all the items stored in the memory network, wherein the items are used for recording normal behavior patterns;
the first matrix determining unit is used for determining a two-dimensional correlation matrix according to cosine similarity between each first query feature in the first query feature set and each item;
a first matching probability determining unit, configured to determine, for each of the first query features, a first matching probability between the first query feature and each of the items according to the two-dimensional correlation matrix; and carrying out weighted summation on each item based on the first matching probability to obtain a first mode feature corresponding to the first query feature.
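The read operation of the memory query module can be sketched as follows; the softmax used to turn the correlation matrix into first matching probabilities is an assumption, as the text only states that the probabilities are determined from the two-dimensional correlation matrix:

```python
import torch
import torch.nn.functional as F

def read_memory(queries, items):
    # queries: (num_q, D) first query features; items: (num_items, D) normal-pattern items
    corr = F.cosine_similarity(
        queries.unsqueeze(1), items.unsqueeze(0), dim=-1
    )                                          # two-dimensional correlation matrix
    match_prob = torch.softmax(corr, dim=1)    # first matching probabilities (assumed softmax)
    pattern = match_prob @ items               # weighted sum of items: first mode features
    matched = items[corr.argmax(dim=1)]        # best-matching item per query
    return pattern, matched, corr
```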
Optionally, the method further comprises:
and the item updating module is used for updating the items in the memory network according to the two-dimensional correlation matrix after determining the two-dimensional correlation matrix according to the cosine similarity between each first query feature in the first query feature set and each item.
Optionally, the item updating module includes:
a second matching probability determining unit, configured to determine a second matching probability of each item in the memory network and each of the first query features according to the two-dimensional correlation matrix;
a set determination unit for determining a set of best matching query features based on the best matching first query features of each item in the memory network; the best matching first query feature is the first query feature with the highest probability of matching the item;
a normalization unit, configured to determine, for each item in the memory network, a normalized second matching probability according to a second matching probability of a first query feature in the set of best matching query features;
the updating value determining unit is used for carrying out weighted summation on first query features in the most matched query feature set corresponding to each item based on the normalized second matching probability for each item in the memory network to obtain an updating value;
and the item updating unit is used for updating each item in the memory network according to the two-norm normalized value of the item and the corresponding updated value.
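A sketch of this item update follows; gathering each item's best-matching queries via an argmax over the correlation matrix and using a softmax over the query dimension for the second matching probabilities are assumptions consistent with common memory-network practice, not formulas stated in the text:

```python
import torch
import torch.nn.functional as F

def update_memory(items, queries, corr):
    # corr: (num_q, num_items) two-dimensional correlation matrix
    match_prob = torch.softmax(corr, dim=0)     # second matching probability (assumed softmax)
    nearest = corr.argmax(dim=1)                # the best-matching item for each query
    new_items = items.clone()
    for m in range(items.size(0)):
        idx = (nearest == m).nonzero(as_tuple=True)[0]  # best matching query feature set
        if idx.numel() == 0:
            continue                            # item has no best-matching query
        w = match_prob[idx, m]
        w = w / w.sum()                         # normalized second matching probability
        update = (w.unsqueeze(1) * queries[idx]).sum(dim=0)  # weighted-sum update value
        new_items[m] = F.normalize(items[m] + update, p=2, dim=0)  # two-norm normalization
    return new_items
```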
Optionally, the training image acquisition module includes:
the first preprocessing unit is configured to perform first preprocessing on the first video to obtain a training video set, where the first preprocessing includes: video segmentation, data cleaning and random sampling of video segments;
the image capturing unit is used for respectively carrying out image frame sliding capturing on each training video in the training video set based on the sliding window, and forming a training image set according to each frame image contained in the sliding window after each sliding.
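The sliding interception performed by the image capturing unit can be sketched as below; the window length n+1 = 5 and the stride of 1 are illustrative hyperparameters, not values fixed by the text:

```python
def sliding_windows(frames, window=5, stride=1):
    # frames: list of decoded video frames; each sample holds `window` consecutive frames.
    return [frames[i:i + window]
            for i in range(0, len(frames) - window + 1, stride)]
```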
Optionally, the apparatus further includes:
the test image acquisition module is used for, after the network parameters in the initial abnormal behavior detection model are iteratively adjusted based on the loss function value to obtain the target abnormal behavior detection model, acquiring a second video containing known abnormal behaviors, and performing the first preprocessing and behavior tag data labeling on the second video to obtain a test video set with behavior tag data; the second video and the first video are acquired by cameras with the same visual angle;
The test image detection module is used for inputting the test videos in the test video set into the target abnormal behavior detection model to obtain an abnormal behavior test result output by the target abnormal behavior detection model;
and the model verification module is used for verifying the target abnormal behavior detection model according to the abnormal behavior test result and the abnormal behavior label data.
Optionally, the test image detection module is specifically configured to:
performing feature extraction on the test image through the first feature extraction network to obtain a second image feature; performing target detection and feature extraction on the test image through the second feature extraction network to obtain a second object feature;
inputting the second image feature and the second object feature into the feature fusion network to obtain a second fusion feature output by the feature fusion network, and splitting pixel points of the second fusion feature to obtain a second query feature set;
inputting second query features in the second query feature set into the memory network to obtain second matching items and second mode features matched with the second query features, and splicing the second query features and the corresponding second mode features to obtain a second memory feature map;
inputting the second memory feature map into the decoder, and decoding the second memory feature map through the decoder to obtain a second prediction map corresponding to the test image;
determining an abnormal value according to a two-norm distance between the second query feature and the second matching item and a normalized peak signal-to-noise ratio between the test image and the second prediction map;
and determining an abnormal behavior test result of the test video set based on the abnormal value.
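One plausible formulation of the abnormal value combining the two ingredients named above is sketched here; the min-max normalization over a video and the weighting factor `alpha` are assumptions, since the text does not give the combination formula:

```python
import math
import torch

def psnr(img, pred):
    # Peak signal-to-noise ratio between a test image and its predicted image,
    # assuming pixel intensities scaled to [0, 1].
    mse = torch.mean((img - pred) ** 2).item()
    return 10.0 * math.log10(1.0 / mse)

def abnormal_values(psnrs, dists, alpha=0.6):
    # psnrs, dists: per-frame PSNRs and two-norm distances over one test video
    p = torch.tensor(psnrs)
    d = torch.tensor(dists)
    p = (p - p.min()) / (p.max() - p.min() + 1e-8)  # normalized PSNR over the video
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)
    return (alpha * (1.0 - p) + (1.0 - alpha) * d).tolist()  # higher = more abnormal
```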
The training device for the abnormal behavior detection model provided by the embodiment of the invention can execute the training method for the abnormal behavior detection model provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example five
Fig. 5 is a schematic structural diagram of an intelligent detection device for abnormal behavior in video according to a fifth embodiment of the present invention. As shown in fig. 5, the apparatus includes:
the to-be-detected video acquisition module 510 is configured to acquire a third video, and perform second preprocessing on the third video to obtain a to-be-detected video;
the to-be-detected video input module 520 is configured to input the video to be detected into the target abnormal behavior detection model obtained by training with the training method of the abnormal behavior detection model according to the first embodiment or the second embodiment;
The result determining module 530 is configured to obtain the abnormal value of the video to be detected output by the target abnormal behavior detection model, and determine an abnormal behavior test result of the video to be detected based on the abnormal value.
The intelligent detection device for the abnormal behavior in the video provided by the embodiment of the invention can execute the intelligent detection method for the abnormal behavior in the video provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example six
Fig. 6 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 6, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the ROM 12 or the computer program loaded from the storage unit 18 into the RAM 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the respective methods and processes described above, for example, the training method of the abnormal behavior detection model or the intelligent detection method of abnormal behavior in video.
In some embodiments, the training method of the abnormal behavior detection model or the intelligent detection method of abnormal behavior in video may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the training method of the abnormal behavior detection model or the intelligent detection method of abnormal behavior in video described above may be performed. Alternatively, in other embodiments, processor 11 may be configured in any other suitable manner (e.g., by means of firmware) to perform a training method of an abnormal behavior detection model or an intelligent detection method of abnormal behavior in video.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability in traditional physical hosts and VPS (Virtual Private Server) services.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (14)

1. A method of training an abnormal behavior detection model, the method comprising:
acquiring a first video, performing first preprocessing and image frame interception on the first video to obtain a training image set;
inputting the training image set into an initial abnormal behavior detection model; wherein the initial abnormal behavior detection model comprises: a first feature extraction network and a second feature extraction network, a feature fusion network, a memory network and a decoder;
performing feature extraction on the training image through the first feature extraction network to obtain first image features; performing target detection and feature extraction on the training image through the second feature extraction network to obtain a first object feature;
inputting the first image features and the first object features into the feature fusion network to obtain a first fusion feature image output by the feature fusion network, and splitting pixel points of the first fusion feature image to obtain a first query feature set;
inputting a first query feature in the first query feature set into the memory network to obtain a first mode feature corresponding to the first query feature; splicing each first query feature and the corresponding first mode feature to obtain a first memory feature map;
inputting the first memory feature map into the decoder, and decoding the first memory feature map through the decoder to obtain a first predicted image corresponding to the training image;
and calculating a loss function value based on the first predicted image and the training image, and iteratively adjusting network parameters in the initial abnormal behavior detection model based on the loss function value to obtain a target abnormal behavior detection model.
2. The method of claim 1, wherein the second feature extraction network comprises: a pre-trained target detector, a mask matrix fusion layer and an embedded coding layer;
performing target detection on the first n frames of training images in the training image set through the pre-trained target detector to obtain at least one target object and a mask matrix corresponding to the target object; wherein the training image set comprises (n+1) frames of training images;
calculating the confidence coefficient and the label value of each target object detected on a training image according to the mask matrix corresponding to each target object through the mask matrix fusion layer, and carrying out weighted fusion on the mask matrix of each target object on the training image according to the confidence coefficient and the label value to obtain a fusion weighted mask matrix of the training image;
and carrying out three-dimensional convolution operation on the fusion weighted mask matrix through the embedded coding layer to obtain a first object feature.
3. The method of claim 2, wherein decoding the first memory map by the decoder to obtain a first predicted image corresponding to the training image comprises:
decoding the first memory feature map through the decoder to obtain a first predicted image corresponding to a target training image in the training image set; the target training image is an (n+1)-th frame training image in the training image set;
accordingly, the calculating a loss function value based on the first predicted image and the training image includes:
a loss function value is calculated based on the first predicted image and the target training image.
4. The method of claim 1, wherein inputting a first query feature in the first set of query features into the memory network results in a first pattern feature corresponding to the first query feature comprises:
reading all items stored in the memory network, wherein the items are used for recording normal behavior modes;
determining a two-dimensional correlation matrix according to cosine similarity between each first query feature in the first query feature set and each item;
for each first query feature, determining a first matching probability of the first query feature and each item according to the two-dimensional correlation matrix; and carrying out weighted summation on each item based on the first matching probability to obtain a first mode feature corresponding to the first query feature.
5. The method of claim 4, further comprising, after determining a two-dimensional correlation matrix based on cosine similarities between each first query feature in the first set of query features and each of the terms:
and updating the items in the memory network according to the two-dimensional correlation matrix.
6. The method of claim 5, wherein updating the items in the memory network according to the two-dimensional correlation matrix comprises:
determining a second matching probability of each item in the memory network and each first query feature according to the two-dimensional correlation matrix;
determining a set of best matching query features based on the best matching first query features for each item in the memory network; the best matching first query feature is the first query feature with the highest probability of matching the item;
for each item in the memory network, determining a normalized second matching probability according to the second matching probability of the first query feature in the set of best matching query features;
for each item in the memory network, carrying out weighted summation on first query features in the best matching query feature set corresponding to the item based on the normalized second matching probability to obtain an updated value;
and for each item in the memory network, updating the item according to the two-norm normalized value of the item and the corresponding updated value.
7. The method of claim 1, wherein performing a first preprocessing and image frame truncation on the first video results in a training image set, comprising:
performing first preprocessing on the first video to obtain a training video set, wherein the first preprocessing comprises: video segmentation, data cleaning and random sampling of video segments;
and respectively carrying out image frame sliding interception on each training video in the training video set based on a sliding window, and forming a training image set according to each frame image contained in the sliding window after each sliding.
8. The method of claim 1, further comprising, after iteratively adjusting network parameters in the initial abnormal behavior detection model based on the loss function value to obtain a target abnormal behavior detection model:
acquiring a second video containing known abnormal behaviors, and performing first preprocessing and behavior tag data labeling on the second video to obtain a test video set with behavior tag data; the second video and the first video are acquired by adopting cameras with the same visual angle;
inputting the test videos in the test video set into the target abnormal behavior detection model to obtain an abnormal behavior test result output by the target abnormal behavior detection model;
and verifying the target abnormal behavior detection model according to the abnormal behavior test result and the abnormal behavior label data.
9. The method of claim 8, wherein inputting test videos in the test video set into the target abnormal behavior detection model to obtain an abnormal behavior test result output by the target abnormal behavior detection model comprises:
performing feature extraction on the test image contained in the test video through the first feature extraction network to obtain a second image feature; performing target detection and feature extraction on the test image through the second feature extraction network to obtain a second object feature;
inputting the second image feature and the second object feature into the feature fusion network to obtain a second fusion feature output by the feature fusion network, and splitting pixel points of the second fusion feature to obtain a second query feature set;
inputting second query features in the second query feature set into the memory network to obtain second matching items and second mode features matched with the second query features, and splicing the second query features and the corresponding second mode features to obtain a second memory feature map;
inputting the second memory feature map into the decoder, and decoding the second memory feature map through the decoder to obtain a second predicted image corresponding to the test image;
determining an outlier according to a two-norm distance between the second query feature and the corresponding most-matched item and a normalized peak signal-to-noise ratio between the test image and the second predicted image;
and determining an abnormal behavior test result of the test video based on the abnormal value.
10. An intelligent detection method for abnormal behaviors in video, which is characterized by comprising the following steps:
acquiring a third video, and performing second preprocessing on the third video to obtain a video to be detected;
inputting the video to be detected into a target abnormal behavior detection model obtained by training the training method of the abnormal behavior detection model according to any one of claims 1-9;
and acquiring an abnormal value of the video to be detected, which is output by the target abnormal behavior detection model, and determining an abnormal behavior test result of the video to be detected based on the abnormal value.
11. A training device for an abnormal behavior detection model, comprising:
The training image acquisition module is used for acquiring a first video, performing first preprocessing on the first video and intercepting an image frame to obtain a training image set;
the training image input module is used for inputting the training image set into an initial abnormal behavior detection model; wherein the initial abnormal behavior detection model comprises: a first feature extraction network and a second feature extraction network, a feature fusion network, a memory network and a decoder;
the first feature extraction module is used for carrying out feature extraction on the training image through the first feature extraction network to obtain first image features; performing target detection and feature extraction on the training image through the second feature extraction network to obtain a first object feature;
the first feature fusion module is used for inputting the first image features and the first object features into the feature fusion network to obtain a first fusion feature map output by the feature fusion network, and splitting the pixel points of the first fusion feature map to obtain a first query feature set;
the first memory query module is used for inputting first query features in the first query feature set into the memory network to obtain first mode features corresponding to the first query features; splicing each first query feature and the corresponding first mode feature to obtain a first memory feature map;
The first decoding module is used for inputting the first memory feature map into the decoder, and decoding the first memory feature map through the decoder to obtain a first predicted image corresponding to the training image;
and the parameter adjustment module is used for calculating a loss function value based on the first predicted image and the training image, and carrying out iterative adjustment on network parameters in the initial abnormal behavior detection model based on the loss function value to obtain a target abnormal behavior detection model.
12. An intelligent detection device for abnormal behavior in video, comprising:
the to-be-detected video acquisition module is used for acquiring a third video, and performing second preprocessing on the third video to obtain a to-be-detected video;
the video input module to be detected is used for inputting the video to be detected into a target abnormal behavior detection model obtained by training the training method of the abnormal behavior detection model according to any one of claims 1-9;
the result determining module is used for obtaining the abnormal value of the video to be detected, which is output by the target abnormal behavior detection model, and determining an abnormal behavior test result of the video to be detected based on the abnormal value.
13. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the training method of the abnormal behavior detection model of any one of claims 1-9 or to implement the intelligent detection method of abnormal behavior in video as claimed in claim 10.
14. A computer readable storage medium storing computer instructions for causing a processor to implement a training method of an abnormal behavior detection model according to any one of claims 1-9 or an intelligent detection method of abnormal behavior in video according to claim 10 when executed.
CN202310003213.9A 2023-01-03 2023-01-03 Intelligent detection method, device, equipment and storage medium for abnormal behavior in video Pending CN116030390A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310003213.9A CN116030390A (en) 2023-01-03 2023-01-03 Intelligent detection method, device, equipment and storage medium for abnormal behavior in video

Publications (1)

Publication Number Publication Date
CN116030390A true CN116030390A (en) 2023-04-28

Family

ID=86078035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310003213.9A Pending CN116030390A (en) 2023-01-03 2023-01-03 Intelligent detection method, device, equipment and storage medium for abnormal behavior in video

Country Status (1)

Country Link
CN (1) CN116030390A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117197737A (en) * 2023-09-08 2023-12-08 数字广东网络建设有限公司 Land use detection method, device, equipment and storage medium
CN117197737B (en) * 2023-09-08 2024-05-28 数字广东网络建设有限公司 Land use detection method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109344787B (en) Specific target tracking method based on face recognition and pedestrian re-recognition
WO2019119505A1 (en) Face recognition method and device, computer device and storage medium
CN108229456B (en) Target tracking method and device, electronic equipment and computer storage medium
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
CN111738244A (en) Image detection method, image detection device, computer equipment and storage medium
CN112784778B (en) Method, apparatus, device and medium for generating model and identifying age and sex
CN108734106B (en) Rapid riot and terrorist video identification method based on comparison
CN113971751A (en) Training feature extraction model, and method and device for detecting similar images
CN111626956A (en) Image deblurring method and device
CN113313053A (en) Image processing method, apparatus, device, medium, and program product
CN114898416A (en) Face recognition method and device, electronic equipment and readable storage medium
CN116030390A (en) Intelligent detection method, device, equipment and storage medium for abnormal behavior in video
Vijayan et al. A fully residual convolutional neural network for background subtraction
AU2021203821A1 (en) Methods, devices, apparatuses and storage media of detecting correlated objects involved in images
CN114169425B (en) Training target tracking model and target tracking method and device
CN114943937A (en) Pedestrian re-identification method and device, storage medium and electronic equipment
CN114898266A (en) Training method, image processing method, device, electronic device and storage medium
CN114519863A (en) Human body weight recognition method, human body weight recognition apparatus, computer device, and medium
CN114882334B (en) Method for generating pre-training model, model training method and device
CN115393755A (en) Visual target tracking method, device, equipment and storage medium
CN114943995A (en) Training method of face recognition model, face recognition method and device
CN114842411A (en) Group behavior identification method based on complementary space-time information modeling
CN114387496A (en) Target detection method and electronic equipment
CN113190701A (en) Image retrieval method, device, equipment, storage medium and computer program product
CN113221920B (en) Image recognition method, apparatus, device, storage medium, and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination