CN117528053A - System and method for closed circuit television anomaly detection and storage medium - Google Patents

System and method for closed circuit television anomaly detection and storage medium

Info

Publication number
CN117528053A
CN117528053A CN202310985949.0A CN202310985949A
Authority
CN
China
Prior art keywords
video
encoder
blocks
input video
code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310985949.0A
Other languages
Chinese (zh)
Inventor
瞿树晖
程其森
扬妮克·布莱泽纳
李章焕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Display Co Ltd
Original Assignee
Samsung Display Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US18/074,195 external-priority patent/US20240048724A1/en
Application filed by Samsung Display Co Ltd filed Critical Samsung Display Co Ltd
Publication of CN117528053A publication Critical patent/CN117528053A/en
Pending legal-status Critical Current


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N17/00 Diagnosis, testing or measuring for television systems or their details
    • H04N17/004 Diagnosis, testing or measuring for television systems or their details for digital television systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

Systems and methods for closed circuit television anomaly detection and storage media are disclosed. According to some embodiments, the system comprises: a memory; an encoder; and a decoder, wherein the system is operable to: receiving an input video at an encoder; dividing, by an encoder, an input video into a plurality of video blocks; selecting, by an encoder, codes corresponding to a plurality of video blocks of an input video from a codebook including the codes; determining, by an encoder, an allocated code matrix comprising codes corresponding to a plurality of video blocks of an input video; receiving, by a decoder, the allocated code matrix from the encoder; and generating, by the decoder, a reconstructed video based on the assigned code matrix.

Description

System and method for closed circuit television anomaly detection and storage medium
Cross Reference to Related Applications
The present application claims priority and benefit from U.S. provisional patent application Ser. No. 63/395,782, filed in August 2022, and U.S. patent application Ser. No. 18/074,195, filed in December 2022, the disclosures of which are incorporated herein by reference in their entireties.
Technical Field
The present application relates generally to detecting anomalies, and more particularly to automatic encoder codebook learning for video-based vector quantization for video anomaly detection.
Background
In recent years, manufacturing plants have typically been monitored with hundreds or thousands of Closed Circuit Televisions (CCTVs) for production and infrastructure safety. Human-based CCTV anomaly detection is extremely tedious and time consuming due to the widespread use of surveillance cameras coupled with the growth of video data. Intelligent CCTV driven by AI technology has been introduced in an attempt to reduce manual monitoring by automatic anomaly detection.
Among other things, a first challenge with AI-driven intelligent CCTV is the extremely unbalanced data: data sets contain few (if any) anomalous videos. As such, the ratio between "normal" and "abnormal" samples is extremely unbalanced. Deep learning may require balanced data sets to achieve effective output or performance. Training on unbalanced data will likely reduce the performance of deep learning, and the prediction results will be biased towards normal samples.
A second challenge that arises with AI-driven intelligent CCTV is the low recognition rate of anomalies within the CCTV frames. Minor anomalies within the entire CCTV frame stream remain a detection problem.
As such, there is a need to achieve higher accuracy CCTV anomaly detection using only normal video data and to achieve higher accuracy anomaly detection in the event of a change in the recognition rate of anomalies.
The above information in the background section is only for enhancement of understanding of the background of the technology and is therefore not to be construed as an admission that the prior art exists or is relevant.
Disclosure of Invention
This summary is provided to introduce a selection of features and concepts of embodiments of the disclosure that are further described below in the detailed description. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. One or more of the described features may be combined with one or more other described features to provide a viable apparatus.
Aspects of example embodiments of the present disclosure relate to automatic encoder codebook learning with vector quantization for detecting very small display defects in manufacturing.
(A1) In one or more embodiments, a system for closed circuit television anomaly detection includes: a memory; an encoder connected to the memory; and a decoder connected to the encoder. The system is operable to: receiving an input video at an encoder; dividing, by an encoder, an input video into a plurality of video blocks; selecting, by an encoder, codes corresponding to a plurality of video blocks of an input video from a codebook including the codes; determining, by an encoder, an allocated code matrix comprising codes corresponding to a plurality of video blocks of an input video; receiving, by a decoder, the allocated code matrix from the encoder; and generating, by the decoder, a reconstructed video based on the assigned code matrix.
(A2) The system of (A1), wherein codes corresponding to a plurality of video blocks of the input video are selected from a codebook using a lookup function, wherein the lookup function, the codebook, and an allocated code matrix are stored in a memory.
(A3) The system of (A2), wherein the system is further operable to: extracting, by an encoder, potential features from a plurality of video blocks of an input video; and determining, by the encoder, a latent feature matrix comprising latent features extracted from a plurality of video blocks of the input video.
(A4) The system of (A3), wherein the lookup function is operable to select a closest code as an allocation for each of the potential features in the potential feature matrix to determine an allocated code matrix.
(A5) The system of (A3), wherein the encoder is operable to select codes from the codebook that correspond to video blocks of the plurality of video blocks of the input video by comparing the similarity measure between the potential feature representations of the video blocks in the potential feature matrix and the corresponding codes in the codebook.
(A6) The system of (A5), wherein the similarity metric comprises a euclidean distance or mahalanobis distance between a potential feature representation of the video block in the potential feature matrix and a corresponding code in the codebook.
(A7) The system of (A1), wherein codes among codes in the allocated code matrix are allocated to a video block of a plurality of video blocks of the input video based on vector quantization of potential features corresponding to the video block.
(A8) The system of (A7), wherein the vector quantization loss of the system includes a reconstruction loss that occurs during generation of the reconstructed video and a loss that occurs during vector quantization of potential features corresponding to a plurality of video blocks of the input video.
(A9) The system according to (A8), further comprising: a block-level discriminator network operable to operate as a generative adversarial network to determine an adversarial training loss between the input video and the reconstructed video.
(A10) The system of (A9), wherein the total loss of the system in generating the reconstructed video from the input video includes a vector quantization loss and an adversarial training loss.
(A11) The system of (A1), further operable to: receiving, by an encoder, a test input video; dividing, by an encoder, a test input video into a plurality of test video blocks; extracting, by an encoder, potential features from a plurality of test video blocks of the test input video; encoding, by an encoder, each of the plurality of test video blocks into a potential feature vector based on the extracted potential features; assigning, by the encoder, a code to each of the plurality of test video blocks to determine an assigned code for the plurality of test video blocks; determining, by the encoder, a set of blocks including the allocated code; determining, by the encoder, an anomaly score for each of the assigned codes for the set of blocks; comparing, by the encoder, the anomaly score for each of the assigned codes of the set of blocks to a threshold; and determining, by the encoder, defects in one or more of the plurality of test video blocks based on the result of the comparing.
(A12) The system of (A11), wherein a code from the codebook that has the shortest distance to a potential feature vector of a test video block of the plurality of test video blocks among codes in the codebook is assigned to the test video block.
(A13) The system of (A11), wherein the anomaly score for each of the assigned codes for the set of blocks is determined based on a probability density function.
(B1) A method for closed circuit television anomaly detection comprising: receiving an input video at an encoder; dividing the input video into a plurality of video blocks at an encoder; selecting, by an encoder, codes corresponding to a plurality of video blocks of an input video from a codebook including the codes; determining, by an encoder, an allocated code matrix comprising codes corresponding to a plurality of video blocks of an input video; receiving, by a decoder, the allocated code matrix from the encoder; and generating, by the decoder, a reconstructed video based on the assigned code matrix.
(B2) The method of (B1), wherein a lookup function is used to select codes from the codebook that correspond to a plurality of video blocks of the input video, wherein the method further comprises: extracting, by an encoder, potential features from a plurality of video blocks of an input video; and determining, by the encoder, a latent feature matrix comprising latent features extracted from a plurality of video blocks of the input video.
(B3) The method of (B2), wherein the lookup function is operable to select a closest code as an allocation for each of the potential features in the potential feature matrix to determine an allocated code matrix, and wherein the encoder is operable to select a code from the codebook that corresponds to a video block of the plurality of video blocks of the input video by comparing a similarity measure between the potential feature representation of the video block in the potential feature matrix and the corresponding code in the codebook.
(B4) The method of (B1), wherein codes among codes in the allocated code matrix are allocated to a video block of the plurality of video blocks of the input video based on vector quantization of potential features corresponding to the video block, wherein the vector quantization loss includes a reconstruction loss that occurs during generation of the reconstructed video and a loss that occurs during vector quantization of potential features corresponding to the plurality of video blocks of the input video, and wherein the total loss for generating the reconstructed video from the input video includes the vector quantization loss and the adversarial training loss.
(B5) The method according to (B1), further comprising: extracting, by an encoder, potential features from a plurality of video blocks of an input video; encoding, by an encoder, each of the plurality of video blocks into a potential feature vector based on the extracted potential features; assigning, by an encoder, a code to each of the plurality of video blocks to determine an assigned code for the plurality of video blocks; determining, by the encoder, a set of blocks including the allocated code; determining, by the encoder, an anomaly score for each of the assigned codes for the set of blocks; comparing, by the encoder, the anomaly score for each of the assigned codes of the set of blocks to a threshold; and determining, by the encoder, a defect in one or more of the plurality of video blocks based on a result of the comparing.
(C1) A non-transitory computer-readable storage medium operable to store instructions that, when executed by a processor included in a computing device, cause the computing device to: receiving an input video at an encoder of a computing device; dividing the input video into a plurality of video blocks at an encoder; selecting, by an encoder, codes corresponding to a plurality of video blocks of an input video from a codebook including the codes; determining, by an encoder, an allocated code matrix comprising codes corresponding to a plurality of video blocks of an input video; receiving, by a decoder of the computing device, the allocated code matrix from the encoder; and generating, by the decoder, a reconstructed video based on the assigned code matrix.
(C2) A non-transitory computer-readable storage medium operable to store instructions that, when executed by a processor included in a computing device, cause the computing device to perform the operations of any one of (A2) to (A13).
Drawings
These and other features of some example embodiments of the present disclosure will be appreciated and understood with reference to the specification, claims, and drawings, in which:
FIG. 1 illustrates codebook learning for a video-based block-level vector quantization automatic encoder (VBPVQAE) system, according to some embodiments;
FIG. 2A illustrates a block module of a 3D convolutional encoder skeletal model of the VBPVQAE system of FIG. 1, according to some embodiments;
FIG. 2B illustrates a block module of a 3D convolution discriminator skeletal model of the VBPVQAE system of FIG. 1, according to some embodiments;
FIG. 3 illustrates anomaly detection using block-level codebook learning, according to some embodiments;
FIG. 4 illustrates anomaly detection using learned spatio-temporal dependencies, according to some embodiments; and
FIG. 5 illustrates a method for anomaly detection using a VBPVQAE system in accordance with some embodiments.
Aspects, features, and effects of embodiments of the present disclosure are best understood by referring to the following detailed description. Like reference numerals refer to like elements throughout the drawings and the written description unless otherwise specified, and thus, the description thereof will not be repeated. In the drawings, the relative sizes of elements, layers and regions may be exaggerated for clarity.
Detailed Description
The detailed description set forth below in connection with the appended drawings is intended as a description of some example embodiments of systems and methods for CCTV anomaly detection provided in accordance with the present disclosure and is not intended to represent the only forms in which the present disclosure may be constructed or utilized. The description sets forth the features of the present disclosure in connection with the illustrated embodiments. It is to be understood, however, that the same or equivalent functions and structures may be accomplished by different embodiments that are also intended to be encompassed within the scope of the disclosure. Like element numbers are intended to indicate like elements or features as indicated elsewhere herein.
It will be understood that, although the terms "first," "second," "third," etc. may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first element, component, region, layer or section discussed herein could be termed a second element, component, region, layer or section without departing from the scope of the present disclosure.
Spatially relative terms, such as "beneath," "below," "lower," "above," "upper" and the like, may be used herein for ease of description to describe one element or feature's relationship to another element(s) or feature(s) as illustrated in the figures. It will be understood that such spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if the device in the figures is turned over, elements described as "below," "beneath" or "under" other elements or features would then be oriented "above" the other elements or features. Thus, the example terms "below" and "beneath" may encompass both an orientation of above and below. The device may be oriented in other ways (e.g., rotated 90 degrees or in other orientations) and the spatially relative descriptors used herein interpreted accordingly. In addition, it will also be understood that when a layer is referred to as being "between" two layers, it can be the only layer between the two layers, or one or more intervening layers may also be present.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the terms "substantially," "about," and similar terms are used as approximate terms and not as degree terms, and are intended to account for inherent deviations in measured or calculated values that would be recognized by one of ordinary skill in the art.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. An expression such as "at least one of" modifies the entire list of elements when following the list of elements, without modifying individual elements in the list. Furthermore, the use of "may" when describing embodiments of the present disclosure refers to "one or more embodiments of the present disclosure." Also, the term "exemplary" is intended to refer to an example or illustration. As used herein, the term "use" and variants thereof may be considered synonymous with the term "utilize" and variants thereof, respectively.
It will be understood that when an element or layer is referred to as being "on," "connected to," or "coupled to" another element or layer, it can be directly on, connected to, coupled to, or directly adjacent to the other element or layer, or one or more intervening elements or layers may be present. In contrast, when an element or layer is referred to as being "directly on," directly connected to, "directly coupled to," or "immediately adjacent to" another element or layer, there are no intervening elements or layers present.
Any numerical range recited herein is intended to include all sub-ranges of the same numerical precision subsumed within the recited range. For example, a range of "1.0 to 10.0" is intended to include all subranges between (and including) the recited minimum value of 1.0 and the recited maximum value of 10.0 (i.e., having a minimum value equal to or greater than 1.0 and a maximum value equal to or less than 10.0), such as, for example, 2.4 to 7.6. Any maximum numerical limitation recited herein is intended to include all smaller numerical limitations subsumed therein, and any minimum numerical limitation recited in the present specification is intended to include all larger numerical limitations subsumed therein.
In some embodiments, one or more outputs of the different embodiments of the methods and systems of the present disclosure may be sent to an electronic device coupled to or having a display device for displaying the one or more outputs of the different embodiments of the methods and systems of the present disclosure or information regarding the one or more outputs.
Any suitable hardware, firmware (e.g., application specific integrated circuits), software, or a combination of software, firmware, and hardware may be used to implement the electronic or electrical devices and/or any other related devices or components described herein in accordance with embodiments of the present disclosure. For example, the various components of these devices may be formed on one Integrated Circuit (IC) chip or on separate IC chips. Furthermore, the various components of these devices may be implemented on flexible printed circuit films, tape Carrier Packages (TCP), printed Circuit Boards (PCB) formed on a substrate or other suitable architecture. Furthermore, the various components of these devices may be processes or threads executing computer program instructions and interacting with other system components in one or more computing devices on one or more processors to perform the various functions described herein. The computer program instructions are stored in a memory such as Random Access Memory (RAM) that may be implemented in a computing device using standard memory means. The computer program instructions may also be stored in other non-transitory computer readable media such as, for example, a CD-ROM or flash drive. Moreover, those skilled in the art will appreciate that the functionality of various computing devices may be combined or integrated into a single computing device, or that the functionality of a particular computing device may be distributed across one or more other computing devices without departing from the spirit and scope of the exemplary embodiments of the present disclosure.
As previously discussed, there is a need to achieve higher accuracy CCTV anomaly detection using only normal video data. In addition, as previously discussed, there is also a need to achieve higher accuracy anomaly detection in the event of a change in the recognition rate of anomalies. As described herein, this may be achieved by: (i) designing and constructing a block-level anomaly detection model (as shown in fig. 1) that learns patterns of normal scenes to improve identification of minute anomalies in local blocks, together with the architecture of the learning model (as shown in fig. 2A and 2B); (ii) performing a detailed check on each small block to identify whether an anomaly is present, as shown in fig. 3; and (iii) performing a time dependency analysis to find dependencies between blocks across different time frames, as shown in fig. 4.
Table 1 depicts the performance on an existing dataset. "AUROC" is the area under the receiver operating characteristic curve. The VBPVQAE system has the highest performance due to the new approach presented herein. In particular, the VBPVQAE system is able to identify local anomalies as well as long-duration (e.g., time-dependent) anomalies. However, other methods, including OCNN, OCELM, and OCSVM, cannot take such anomalous patterns into account. OCELM refers to a single-class extreme learning machine, OCNN refers to a single-class neural network, and OCSVM refers to a single-class support vector machine.
AUROC   | OCELM | OCNN  | OCSVM | VBPVQAE
UCF-101 | 0.947 | 0.934 | 0.905 | 0.983
TABLE 1
It should be noted that "anomaly" and "defect" may be used interchangeably herein. Target anomalies include, but are not limited to, fires, natural disasters, smoke scenes, and persistent anomalies such as waste, dangerous objects in field scenes. Activity anomalies include, but are not limited to, improper behavior of humans. In some embodiments, this can be achieved by designing and constructing a model for achieving high defect detection accuracy using only normal data samples. In other embodiments, this can be achieved by minimizing the impact of unbalanced distribution of normal and defect data on the deep learning model.
One or more embodiments of the present disclosure may provide a solution to challenges in defect detection using machine learning through a video-based block-level vector quantization automatic encoder (VBPVQAE) system that is trained on video. In one or more embodiments, the VBPVQAE system may mitigate the above-mentioned challenges of defect detection. The VBPVQAE system divides the video frames into a sequence of video blocks across time and learns a codebook from the video blocks. The codes in the codebook represent the characteristic patterns of the video blocks. In this way, potential features from the video blocks may be extracted. The codebook is learned by vector quantization techniques. A code can then be assigned to each learned potential feature by selecting the code with the smallest distance to the potential feature of the video block. The VBPVQAE system is capable of achieving state-of-the-art performance in defect detection.
Following the single-class line of methods, in one or more embodiments, the VBPVQAE system of the present disclosure may be trained without any abnormal data. In one or more embodiments, the VBPVQAE system may use vector quantization. Vector quantization is a technique that enables learning of expressive representations stored in a discrete codebook. In one or more embodiments, a context-rich codebook may contribute to state-of-the-art performance in many visual tasks, including video generation and composition. Moreover, the codebook can be easily integrated into a generative model as an end-to-end framework. By introducing this technique into a single-class auto-encoder architecture, in one or more embodiments, the VBPVQAE system learns the unique normal representation in an end-to-end fashion. In addition, the high expressiveness of the learned codebook may allow the VBPVQAE system to train against multiple targets that incorporate a vast range of visual patterns. As such, the VBPVQAE system of the present disclosure may eliminate the need for additional tuning and data-set-specific tradeoffs, and may alleviate the burden of training multiple models.
In one or more embodiments, with a Convolutional Neural Network (CNN) as the encoder, each codebook entry may be directed to a local visual pattern in the video. Unlike natural video, video captured in a manufacturing setting typically involves complex but similar patterns distributed across video components. Based on this observation, preservation of global information may be required in codebook learning. As in vision transformers and their variants, in one or more embodiments a multi-headed attention (MHA) mechanism may have excellent capability in capturing relationships between remote (long-range) contexts. In one or more embodiments, MHA and CNN may be combined into an encoder skeleton to enhance both local and global information expression.
Based on the learned codebook, the single-class detection framework or VBPVQAE system of the present disclosure can calculate an anomaly score for each video block by generating a posterior probability for the retrieved code, and identify a defect by the value of the anomaly score.
In one or more embodiments, the VBPVQAE system of the present disclosure is composed of two parts: (1) a block vector quantization automatic encoder that learns a codebook incorporating the representation of normal data; and (2) a defect detector that identifies defects based on the learned codebook. The VBPVQAE system of the present disclosure quantizes potential features into indices of the codebook. For example, the VBPVQAE system of the present disclosure may train an encoder network for building a discrete embedded table and a decoder network that reconstructs each video block from the indices in the embedded table (see, e.g., fig. 1). Both the encoder and the decoder may combine Convolutional Neural Networks (CNNs) and multi-headed attention blocks (e.g., transformers). At test time, the encoder-decoder pair may generate an index matrix for all video blocks in the test video using the learned codebook. By looking at the joint probabilities of the codes corresponding to the indices, the defect detector calculates a score matrix that can be used for block-level defect identification.
As such, one or more embodiments of the present disclosure provide a VBPVQAE system that addresses the single-class defect detection problem in manufacturing. The VBPVQAE system of the present disclosure can be trained strictly on normal data and in the event of changes in the recognition rate of anomalies, thus greatly reducing the need for defective samples in training. The principle of the VBPVQAE system is based on learning and matching the normal representation in the learned codebook. By means of vector quantization techniques, end-to-end codebook learning can be achieved. The VBPVQAE system of the present disclosure may eliminate complex tuning or trade-offs specific to different data sets. In one or more embodiments, representation learning of global context may also be enhanced by combining a CNN encoder and a multi-headed self-attention mechanism. This approach may lead to successful encoding of a wide range of visual patterns. In one or more embodiments, the high expressiveness of the codebook may facilitate representation learning for different objectives using a single model. Additionally, in one or more embodiments, the context-rich codebook may be further beneficial for other downstream tasks such as video reconstruction or composition of normal video.
FIG. 1 illustrates codebook learning for the VBPVQAE system of the present disclosure.
The VBPVQAE system 100 quantizes the potential features into codebook entries (i.e., potential feature matrices) using a pair of encoder (E) network and decoder (G) network. For example, the VBPVQAE system 100 uses vector quantization to learn a codebook for the entire data set, so the data set can be represented by the codebook. In one or more embodiments, the code represents a characteristic of a video block of the display panel.
For example, during training, encoder (E) 120 reads in video (e.g., video (x) 110 (also referred to as normal video (x), input video (x), original video (x), or original input video (x))) and extracts and stores visual patterns into codebook (Q) 140 (also referred to as potential embedded codebook (Q)), while decoder (G) 160 aims to select an appropriate code index for each video block and attempts to reconstruct the input video based on the selected code.
In one or more embodiments, the potential embedded codebook (Q) 140 can be represented as Q ∈ R^{K×n_z}, where K is the number of discrete codes in the codebook and n_z is the dimension of the potential embedded vectors q_k ∈ R^{n_z}, k ∈ {1, 2, …, K}.
For example, during training of the VBPVQAE system 100, as shown in FIG. 1, the encoder (E) 120 receives a video (x) 110, where x ∈ R^{H×W×T×3}, and outputs a latent feature matrix (z) 130. Here, H is the height of the video (x) 110, W is the width of the video (x) 110, T is the time (number of frames) of the video (x) 110, and the numeral "3" represents the number of channels (e.g., red, green, blue) of the video (x) 110.
For example, in the method of fig. 1, when the input video (x) 110 is received at the encoder (E) 120, the input video (x) 110 may be divided into a plurality of video blocks. In one or more embodiments, potential features from each of the plurality of video blocks may be extracted to determine a potential feature matrix (z) 130, where z ∈ R^{h×w×n_z} and z = E(x). Here, h is the height of the latent feature matrix (z) 130, w is the width of the latent feature matrix (z) 130, and n_z represents the hidden vector size of the latent feature matrix (z) 130.
For example, after the potential feature matrix (z) 130 is determined by the encoder (E) 120, the codebook (Q) 140 may be learned using vector quantization to assign codes from the codebook (Q) 140 to each of a plurality of potential features of the potential feature matrix (z) 130. Here, each of the potential features in the potential feature matrix (z) 130 corresponds to one of the plurality of video blocks from the input video (x) 110. In some embodiments, the video block is a sub-tensor of the video tensor. For example, the video tensor may be defined as V ∈ R^{H×W×T×C}, where H is the height, W is the width, T is the time, and C is the channel. A video block v_{h:h+δh, w:w+δw, t:t+δt, c} is a part of the video tensor V, where δh, δw, and δt refer to the corresponding step sizes.
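By way of illustration only, the following Python/PyTorch sketch shows one possible way of dividing a video tensor V ∈ R^{H×W×T×C} into video blocks (sub-tensors); the block sizes δh, δw, δt, the tensor layout, and the function name divide_into_blocks are assumptions chosen for this example and are not part of the claimed embodiments.

```python
# Illustrative sketch (assumed shapes and block sizes): split a video tensor
# V of shape (H, W, T, C) into sub-tensors v[h:h+dh, w:w+dw, t:t+dt, :].
import torch

def divide_into_blocks(video, delta_h=32, delta_w=32, delta_t=4):
    H, W, T, C = video.shape
    blocks = []
    for t in range(0, T - delta_t + 1, delta_t):
        for h in range(0, H - delta_h + 1, delta_h):
            for w in range(0, W - delta_w + 1, delta_w):
                blocks.append(video[h:h + delta_h, w:w + delta_w, t:t + delta_t, :])
    return torch.stack(blocks)  # (num_blocks, delta_h, delta_w, delta_t, C)

video = torch.rand(128, 128, 16, 3)   # toy H x W x T x 3 input
blocks = divide_into_blocks(video)
print(blocks.shape)                    # torch.Size([64, 32, 32, 4, 3])
```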
In one or more embodiments, the VBPVQAE system 100 may further comprise a memory and a processor, and the latent feature matrix (z) 130 may be determined by the processor based on the latent features extracted from the plurality of video blocks, and in one or more embodiments, the latent feature matrix (z) 130 may be stored in the memory of the VBPVQAE system 100. The encoder (E) 120 may be connected to a memory. The decoder (G) 160 may be connected to the encoder (E) 120.
For example, after the potential feature matrix (z) 130 is determined by the encoder (E) 120, in the forward pass the code for each video block is selected (or assigned) from the codebook (Q) 140 using the lookup function Q(·) of the encoder (E) 120 or processor of the VBPVQAE system 100. In one or more embodiments, the lookup function Q(·) and the codebook (Q) 140 can be stored in the memory of the VBPVQAE system 100. By comparing a similarity metric (e.g., Euclidean distance, Mahalanobis distance, etc.) between the potential feature representation of each input video block in the potential feature matrix (z) 130 and each code in the codebook (Q) 140, the lookup function Q(·) selects the nearest code as the assignment for each potential feature in the potential feature matrix (z) 130 to generate the assigned code matrix (q) 150. For example, in one or more embodiments, the lookup function Q(·) selects the nearest code for each potential feature in the potential feature matrix (z) 130 by comparing the Euclidean distance between the potential feature representation of the input video block in the potential feature matrix (z) 130 and each code in the codebook (Q) 140. In one or more embodiments, the assigned code matrix (q) 150 can be generated by the encoder (E) 120 or processor of the VBPVQAE system 100. The assigned code matrix (q) 150 may be stored in the memory of the VBPVQAE system 100.
For example, the allocation in the allocated code matrix (q) 150 may be expressed as:

q_{i,j} = Q(z_{i,j}) = argmin_{q_k ∈ Q} ||z_{i,j} − q_k||   (1)

For example, q_{i,j} can represent an element of the assigned code matrix (q) 150.
For example, in one or more embodiments, a code may be assigned to each input video block based on vector quantization of the potential features corresponding to the input video block. For example, in the codebook learning of the VBPVQAE system 100, the concept of vector quantization may be used in determining the assigned code matrix (q) 150 by selecting a code from the codebook (Q) 140 with the lookup function Q(·).
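By way of illustration only, the following sketch shows one possible form of the lookup function Q(·) that assigns each potential feature z_{i,j} the nearest code from the codebook by Euclidean distance; the shapes (h, w, n_z, K) and the helper name lookup are assumptions for this example.

```python
# Illustrative sketch (assumed shapes): nearest-code assignment for a latent
# feature matrix z of shape (h, w, n_z) against a codebook of shape (K, n_z).
import torch

def lookup(z, codebook):
    h, w, n_z = z.shape
    flat = z.reshape(-1, n_z)                # (h*w, n_z)
    dist = torch.cdist(flat, codebook)       # Euclidean distances, (h*w, K)
    idx = dist.argmin(dim=1)                 # nearest code per latent feature
    q = codebook[idx].reshape(h, w, n_z)     # assigned (quantized) code matrix
    return q, idx.reshape(h, w)

z = torch.randn(8, 8, 64)          # latent feature matrix from the encoder
codebook = torch.randn(512, 64)    # K = 512 codes of dimension n_z = 64
q, indices = lookup(z, codebook)
```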
The assigned code matrix (q) 150 is then passed to a decoder (G) 160 to generate a reconstructed video or reconstructed input video x̂ 170 from the elements of the assigned code matrix (q) 150. In one or more embodiments, the assigned code matrix (q) 150 has the same shape as the latent feature matrix (z) 130. For example, the assigned code matrix (q) 150 may be a quantized version of the latent feature matrix (z) 130. The assigned code matrix (q) 150 is then passed to the decoder (G) 160, which is an upsampling network that accepts the h×w×n_z feature matrix and outputs H×W×3 video.
In one or more embodiments, the reconstructed input video x̂ 170 may be represented as:

x̂ = G(q) = G(Q(E(x)))   (2)

The reconstructed video x̂ 170 and the corresponding input video (x) 110 are used as a guide for training the VBPVQAE system 100.
In one or more embodiments, the argmin operator is not differentiable during back propagation. The gradient of this step can be approximated by using a straight-through estimator (STE) and passing the gradient from q(z) directly through to z, so that the reconstruction loss can be combined with the loss for neural discrete representation learning. As such, the vector quantization loss function can be expressed as:

L_VQ(E, G, Z) = ||x − x̂||^2 + ||sg[E(x)] − q||_2^2 + β||sg[q] − E(x)||_2^2   (3)

In equation (3), the first term ||x − x̂||^2 represents the reconstruction loss (e.g., the loss during generation of the reconstructed video x̂). Furthermore, in equation (3), the second and third terms ||sg[E(x)] − q||_2^2 and β||sg[q] − E(x)||_2^2 represent the vector quantization loss.
Furthermore, in equation (3), the function sg[·] (e.g., sg[E(x)]) represents a stop-gradient operator that has zero partial derivative, q is the code embedding, and β is a hyper-parameter.
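By way of illustration only, a minimal sketch of the vector quantization loss of equation (3) together with a straight-through estimator is given below; the value of β and the tensor arguments are assumed examples and do not represent the exact training code of the embodiments.

```python
# Illustrative sketch of the VQ loss in equation (3) and the straight-through
# estimator (STE); beta and all tensors are assumed examples.
import torch
import torch.nn.functional as F

def vq_loss(x, x_hat, z_e, q, beta=0.25):
    """x, x_hat: input and reconstructed video; z_e = E(x): encoder output;
    q: assigned (quantized) codes with the same shape as z_e."""
    recon = F.mse_loss(x_hat, x)                 # ||x - x_hat||^2
    codebook_term = F.mse_loss(q, z_e.detach())  # ||sg[E(x)] - q||^2
    commit_term = F.mse_loss(z_e, q.detach())    # weighted by beta below
    return recon + codebook_term + beta * commit_term

def straight_through(z_e, q):
    # Gradients flow from q back to z_e unchanged during back propagation.
    return z_e + (q - z_e).detach()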
In one or more embodiments, adversarial training may be incorporated in codebook learning to enhance the expressiveness of the learned embeddings in the potential feature space. For example, in one or more embodiments, as shown in fig. 1, a block-level discriminator network (D) 180 may be added to the encoder-decoder framework (e.g., including the encoder (E) 120 and the decoder (G) 160), as in a generative adversarial network (GAN). An adversarial training loss L_GAN may also be added to guide the VBPVQAE system 100 using the difference between the original video (x) 110 and the reconstructed video x̂ 170. In one or more embodiments, the block-level discriminator network (D) 180 can determine the adversarial training loss L_GAN and the total loss L of the VBPVQAE system 100.
The adversarial training loss L_GAN between the original input video (x) 110 and the reconstructed video x̂ 170 can be expressed as:

L_GAN((E, G, Q), D) = log D(x) + log(1 − D(x̂))   (4)

Overall, the total loss for training the VBPVQAE system 100 is:

L = argmin_{E,G,Q} max_D [ L_VQ(E, G, Z) + λ·L_GAN((E, G, Q), D) ]   (5)

In one or more embodiments, equation (5) or the term "L" can represent the total loss of the VBPVQAE system 100 during training. For example, the term "L" in equation (5) essentially merges the vector quantization loss L_VQ(E, G, Z) between the original video (x) 110 and the reconstructed video x̂ 170 and the GAN loss or adversarial training loss L_GAN((E, G, Q), D). In equation (5), λ is a hyper-parameter that can be adaptively tuned.
In one or more embodiments, real world video from a production line may incorporate complex visual patterns that include subtle defect features. To learn an efficient codebook from such complex data sets, it may be desirable to employ a model that can concurrently (e.g., simultaneously) learn local features and understand the global composition of these local features. By encoding the information for both views, the product video can be represented by a series of locally vivid and globally coherent perceptual codes. To achieve this, above the convolutional layer, a multi-headed self-attention layer may be employed to learn cross-correlation dependencies between elements within the sequence (i.e., words for linguistic tasks or video blocks for visual tasks). Thus, the VBPVQAE system 100 is able to learn codebooks with richer perceptions of complex industrial products.
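By way of illustration only, the following sketch shows one possible encoder stage combining a 3D convolution with a multi-headed self-attention layer over flattened spatio-temporal tokens, in the spirit of the skeleton described above; the channel count, number of heads, and class name are assumptions for this example.

```python
# Illustrative sketch (assumed architecture details): a 3D-conv stage followed
# by multi-headed self-attention over spatio-temporal tokens.
import torch
import torch.nn as nn

class ConvAttentionBlock(nn.Module):
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                      # x: (B, C, T, H, W)
        x = torch.relu(self.conv(x))           # local features via 3D conv
        b, c, t, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, T*H*W, C) token sequence
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)  # residual connection + norm
        return tokens.transpose(1, 2).reshape(b, c, t, h, w)

block = ConvAttentionBlock()
out = block(torch.randn(1, 64, 4, 16, 16))     # global context over all tokens
```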
FIG. 2A is a block module representation of an example 3D convolutional encoder skeletal model for the VBPVQAE system of FIG. 1, and FIG. 2B is a block module representation of an example 3D convolutional discriminator skeletal model for the VBPVQAE system of FIG. 1, according to some embodiments. It is noted that the model used herein can also be interpreted as a system architecture.
The input tensor 210 represents an input to the block module. The sub-tensor 212 is a sub-tensor of the input tensor 210 and represents a partial representation of the input tensor 210. The convolution kernel 214 convolves the sub-tensor 212.
The output 216 is the output of convolving the sub-tensor 212 with the kernel 214. The decoder 222 is a skeleton for reconstructing the original video. The decoder 222 reconstructs the original video from the quantized compressed representation.
The reconstructed video 218 is the output of the decoder 222. As explained in fig. 1, the reconstructed video 218 reconstructs the original input video 220 (also referred to as the video or original video). The upsampling blocks 219a through 219e may include bilinear sampling, interpolation, and the like to upsample from a lower resolution and a small number of time frames to a higher resolution and more time frames.
Residual blocks 221a to 221j are residual network 3D blocks that constitute residual connections of the block modules. The residual blocks 221a to 221j are used to convolve the input tensor. The attention block 226 is a multi-headed attention block. It is used to reassign tensor weights along both the spatial and temporal dimensions according to the importance of the learned tensor. Video 220 is an input video that provides information/patterns that require neural network learning in some embodiments. Downsampling blocks 228a through 228e of video 220 compress the dimensions of the tensors (e.g., through a pooling layer including max-pooling, average pooling, etc.).
Residual blocks 230a through 230j are also residual network 3D blocks. The residual blocks 230a to 230j convolve the input tensors. Attention block 232 is a multi-headed attention block for reassigning tensor weights along the spatial and temporal dimensions. Reference numeral 224 denotes an encoder.
Fig. 2B is a block module for a 3D convolution discriminator skeletal model that helps identify whether the video is from a reconstructed video 218 or an original video 220 as shown in fig. 2A, thereby improving encoder and decoder training.
Video 250 may be original video 220 or vector quantized reconstructed video 218.
Residual blocks 251a through 251j are also residual network 3D blocks. In some embodiments, residual blocks 251a through 251j convolve the input tensor. The pooling 253a to 253e is a downsampling layer. In some embodiments, pooling 253a through 253e downsamples the input tensor, which may include maximum pooling, average pooling, and so on. The attention block 252 is a multi-headed attention block for reassigning tensor weights along the spatial and temporal dimensions.
Frame attention 254 is the weight assigned to each frame (i.e., the contribution of each frame to the final prediction). The frame probability 256 is the probability, for each frame, that the input tensor is reconstructed or original. The output (T/F, true/false) 258 is the output of the network that predicts whether the input video is reconstructed or original.
In one or more embodiments, the VBPVQAE system 100 may also determine defects. For example, fig. 3 illustrates defect detection using video-based block-level codebook learning. For example, referring back to FIG. 1, given an input video (x) 110, during the test phase, the trained VBPVQAE system 100 may be configured to identify defects by assigning a code z_q to each video block indexed by i, j (e.g., 310(1), 310(2), 310(3), …, 310(n)) and estimating an anomaly score s_{i,j} for that video block.
As in the training phase, in the embodiment of fig. 3, the input video (x) 110 may be divided into a plurality of video blocks during the testing phase. Next, potential features may be extracted from the plurality of video blocks of the input video (x) 110 at test time, and each video block of the plurality of video blocks may be encoded by the encoder (E) 120 into a potential feature vector ẑ_{i,j} (written ẑ for short).
In fig. 3, elements 310(1), 310(2), 310(3), …, 310(n) represent the potential feature vectors ẑ_{i,j} of the video blocks of the input video (x) 110 at the test stage. Then, the code from the codebook (Q) 320 that has the shortest distance to the potential feature vector ẑ_{i,j} among all codes in the codebook (Q) 320 (e.g., based on a similarity measure between the potential feature vector ẑ_{i,j} of the input video block and the k-th code from the codebook (Q) 320) is allocated to the video block (e.g., 310(1), 310(2), 310(3), …, 310(n)) by the encoder (E) 120. In one or more embodiments, the k-th code selected from the codebook (Q) 320 may be represented as:

q_k(ẑ_{i,j}) = argmin_{q ∈ Q} ||ẑ_{i,j} − q||
A set of blocks 330 containing the extracted potential feature vectors ẑ_{i,j} of all video blocks and the corresponding assigned codes is determined by the encoder (E) 120.
In one or more embodiments, the set of blocks 330 may include an assigned code corresponding to each of the plurality of video blocks (e.g., 310(1), 310(2), 310(3), …, 310(n)).
For each entry in the set of blocks 330, an anomaly score s_{i,j} ∈ R may be determined using a probability density function or the weighted distance to the k nearest neighbors in the set of blocks 330. For example, in FIG. 3, the anomaly score matrix (s) 340 incorporates the anomaly score s_{i,j} for each entry in the block set 330.
In one or more embodiments, the block-level anomaly scores s_{i,j} can be reorganized into the anomaly score matrix (s) 340 according to the block indices. In one or more embodiments, the anomaly scores s_{i,j} may be calculated by the encoder (E) 120.
In one or more embodiments, during defect detection, if the anomaly score s_{i,j} of any input sample is greater than a given threshold at index k, the input sample is determined to be defective.
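By way of illustration only, the following sketch shows one possible block-level scoring procedure at test time: each test block's potential feature vector is assigned its nearest code, scored by a weighted distance to its k nearest neighbors in the block set, and compared to a threshold. The value of k, the threshold, and all shapes are assumed values, not the exact scoring of the embodiments.

```python
# Illustrative sketch (assumed k, threshold, and shapes): block-level anomaly
# scoring via nearest-code assignment and weighted k-nearest-neighbor distance.
import torch

def anomaly_scores(z_hat, codebook, block_set, k=5):
    """z_hat: (N, n_z) latent vectors of test blocks; codebook: (K, n_z);
    block_set: (M, n_z) latent vectors collected from normal training blocks."""
    assigned = codebook[torch.cdist(z_hat, codebook).argmin(dim=1)]  # nearest codes
    dists = torch.cdist(assigned, block_set)                         # (N, M)
    knn = dists.topk(k, largest=False).values                        # k nearest neighbors
    weights = torch.softmax(-knn, dim=1)                             # closer neighbors weigh more
    return (weights * knn).sum(dim=1)                                # weighted distance per block

scores = anomaly_scores(torch.randn(16, 64), torch.randn(512, 64), torch.randn(1000, 64))
defective = scores > 0.5   # threshold value is an assumed example
```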
FIG. 4 illustrates defect detection using learned spatio-temporal dependencies. Discrete potential features 420a through 420f represent, through the learned codes, the feature representations of the local blocks of all frames along the spatial and temporal dimensions. The transformer 418 learns the spatial and temporal correlations between these codes from normal video, and accepts sequences of potential codes to learn the correlations through self-supervised learning tasks such as predicting a masked code and predicting the next sequence. The probability of an output target represents the degree of normality of the input sequence. The output sequence 416 is the output sequence from the transformer 418. The output sequence 416 shows the quality of the sequence reconstruction. The joint probability of a sequence shows how normal the input sequence is, so as to identify time-dependent anomalies and anomalies with a large recognition rate. In some embodiments, the transformer 418 generates a code-level probability for each code in the sequence. By calculating the joint probability (product of code probabilities) of the code sequence, the normality level of the code sequence can be identified, so as to identify time-dependent anomalies as well as anomalies with a large recognition rate. A normal code sequence is expected to obtain a high probability, while an abnormal code sequence is expected to obtain a low probability. The images 412a to 412c are used to reconstruct the original video from the sequence so that it can be used for manual inspection.
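By way of illustration only, the following sketch shows one way a per-code probability output could be turned into the joint probability (product of code probabilities) used to judge the normality of a code sequence; the sequence length, codebook size, and threshold are assumed values.

```python
# Illustrative sketch (assumed sizes and threshold): scoring a code sequence by
# the sum of per-code log-probabilities, i.e., the log of the joint probability.
import torch

def sequence_normality(code_logits, code_indices):
    """code_logits: (L, K) transformer outputs for a sequence of L codes;
    code_indices: (L,) the codes actually assigned to the sequence."""
    log_probs = torch.log_softmax(code_logits, dim=-1)
    per_code = log_probs[torch.arange(code_indices.numel()), code_indices]
    return per_code.sum()   # log of the joint probability of the sequence

log_p = sequence_normality(torch.randn(20, 512), torch.randint(0, 512, (20,)))
is_anomalous = log_p < -200.0   # low joint probability => time-dependent anomaly
```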
FIG. 5 illustrates a method for defect detection using the VBPVQAE system 100.
For example, at step 501, an input video may be received at an encoder, and at step 502, the input video may be divided into a plurality of video blocks. For example, as discussed with respect to fig. 1, when the input video (x) 110 is received at the encoder (E) 120, the input video (x) 110 may be divided into a plurality of video blocks.
Potential features from a plurality of video blocks of the input video may be extracted by an encoder, and a potential feature matrix including potential features extracted from the plurality of video blocks of the input video may be determined by the encoder. For example, as discussed with respect to fig. 1, potential features from each of the plurality of video blocks may be extracted to determine a potential feature matrix (z) 130, where z E R (h x w x nz) and z=e (x).
For example, at step 505, codes corresponding to the plurality of video blocks of the input video may be selected by the encoder from a codebook including the codes. For example, as discussed with respect to fig. 1, after the potential feature matrix (z) 130 is determined by the encoder (E) 120, a lookup function Q(·) is used by the encoder (E) 120 to select (or allocate) a code for each video block from the codebook (Q) 140.
At step 506, an allocated code matrix including codes corresponding to the plurality of video blocks of the input video may be generated by the encoder. For example, as discussed with respect to fig. 1, by comparing a similarity measure (e.g., Euclidean distance, Mahalanobis distance, etc.) between the potential feature representation of each input video block in the potential feature matrix (z) 130 and each code in the codebook (Q) 140, the lookup function Q(·) selects the nearest code as the assignment for each potential feature in the potential feature matrix (z) 130 to generate the assigned code matrix (q) 150.
At step 507, the assigned code matrix may be received by the decoder from the encoder, and at step 508, the decoder generates a reconstructed video based on the assigned code matrix. For example, as discussed with respect to fig. 1, the assigned code matrix (q) 150 is passed to the decoder (G) 160 to generate a reconstructed video or reconstructed input video x̂ 170 from the elements of the assigned code matrix (q) 150.
The test input video may be received at an encoder and divided into a plurality of video blocks. For example, as discussed with respect to fig. 3, the input video (x) 110 at test time received at the encoder may be divided into a plurality of video blocks.
Potential features from the plurality of video blocks of the test input video may be extracted by the encoder, and each of the plurality of video blocks may be encoded into a potential feature vector based on the extracted potential features. For example, as discussed with respect to fig. 3, after dividing the input video (x) 110 at test time into a plurality of video blocks, the video blocks may be encoded by the encoder (E) 120 into potential feature vectors ẑ_{i,j}.
Next, codes may be assigned to each of the plurality of video blocks to determine assigned codes for the plurality of video blocks. For example, as discussed with respect to fig. 3, the code from the codebook (Q) 320 that has the shortest distance to the potential feature vector ẑ_{i,j} among all codes in the codebook (Q) 320 (e.g., based on a similarity measure between the potential feature vector ẑ_{i,j} of the input video block and the k-th code from the codebook (Q) 320) is assigned by the encoder (E) 120 to the video block (e.g., 310(1), 310(2), 310(3), …, 310(n)).
The set of blocks including the allocated codes may be determined by the encoder. For example, as discussed with respect to FIG. 3, a set of blocks 330 containing the extracted potential feature vectors ẑ_{i,j} of all video blocks and the corresponding codes is determined by the encoder (E) 120, wherein the set of blocks 330 may include an assigned code corresponding to each of the plurality of video blocks (e.g., 310(1), 310(2), 310(3), …, 310(n)).
The anomaly score for each of the assigned codes for the set of blocks may be determined by the encoder. For example, as discussed with respect to FIG. 3, for each entry in the block set 330, an anomaly score s_{i,j} ∈ R may be determined using a probability density function or the weighted distance to the k nearest neighbors in the set of blocks 330.
The anomaly score for each of the assigned codes for the set of blocks may be compared to a threshold by the encoder, and defects in one or more of the plurality of video blocks may be determined by the encoder based on the results of the comparison. For example, as discussed with respect to FIG. 3, during defect detection, if any anomaly score s_{i,j} is above a given threshold at index k, the input sample is determined to be defective.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and/or the present specification and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The embodiments described herein are examples only. Those skilled in the art will recognize various alternative embodiments from the specifically disclosed embodiments. Those alternative embodiments are also intended to be within the scope of this disclosure. As such, the embodiments are limited only by the claims and their equivalents.

Claims (19)

1. A system for closed circuit television anomaly detection, comprising:
a memory;
an encoder connected to the memory; and
a decoder, connected to the encoder,
wherein the system is operable to:
receiving an input video at the encoder;
dividing, by the encoder, the input video into a plurality of video blocks;
selecting, by the encoder, a code corresponding to the plurality of video blocks of the input video from a codebook comprising the code;
determining, by the encoder, an allocated code matrix comprising the codes corresponding to the plurality of video blocks of the input video;
receiving, by the decoder, the allocated code matrix from the encoder; and
a reconstructed video is generated by the decoder based on the assigned code matrix.
2. The system of claim 1, wherein the codes corresponding to the plurality of video blocks of the input video are selected from the codebook using a lookup function, wherein the lookup function, the codebook, and the allocated code matrix are stored in the memory.
3. The system of claim 2, wherein the system is further operable to:
extracting, by the encoder, potential features from the plurality of video blocks of the input video; and
a potential feature matrix including the potential features extracted from the plurality of video blocks of the input video is determined by the encoder.
4. A system according to claim 3, wherein the look-up function is operable to select the closest code as the allocation for each of the potential features in the potential feature matrix to determine the allocated code matrix.
5. A system according to claim 3, wherein the encoder is operable to select codes from the codebook that correspond to the video block of the plurality of video blocks of the input video by comparing similarity measures between potential feature representations of the video blocks in the potential feature matrix and corresponding codes in the codebook.
6. The system of claim 5, wherein the similarity metric comprises a Euclidean distance or Mahalanobis distance between the potential feature representations of the video blocks in the potential feature matrix and the corresponding codes in the codebook.
7. The system of claim 1, wherein codes among the codes in the assigned code matrix are assigned to a video block of the plurality of video blocks of the input video based on vector quantization of potential features corresponding to the video block.
8. The system of claim 7, wherein the vector quantization loss of the system comprises a reconstruction loss that occurs during generation of the reconstructed video and a loss that occurs during the vector quantization of potential features corresponding to the plurality of video blocks of the input video.
9. The system of claim 8, further comprising:
a block-level discriminator network operable to operate as a generative adversarial network to determine an adversarial training loss between the input video and the reconstructed video.
10. The system of claim 9, wherein a total loss of the system in generating the reconstructed video from the input video includes the vector quantization loss and the adversarial training loss.
11. The system of any of claims 1 to 10, further operable to:
receiving, by the encoder, a test input video;
dividing, by the encoder, the test input video into a plurality of test video blocks;
extracting, by the encoder, potential features from the plurality of test video blocks of the test input video;
encoding, by the encoder, each of the plurality of test video blocks into a potential feature vector based on the extracted potential features;
allocating, by the encoder, a code to each of the plurality of test video blocks to determine an allocation of the plurality of test video blocks;
determining, by the encoder, a set of blocks comprising the allocated codes;
determining, by the encoder, an anomaly score for each of the allocated codes of the set of blocks;
comparing, by the encoder, the anomaly score for each of the allocated codes of the set of blocks to a threshold; and
determining, by the encoder, a defect in one or more of the plurality of test video blocks based on a result of the comparison.
12. The system of claim 11, wherein a code among the codes in the codebook having the shortest distance to the potential feature vector of a test video block of the plurality of test video blocks is allocated to the test video block.
13. The system of claim 11, wherein the anomaly score for each of the allocated codes of the set of blocks is determined based on a probability density function.
14. A method for closed circuit television anomaly detection, comprising:
receiving an input video at an encoder;
dividing the input video into a plurality of video blocks at the encoder;
selecting, by the encoder, codes corresponding to the plurality of video blocks of the input video from a codebook comprising the codes;
determining, by the encoder, an allocated code matrix comprising the codes corresponding to the plurality of video blocks of the input video;
receiving, by a decoder, the allocated code matrix from the encoder; and
generating, by the decoder, a reconstructed video based on the allocated code matrix.
15. The method of claim 14, wherein the codes corresponding to the plurality of video blocks of the input video are selected from the codebook using a lookup function, wherein the method further comprises:
extracting, by the encoder, potential features from the plurality of video blocks of the input video; and
determining, by the encoder, a potential feature matrix including the potential features extracted from the plurality of video blocks of the input video.
16. The method of claim 15, wherein the lookup function is operable to select the closest code as the allocation for each of the potential features in the potential feature matrix to determine the allocated code matrix, and
wherein the encoder is operable to select codes from the codebook corresponding to the video blocks of the plurality of video blocks of the input video by comparing similarity metrics between the potential feature representations of the video blocks in the potential feature matrix and the corresponding codes in the codebook.
17. The method of claim 14, wherein a code among the codes in the allocated code matrix is assigned to a video block of the plurality of video blocks of the input video based on vector quantization of potential features corresponding to the video block,
wherein a vector quantization loss includes a reconstruction loss that occurs during generation of the reconstructed video and a loss that occurs during the vector quantization of potential features corresponding to the plurality of video blocks of the input video, and
wherein a total loss for generating the reconstructed video from the input video includes the vector quantization loss and an adversarial training loss.
18. The method of claim 14, further comprising:
extracting, by the encoder, potential features from the plurality of video blocks of the input video;
encoding, by the encoder, each of the plurality of video blocks into a potential feature vector based on the extracted potential features;
allocating, by the encoder, a code to each of the plurality of video blocks to determine an allocation of the plurality of video blocks;
determining, by the encoder, a set of blocks comprising the allocated codes;
determining, by the encoder, an anomaly score for each of the allocated codes of the set of blocks;
comparing, by the encoder, the anomaly score for each of the allocated codes of the set of blocks to a threshold; and
determining, by the encoder, a defect in one or more of the plurality of video blocks based on a result of the comparison.
19. A non-transitory computer-readable storage medium operable to store instructions that, when executed by a processor included in a computing device, cause the computing device to:
receiving an input video at an encoder of the computing device;
dividing the input video into a plurality of video blocks at the encoder;
selecting, by the encoder, codes corresponding to the plurality of video blocks of the input video from a codebook comprising the codes;
determining, by the encoder, an allocated code matrix comprising the codes corresponding to the plurality of video blocks of the input video;
receiving, by a decoder of the computing device, the allocated code matrix from the encoder; and
generating, by the decoder, a reconstructed video based on the allocated code matrix.
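As an illustration of the loss composition recited in claims 7 to 10 and 17, the sketch below shows one plausible way to combine a vector quantization loss (a reconstruction term plus quantization terms) with the adversarial training loss produced by a block-level discriminator. The weighting factors, the discriminator interface, and every name in the sketch are assumptions made for illustration; the claims do not prescribe them.

```python
# Hypothetical total training loss: vector quantization loss + adversarial loss.
import torch
import torch.nn.functional as F

def total_training_loss(video, recon, z_e, z_q, disc_logits_fake,
                        beta=0.25, adv_weight=0.1):
    """video/recon: input blocks and their reconstruction; z_e/z_q: encoder
    latents and their quantized (codebook) versions; disc_logits_fake:
    block-level discriminator logits for the reconstruction."""
    recon_loss = F.mse_loss(recon, video)            # loss during reconstruction
    codebook_loss = F.mse_loss(z_q, z_e.detach())    # loss during vector quantization
    commit_loss = F.mse_loss(z_e, z_q.detach())      # keeps encoder outputs near codes
    vq_loss = recon_loss + codebook_loss + beta * commit_loss
    # Generator-side adversarial loss from the block-level discriminator.
    adv_loss = F.binary_cross_entropy_with_logits(
        disc_logits_fake, torch.ones_like(disc_logits_fake))
    return vq_loss + adv_weight * adv_loss
```

Splitting the quantization term into a codebook term and a commitment term follows the common vector-quantized auto-encoder convention; the claims require only that the total loss include a vector quantization loss and an adversarial training loss.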
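The test-time steps of claims 11 to 13 and 18 can be pictured similarly: allocate a code to each test video block, score each allocated code with a probability density function, and flag blocks whose score falls below a threshold. The diagonal-Gaussian density and all names below are assumptions; the claims require only some probability density function and a threshold comparison.

```python
# Hypothetical anomaly scoring for test video blocks.
import math
import torch

def detect_defective_blocks(test_features, codebook, mean, var, threshold):
    """test_features: (num_test_blocks, dim); mean, var: (dim,) density
    parameters estimated from normal (defect-free) training data."""
    dists = torch.cdist(test_features, codebook)
    assigned = dists.argmin(dim=1)            # code allocated to each test block
    q = codebook[assigned]
    # Log-density of a diagonal Gaussian, used here as the anomaly score.
    scores = -0.5 * (((q - mean) ** 2) / var + torch.log(2 * math.pi * var)).sum(dim=-1)
    defects = scores < threshold              # True where a defect is flagged
    return assigned, scores, defects
```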
CN202310985949.0A 2022-08-05 2023-08-07 System and method for closed circuit television anomaly detection and storage medium Pending CN117528053A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US63/395,782 2022-08-05
US18/074,195 2022-12-02
US18/074,195 US20240048724A1 (en) 2022-08-05 2022-12-02 Method for video-based patch-wise vector quantized auto-encoder codebook learning for video anomaly detection

Publications (1)

Publication Number Publication Date
CN117528053A true CN117528053A (en) 2024-02-06

Family

ID=89751968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310985949.0A Pending CN117528053A (en) 2022-08-05 2023-08-07 System and method for closed circuit television anomaly detection and storage medium

Country Status (1)

Country Link
CN (1) CN117528053A (en)

Similar Documents

Publication Publication Date Title
CN109741292B (en) Method for detecting abnormal image in first image data set by using countermeasure self-encoder
CN108805015B (en) Crowd abnormity detection method for weighted convolution self-coding long-short term memory network
CN107636690B (en) Full reference image quality assessment based on convolutional neural network
KR101664913B1 (en) Method and system for determining a quality measure for an image using multi-level decomposition of images
KR101822829B1 (en) Method of identifying partial discharge and noise of switchgear using machine learning
WO2021156271A1 (en) Anomaly detector, method of anomaly detection and method of training an anomaly detector
CN110096938B (en) Method and device for processing action behaviors in video
CN111263161A (en) Video compression processing method and device, storage medium and electronic equipment
Wang et al. Active fine-tuning from gMAD examples improves blind image quality assessment
EP4235511A1 (en) Vector quantized auto-encoder codebook learning for manufacturing display extreme minor defects detection
CN113965659B (en) HEVC (high efficiency video coding) video steganalysis training method and system based on network-to-network
CN110851654A (en) Industrial equipment fault detection and classification method based on tensor data dimension reduction
CN115393231B (en) Defect image generation method and device, electronic equipment and storage medium
CN114862857A (en) Industrial product appearance abnormity detection method and system based on two-stage learning
Athar et al. Degraded reference image quality assessment
US20210201042A1 (en) Method and apparatus for detecting abnormal objects in video
CN117528053A (en) System and method for closed circuit television anomaly detection and storage medium
Kim et al. Long-term video generation with evolving residual video frames
EP4318409A1 (en) Method for video-based patch-wise vector quantized autoencoder codebook learning for video anomaly detection
CN109089115A (en) The adaptive QP compensation and CU high-speed decision of 360 degree of Video codings
CN115052146A (en) Content self-adaptive down-sampling video coding optimization method based on classification
TW202416718A (en) System and method for video-based patch-wise vector quantized auto-encoder codebook learning for video anomaly detection and computer readable storage medium thereof
Feng et al. RankDVQA-mini: Knowledge distillation-driven deep video quality assessment
Adiban et al. S-HR-VQVAE: Sequential Hierarchical Residual Learning Vector Quantized Variational Autoencoder for Video Prediction
Alhakim et al. Image quality assessment using nonlinear learning methods

Legal Events

Date Code Title Description
PB01 Publication