WO2014088407A1 - A self-learning video analytic system and method thereof - Google Patents

A self-learning video analytic system and method thereof

Info

Publication number: WO2014088407A1
Authority: WO (WIPO, PCT)
Prior art keywords: properties, analytic system, video, video analytic, parameter configuration
Prior art date: 2012-12-06
Application number: PCT/MY2013/000248
Other languages: French (fr)
Inventors: Shahirina Binti Mohd TAHIR, Zulaikha Binti KADIM, Ettikan Kandasamy A/L KARUPPIAH
Original Assignee: Mimos Berhad
Priority date: 2012-12-06
Filing date: 2013-12-04
Publication date: 2014-06-12
Application filed by Mimos Berhad
Publication of WO2014088407A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames


Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a video analytic system 10 having a machine-learning engine 105 for enabling the video analytic system 10 to classify at least one object within an image of a video input 101 in an unsupervised manner. The machine-learning engine 105 comprises a properties extraction unit 201a configured to extract the object properties of an object when the object is found to be novel to the system 10, and a pixel cluster optimizer 201b configured to generate a plurality of optimized parameter configurations that accurately describe the properties of the novel object by clustering the objects based on similarity of the object properties, segmenting pixels within each resultant cluster into several sub-clusters of substantially correlated pixels, and subsequently combining the property value associated with each of the sub-clusters.

Description

A SELF-LEARNING VIDEO ANALYTIC SYSTEM AND METHOD THEREOF
FIELD OF INVENTION [001] The present invention relates to a video analytic system. More particularly, it relates to a self-learning video analytic system.
BACKGROUND [002] Most conventional video analytic systems require user input to identify which events and behaviors are common, and which are abnormal and suspicious, in a scene captured by the video analytic system. These video analytic systems cannot automatically adjust their parameter settings. Instead, they often require users to adjust the settings of the video analytic system manually. Such manual intervention is, undoubtedly, a time-consuming task. It is also overly burdensome and impractical, especially when a large number of configuration parameters needs to be reset to increase the accuracy of the video analytic system during the object classification and identification process. [003] In view of the above, various video analytic systems with self-learning functions have increasingly been proposed. These self-learning functions are generally configured to allow the video analytic systems to intelligently learn, identify and recognize patterns of objects from a stream of image frames and thereby, based on what has been learned, adjust the configuration parameters without any user input. An example of a self-learning video analytic system is U.S. Patent Application Publication No. 2010/0215254 A1. This patent document provides a self-learning and categorization system having a plurality of fuzzy logics that is configured to automatically classify an object within a stream of video images by comparing the images to a plurality of templates. The plurality of templates are training images whose classifications are known to the self-learning and categorization system. The template with the closest matching score to an observed image is then determined, and the labels associated with it are used to identify the object within the image.
SUMMARY [004] In one aspect of the present invention, a video analytic system with self-learning capacity is disclosed. The video analytic system is able to automatically learn features that represent at least one novel object occurring within a scene the video analytic system is monitoring, and thereby generates an optimized parameter configuration based on the learned features to improve the accuracy of the system in object classification and identification. [005] The video analytic system includes a machine-learning engine that learns novel properties of objects within a video image inputted from a video input in an unsupervised manner; with the use of the learned data, the object classification and tracking operations of the video analytic system can be enhanced.
[006] The machine-learning engine has a properties extraction unit for extracting properties of the at least one object from one image of the video input. A pixel cluster optimizer is also provided within the machine-learning engine. The pixel cluster optimizer is adapted to cluster the at least one object based on similarity of the object properties, to segment pixels within each resultant cluster into several sub-clusters of substantially correlated pixels, and to subsequently combine the property value associated with each of the sub-clusters to generate a plurality of optimized parameter configurations. The resultant optimized configurations accurately describe the properties of each of the object blobs, and they are stored in a parameter configuration catalogue.
[007] The video analytic system further includes an object evaluator having a property comparator that compares each object property of the at least one object, in a parallel manner, with a corresponding optimized parameter configuration acquired from the training unit. Based on the parallel comparison, the object evaluator computes the best-estimate property value of each property of the object using a weighted-averaging method. The resultant best-estimate property values are then forwarded to an object identifier so that the object can be confidently identified.
[008] In another aspect of the present invention, a method for enabling a video analytic system to classify at least one object within a video image in an unsupervised manner is provided. The method comprises determining whether the object properties of each of the at least one object in the image are learned or known by the video analytic system based on a plurality of optimized parameter configurations that have been previously acquired and stored in a parameter configuration catalogue; extracting the object properties of the at least one object when the object includes at least one object property that is novel to the video analytic system; clustering the object blobs based on their similar object properties; segmenting pixels within each resultant cluster into sub-clusters of substantially correlated pixels; and thereby generating an optimized parameter configuration by combining the property values associated with each of the sub-clusters.
BRIEF DESCRIPTION OF THE DRAWINGS
[009] Other objects, features, and advantages of the invention will be apparent from the following description when read with reference to the accompanying drawings. In the drawings, wherein like reference numerals denote corresponding parts throughout the several views: [010] Figure 1 illustrates a block diagram of a video analytic system in accordance with one embodiment of the present invention;
[011] Figure 2 illustrates an operational flow of a video analytic system in accordance with an embodiment of the present invention;
[012] Figure 3 illustrates an operational flow of a properties extraction unit of the training unit in accordance with an embodiment of the present invention for extracting properties of object blobs;
[013] Figure 4 illustrates a process flow for generating a plurality of optimized parameter configurations by a cluster optimizer unit based on the acquired properties of each object blob from the properties extraction unit of the training unit of Figure 1; [014] Figure 5 is a flow diagram illustrating how a trained classifier operates to identify whether the features of an object has already been learned or not by the video analytic system; and
[015] Figure 6 illustrates operational flow of an object evaluator in accordance with an embodiment of the present invention. DETAILED DESCRIPTION
[016] The present invention will now be described in detail with reference to the accompanying drawings. It is to be understood that no limitation of the scope of the invention is thereby intended; such alterations and further modifications in the illustrated device, and such further applications of the principles of the invention as illustrated therein, are contemplated as would normally occur to one skilled in the art to which the invention relates.
[017] Figure 1 illustrates a block diagram of a video analytic system 10 in accordance with one embodiment of the present invention. The video analytic system 10 includes a video input 101, a background estimator 102, a connected component labeler 103, a machine-learning engine 105, an object evaluator 104, a filtering unit 106, an event analytics unit 107, and a detection unit 108. Each of the components is operatively associated with a processor (not shown). The processor executes each of the components embedded in the video analytic system 10 based on at least one set of program instructions. The processor is also adapted to perform data processing and other data management services in order to coordinate the overall operation of the video analytic system 10. The processor may be custom-made or any commercially available microprocessor, digital signal processor, or central processing unit known in the art.
[018] The video input 101 monitors and records an area of interest, and events taking place therein, as a sequence of individual video frames. The video input 101 may be configured to capture such video data at a specific frame rate. The video input 101 may be a video camera, a VCR, DVR, DVD, web-cam device or the like. Video data captured from the video input may be compressed by the video input 101 using a suitable compression standard such as, for example, MPEG-4 or H.264, before being transmitted to the background estimator 102. The background estimator 102 differentiates foreground objects in each video frame of the received video data by generating at least one binary map, in which the pixels belonging to the foreground objects and to the background are respectively identified. The resultant binary map is then forwarded to the connected component labeler 103. The connected component labeler 103 groups and labels all the foreground pixels in the resultant binary map to form at least one object blob for subsequent analysis, such as, for example, object tracking and classification.
[019] Prior to being analyzed by the event analytics unit 107 and the detection unit 108 for objects of interest and/or anomalous events in the captured video images, the resultant object blobs from the connected component labeler are sent to the machine-learning engine 105. The machine-learning engine 105 is configured to learn features that represent each object within a video frame over a period of time and, based on the learned features and parameters, obtain a plurality of optimized parameter configurations for the video analytic system 10 to enhance its classification accuracy.
[020] In accordance with the illustrated embodiment, the machine-learning engine 105 has a training unit 201, a training status identifier 202, a trained classifier (not shown) and a parameter configuration catalogue 203. The training unit 201 comprises a properties extraction unit 201a, a pixel cluster optimizer 201b and a local properties database 201c.
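The patent does not tie the background estimator 102 or the connected component labeler 103 to any particular algorithm. Purely as an illustration, the following Python sketch, assuming OpenCV as a stand-in and a hypothetical minimum blob area, shows how a binary motion map and labeled object blobs could be produced from incoming frames.

```python
import cv2

# Stand-in background estimator (102): OpenCV's MOG2 subtractor.
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)

def extract_object_blobs(frame, min_area=50):
    """Return a binary motion map and a list of labeled foreground blobs."""
    # Binary map: foreground pixels -> 255, background pixels -> 0.
    motion_map = subtractor.apply(frame)
    _, motion_map = cv2.threshold(motion_map, 127, 255, cv2.THRESH_BINARY)

    # Connected component labeler (103): group connected foreground pixels
    # into labeled object blobs for later tracking and classification.
    num_labels, _, stats, centroids = cv2.connectedComponentsWithStats(motion_map)
    blobs = []
    for label in range(1, num_labels):            # label 0 is the background
        x, y, w, h, area = stats[label]
        if area >= min_area:                      # drop noise-sized components
            blobs.append({"label": label,
                          "bbox": (int(x), int(y), int(w), int(h)),
                          "area": int(area),
                          "centroid": tuple(centroids[label])})
    return motion_map, blobs
```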
The properties extraction unit 201a is adapted to extract the positional information and the physical properties of each object within the video frames it receives from the connected component labeler 103. The properties extraction unit 201a is also configured to assign a confidence level to each of the objects within its received video frames. The confidence level assigned to each object reflects how frequently the object has been tracked and classified by the trained classifier under a respective classification over the video frames received from the connected component labeler 103. For example, if an object has appeared in 5 frames and the trained classifier has identified the object as "human" for 4 frames and as "vehicle" for 1 frame, this particular object will be labeled as "human" with a confidence level of 0.8.
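The 4-out-of-5 example above implies that the confidence level is simply the fraction of frames carrying the most frequent label. A minimal sketch of that rule, with illustrative function names not taken from the patent:

```python
from collections import Counter

def label_with_confidence(per_frame_labels):
    """per_frame_labels: labels the trained classifier gave the same tracked
    object, one entry per frame in which it appeared."""
    counts = Counter(per_frame_labels)
    best_label, votes = counts.most_common(1)[0]
    return best_label, votes / len(per_frame_labels)

# Example from the description: 4 "human" votes out of 5 frames.
print(label_with_confidence(["human", "human", "vehicle", "human", "human"]))
# -> ('human', 0.8)
```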
[021] The pixel cluster optimizer 201b is configured to generate a plurality of optimized parameter configurations for each object in a video frame. To do so, the pixel cluster optimizer 201b clusters each object confidently classified by the properties extraction unit 201a with respect to its object properties and subsequently segments the pixels within each resultant cluster into several sub-clusters of substantially correlated pixels. The property values associated with each of the sub-clusters are then combined to generate a parameter configuration for the video analytic system 10.
[022] The training status identifier 202 determines whether the video analytic system 10 has been trained based on the plurality of optimized parameter configurations that have been previously generated and stored in the parameter configuration catalogue 203. For example, the video analytic system 10 is considered untrained when a plurality of optimized parameter configurations newly generated therefrom are not among the parameter configurations maintained in the parameter configuration catalogue 203. Otherwise, the video analytic system 10 is considered trained when it has learned all the optimized parameter configurations that the training unit 201 has generated and stored in the parameter configuration catalogue 203. [023] Still referring to Figure 1, the object evaluator 104 includes an object identifier 104a and a property comparator 104b. The property comparator 104b is adapted to compare each object property of an observed object, in parallel, with the corresponding optimized parameter configurations acquired from and maintained in the parameter configuration catalogue 203. For example, if the observed object has been confidently classified by the trained classifier as a "vehicle", the property comparator 104b will retrieve the parameter configurations relating to "vehicle" for evaluating the true identity of the observed object. The property comparator 104b fuses the hypothetical results from the parallel feature comparison operations using any weighted-averaging method known in the art in order to obtain the best-estimate property values of each sub-cluster within the observed object, which the object identifier 104a uses to identify the particular observed object.
[024] Figure 2 illustrates an operational flow of a video analytic system 10 in accordance with an embodiment of the present invention. The video analytic system 10 is initiated when a background estimation unit 102 receives video data from a video input 101 in step S301. The video data includes a sequence of individual video frames, each frame depicting a scene captured by the video input 101. Upon receipt of the video data, in step S302, the background estimation unit 102 isolates foreground objects, such as, for example, people, vehicles, and any moving objects of interest, from the background of each video frame. The result of the background estimation is output as a motion map. The motion map is a binary map in which motion or foreground pixels that indicate the foreground objects, and background pixels that indicate the static background, are well defined. The motion map is then forwarded to a connected component labeler 103. In the connected component labeler 103, connected foreground pixels and background pixels in the video frame are respectively grouped and assigned the same labels. Connected foreground pixels collectively form an object blob.
[025] A training status identifier 202 thereafter receives and analyzes the object blobs within the motion maps output from the connected component labeler 103 in step S303. The training status identifier 202 determines whether the parameters and features of the object blobs, preferably in the form of parameter configurations, have been learned by the video analytic system 10. When the parameter configurations of the objects are unknown or considered novel to the video analytic system 10, the particular video frame is sent to a training unit 201 in step S304. The training unit 201 extracts, learns and generates a plurality of optimized parameter configurations that represent the distinguishing features and characteristics of the objects within the video frame. The plurality of resultant optimized parameter configurations is then stored in a parameter configuration catalogue 203 and is readily utilized by the video analytic system 10 to identify objects in later video frames.
[026] However, when the video analytic system is considered trained, the connected component labeler 103 outputs its labeled motion map to the object evaluator 104, in step S305. The object evaluator 104 analyzes the motion map and identifies each of the object blobs within the motion map by evaluating each of the features and characteristics of the object blobs based on the parameter configurations available in the parameter configuration catalogue 203. Each feature and characteristic of the object blob is then compared, in parallel, against all the well-defined and corresponding features and characteristics of objects that are associated with the parameter configuration. Matching scores for each feature and characteristic of the object blob against those of the corresponding parameter configuration are computed and fused using a weighted-averaging method to obtain best-estimate property values for the object evaluator 104 to identify the object blob. [027] Once identified, the object evaluator 104 determines if the object blob is an object of interest, in step S306. When the object blob is not an object the user is seeking, a filtering unit 106 is initiated. The filtering unit removes this particular object blob from the system 10, in step S307. Otherwise, the particular object blob is passed to an event analysis unit 107, in step S308. The event analysis unit 107 validates whether the object blob represents a normal event using a pre-configured rule set. A detection unit 108 triggers an alarm when the object blob is found and verified to cause an intrusion event within the scene captured by the video input 101, in step S309.
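The routing of steps S305-S309 can be summarized as a small decision flow. The sketch below is only an illustration of that flow; the unit behaviors are passed in as placeholder callables because the patent does not define any programming interface for them.

```python
def process_identified_blob(blob, identify, is_of_interest, analyze_event, trigger_alarm):
    """identify / is_of_interest / analyze_event / trigger_alarm stand in for
    the object evaluator (104), filtering decision (106), event analysis unit
    (107) and detection unit (108), respectively."""
    identity = identify(blob)              # S305: weighted-average evaluation
    if not is_of_interest(identity):       # S306
        return "filtered"                  # S307: blob removed from the system
    event = analyze_event(identity)        # S308: rule-set validation
    if event == "intrusion":
        trigger_alarm(identity)            # S309
    return event
```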
[028] Figure 3 is a process flow illustrating how the training unit 201 operates to extract, learn, and obtain a plurality of optimized parameter configurations based on the features and characteristics of object blobs that are novel to the video analytic system 10. As illustrated, object blobs that are delineated from the background image are labeled and classified by a trained classifier in step S401. The object labels given by the trained classifier are then used to group the object blobs based on their similarity. For example, object blobs with the same label will belong to the same group, while object blobs with different labels will be grouped differently. The features and characteristics of each object group are then extracted and temporarily stored by a properties extraction unit 201a in a local properties database 201c in steps S402-S403. Both the properties extraction unit 201a and the local properties database 201c reside within the training unit 201. In accordance with one embodiment, features and characteristics of the object blobs such as bounding box ratio, object orientation, major/minor axis, object size, and object color may be determined and extracted.
[029] Once the features of each object blob have been extracted and stored, the training unit 201 determines whether these well-identified object blobs have also appeared in subsequent frames in step S404. The properties extraction unit 201a then computes a confidence level for each of the object blobs. The confidence level is computed based on how frequently the trained classifier has identified the object blob under a corresponding classification in step S405. For example, as described in the preceding paragraphs, the confidence level of an object blob as a "human" will be 0.8 if this particular object blob has been classified by the trained classifier as "human" in 4 out of 5 subsequent video frames inputted into the trained classifier over a period of time.
[030] Then, the properties extraction unit 201a determines whether the object blobs have been confidently classified by comparing the confidence level assigned to each object blob against a preset threshold value in step S406. When an object blob is assigned a confidence level at or beyond the preset threshold value, it is considered to have been confidently classified. The properties extraction unit 201a tracks these confidently classified object blobs in sequential video frames. Features of these object blobs found in subsequent video frames are extracted and permanently maintained in the local database 201c of the training unit 201 in step S407. Meanwhile, the properties extraction unit 201a compares the acquired features of corresponding object blobs in sequential frames in order to capture the changes in appearance and position of these object blobs in step S408. By doing so, the properties extraction unit 201a obtains temporal properties of these particular object blobs, such as object speed, moving path, and object interaction within a particular scene.
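As one possible reading of steps S407-S408, temporal properties such as the moving path and speed can be derived from the centroids of a tracked blob in consecutive frames. The helper below is a sketch under assumed conventions (a fixed frame rate, speed in pixels per second) rather than the patent's prescribed method.

```python
import math

def temporal_properties(centroids, fps=25.0):
    """centroids: [(x, y), ...] positions of the same blob in consecutive frames."""
    path = list(centroids)                      # moving path of the object blob
    if len(path) < 2:
        return {"path": path, "speed_px_per_s": 0.0}
    # Mean frame-to-frame displacement, scaled by the assumed frame rate.
    steps = [math.dist(a, b) for a, b in zip(path, path[1:])]
    return {"path": path, "speed_px_per_s": (sum(steps) / len(steps)) * fps}
```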
[031] Soon after the features and properties of each object blob are acquired, a cluster optimizer unit 201b is prompted. The cluster optimizer 201b clusters motion pixels within each confidently classified object blob into groups of similar object properties and thereby generates a plurality of optimized parameter configurations for enhanced object classification, in step S409. In the subsequent step, the training unit 201 checks whether the training process has completed successfully, in step S410. When the training process has been carried out successfully, the training unit 201 flags the video analytic system 10 as trained and the training process is terminated. Otherwise, the training process resumes from steps S401-S410 until the training unit 201 has successfully learned the features and characteristics of the object blobs that were previously found to be new to the video analytic system 10.
[032] Figure 4 illustrates a process flow for generating a plurality of optimized parameter configurations by the pixel cluster optimizer unit 201b, based on the properties of each object blob acquired from the properties extraction unit 201a. The pixel cluster optimizer 201b first compares the pixels of each object blob acquired from the images over a period of time, and then clusters them according to their object properties in step S501. The pixel cluster optimizer unit 201b computes statistical parameters of the property values in each cluster in step S502. The statistical parameters include the minimum, maximum, mean, and standard deviation of the property values in the cluster. With the use of the statistical parameters, the pixel cluster optimizer unit 201b then computes a respective confidence value for each pixel within the clusters in step S503. In one embodiment, the confidence value may be computed on a per-pixel basis or per group of pixels.
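A rough sketch of steps S501-S503 follows. The patent does not name a clustering algorithm, so k-means is used here as a stand-in, and the per-pixel confidence is modeled, by assumption, as a Gaussian score of the pixel's distance from its cluster mean.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def cluster_statistics(pixel_properties, n_clusters=3):
    """pixel_properties: (n_pixels, n_properties) array, e.g. colour and position."""
    data = np.asarray(pixel_properties, dtype=float)
    _, assignment = kmeans2(data, n_clusters, minit="++")   # S501 (assumed k-means)
    stats = {}
    confidences = np.zeros(len(data))
    for c in range(n_clusters):
        mask = assignment == c
        members = data[mask]
        if members.size == 0:
            continue
        mean, std = members.mean(axis=0), members.std(axis=0) + 1e-9
        stats[c] = {"min": members.min(axis=0), "max": members.max(axis=0),
                    "mean": mean, "std": std}                # S502 statistics
        # S503: per-pixel confidence, here a Gaussian score of the distance
        # from the cluster mean measured in standard deviations (an assumption).
        z = np.linalg.norm((members - mean) / std, axis=1)
        confidences[mask] = np.exp(-0.5 * z ** 2)
    return stats, assignment, confidences
```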
[033] Pixels having similar confidence values are then clustered to form independent sub-clusters within each cluster in step S504. The property values associated with each of the resultant sub-clusters are then considered to have been optimized, because most pixels in a resultant sub-cluster are associated with the same property value, which represents a corresponding feature of the object. These property values are then combined to form a parameter configuration in step S505. The parameter configuration confidently describes a respective object and is necessary for the video analytic system 10 to achieve high object classification accuracy. [034] Figure 5 is a flow diagram illustrating how a training status identifier 202 operates to identify whether the features of an object have already been learned by the video analytic system 10. To determine whether the video analytic system 10 has been trained with respect to the features of the object blobs in the currently observed images, a trained classifier is initiated to classify the object blobs in step S601. Subsequently, the classification of the object blobs by the trained classifier is forwarded to the object evaluator 104. The object evaluator 104 uses the plurality of optimized parameter configurations stored in the parameter configuration catalogue 203 to check whether the object blobs have been correctly classified, in step S602. The object evaluator 104 assigns each of the object blobs a respective object label.
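Returning to the sub-clustering of Figure 4, steps S504-S505 can be sketched as grouping pixels into confidence bands and combining each band's property values into one parameter-configuration entry. The bin width and the use of the mean as the "combined" value are assumptions for illustration only.

```python
import numpy as np

def build_parameter_configuration(pixel_properties, confidences, bin_width=0.1):
    """Group pixels into sub-clusters of similar confidence (S504) and combine
    each sub-cluster's property values into one configuration entry (S505)."""
    data = np.asarray(pixel_properties, dtype=float)
    conf = np.asarray(confidences, dtype=float)
    sub_ids = np.floor(conf / bin_width).astype(int)      # confidence bands

    configuration = []
    for sub_id in np.unique(sub_ids):
        members = data[sub_ids == sub_id]
        configuration.append({
            "confidence_band": (sub_id * bin_width, (sub_id + 1) * bin_width),
            "property_value": members.mean(axis=0),       # combined value (assumed mean)
            "support": int(len(members)),
        })
    return configuration   # to be stored in the parameter configuration catalogue 203
```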
[035] The training unit 201 thereafter computes the classification accuracy of these labeled object blobs in step S603. The classification accuracy is computed by determining the number of object blobs that have been correctly classified by the trained classifier over all the received frames in which the object blobs are located. It should be noted that an object blob is deemed correctly classified when the evaluation process outputs the same object label as the one the trained classifier previously assigned to the object blob. [036] In the subsequent step S604, the training unit 201 re-checks the classification accuracy computed for each labeled object blob by finding the ratio of correct classifications of the object blob over a number of subsequent frames. When the training unit 201 finds that the classification accuracy of an object blob is decreasing over subsequent frames, the training unit 201 sends that trained object blob back to the properties extraction unit 201a to re-perform the pixel clustering, by which the features and parameters of the object blob are re-learned and re-processed, in step S605. As a result, an optimized parameter configuration that more accurately describes the features and parameters of the object blob is obtained. Further, the classification accuracy measurement for this particular object blob is reset accordingly. In the meantime, the video analytic system 10 is flagged as untrained in step S606.
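The accuracy bookkeeping of steps S603-S606 can be illustrated with a small tracker that keeps a running accuracy per object and reports when it starts to drop. Class and attribute names are hypothetical, not drawn from the patent.

```python
class TrainingStatusTracker:
    """Per-object running classification accuracy, as in steps S603-S606."""

    def __init__(self):
        self.accuracy = {}   # object id -> list of running accuracies, one per frame

    def update(self, obj_id, classifier_label, evaluator_label):
        history = self.accuracy.setdefault(obj_id, [])
        correct = 1.0 if classifier_label == evaluator_label else 0.0
        frames_seen = len(history) + 1
        previous_correct = history[-1] * (frames_seen - 1) if history else 0.0
        history.append((previous_correct + correct) / frames_seen)
        return history[-1]

    def needs_retraining(self, obj_id):
        """True when accuracy is decreasing over subsequent frames (S604-S605);
        the caller would then flag the system untrained (S606) and re-learn."""
        history = self.accuracy.get(obj_id, [])
        return len(history) >= 2 and history[-1] < history[-2]
```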
[037] However, when the classification accuracy computed for a respective object blob is not decreasing over a number of frames, the training unit 201 determines whether the classification accuracy is at or beyond a preset threshold value in step S607. The training status of the video analytic system 10 is then flagged as trained when the classification accuracy lies at or beyond the preset threshold value in step S608. It is otherwise flagged as untrained when the classification accuracy is below the preset threshold value. [038] Figure 6 illustrates the operational flow of the object evaluator 104 for evaluating an object blob using a plurality of optimized parameter configurations. As shown, objects extracted from the background of a current frame in step S701 are processed in subsequent step S702 so that their properties are obtained. The computation is carried out by first determining the position coordinates of the objects in the image in step S703 and thereafter, in step S704, based on the positional information of the objects, determining a corresponding cluster for each object property of the extracted objects. The identified optimized parameter configuration for each cluster is extracted from the parameter configuration catalogue 203 in step S705. During the evaluation process, each object property of an observed object is compared with its corresponding optimized parameter configuration in a parallel manner in steps S706-S708. The hypothetical results obtained from each of these parallel feature comparison operations are then fused in step S709 to derive a weighted mean. The resultant weighted mean is the best estimate of the object's properties, and it is used to identify the particular object in step S710.
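One way to realize the parallel comparison and weighted-mean fusion of steps S706-S710 is sketched below. Scoring each property against a stored mean and standard deviation, and the per-property weights, are assumptions; the patent only requires that the parallel comparison results be fused with a weighted-averaging method.

```python
import math

def fuse_property_scores(observed, reference):
    """observed: {property name: value}.
    reference: {property name: {"mean": ..., "std": ..., "weight": ...}} taken
    from the optimized parameter configuration of the candidate class."""
    scores, weights = [], []
    for name, value in observed.items():
        ref = reference[name]
        z = (value - ref["mean"]) / max(ref["std"], 1e-9)
        scores.append(math.exp(-0.5 * z * z))     # matching score in (0, 1]
        weights.append(ref.get("weight", 1.0))
    # S709: fuse the parallel comparison results into a weighted mean.
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# S710: the object is identified with the class whose configuration yields
# the highest fused score.
```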
[039] As will be readily apparent to those skilled in the art, the present invention may easily be produced in other specific forms without departing from its essential characteristics. The present embodiments are, therefore, to be considered as merely illustrative and not restrictive; the scope of the invention is indicated by the claims rather than the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims

1. A machine learning engine 105 for a video analytic system 10 adapted for classifying at least one object within a video input 101 in an unsupervised manner, the machine learning engine 105 comprising:
a properties extraction unit 201a for extracting properties of the at least one object from one image of the video input 101;
a pixel cluster optimizer 201b for clustering the at least one object into independent clusters based on similarity of the extracted properties, wherein pixels of each resultant cluster are segmented into sub-clusters of substantially correlated pixels, and wherein the pixel cluster optimizer combines the property values that are associated with the sub-clusters to form a plurality of optimized parameter configurations;
a parameter configuration catalogue 203 for storing the plurality of optimized parameter configurations that are generated by the training unit 201; and
a training status identifier 202 for checking if properties of the at least one object have been learned by the video analytic system 10, wherein the training status identifier 202 flags the video analytic system 10 as untrained if the properties of the at least one object have not been learned, and otherwise flags the video analytic system 10 as trained if the properties of the at least one object have been learned.
2. A machine learning engine 105 as claimed in claim 1, wherein the video analytic system 10 further includes an object evaluator 104, the object evaluator 104 having: a property comparator 104b for comparing each object property of the at least one object in a parallel manner with respect to a corresponding optimized parameter configuration acquired from the training unit 201 and, based on the parallel comparison, computing an estimated property value of each property of the at least one object; and an object identifier 104a for determining the best estimated property values of the at least one object by weighted-averaging the estimated property values that are computed by the property comparator 104b, and thereafter confidently identifying the at least one object based on the resultant best estimated property values.
3. A method for enabling a video analytic system 10 to classify at least one object within a video input 101 in an unsupervised manner, the method comprising:
identifying whether object properties of the at least one object in the video input 101 are learned by the video analytic system 10 based on a plurality of optimized parameter configurations, wherein the plurality of optimized parameter configurations has been previously acquired and stored in a parameter configuration catalogue 203;
extracting the object properties of the at least one object when the at least one object includes at least one object property that is novel to the video analytic system 10;
determining if the at least one object has been confidently classified; clustering the at least one confidently classified object based on similarity of the object properties after the properties extraction step;
segmenting pixels within each of resultant independent clusters into sub-clusters of substantially correlated pixels; and
generating an optimized parameter configuration by combining property values that are associated with each of the sub-clusters.
4. A method as claimed in claim 3, wherein the step of determining if the object is confidently classified further includes computing a confidence level for each of the objects based on the frequency with which a trained classifier has identified the object blobs under the same classification.
5. A method as claimed in claim 3, wherein the step of determining whether the object properties of the at least one object in the video input 101 are learned by the video analytic system 10 further comprises:
classifying the at least one object;
evaluating whether the at least one object has been correctly classified using the plurality of optimized parameter configurations, wherein the plurality of optimized parameter configurations has been previously stored in the parameter configuration catalogue 203;
computing classification accuracy for the at least one object by determining the number of times the at least one object has been correctly classified by the trained classifier, wherein the at least one object is deemed correctly classified when the evaluation operation assigns the at least one object the same object label as that previously assigned by the trained classifier;
re-checking the classification accuracy computed on each of the labeled objects by finding the ratio of correct classification of the at least one object over a number of subsequent images;
flagging the video analytic system 10 as untrained if the classification accuracy of the at least one object decreases over the subsequent images; and
flagging the video analytic system 10 as trained if the classification accuracy of the at least one object does not decrease over the subsequent images and the classification accuracy exceeds the acceptance level defined by the preset threshold value.
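The accuracy re-check and flagging logic of claim 5 could look like the following sketch; the trend test (comparing the first and second halves of the accuracy history) and the threshold value are assumptions made purely for illustration.

```python
# Illustrative sketch of the trained/untrained decision; all details assumed.
from typing import List

import numpy as np


def classification_accuracy(correct_flags: List[bool]) -> float:
    """Ratio of correct classifications over a number of subsequent images."""
    return sum(correct_flags) / len(correct_flags)


def is_trained(accuracy_history: List[float], threshold: float = 0.9) -> bool:
    """Flag as trained when accuracy is not decreasing and exceeds the threshold."""
    acc = np.array(accuracy_history)
    if len(acc) < 2:
        return bool(acc[-1] >= threshold)
    half = len(acc) // 2
    not_decreasing = acc[half:].mean() >= acc[:half].mean()
    return bool(not_decreasing and acc[-1] >= threshold)
```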
6. A method as claimed in claim 5, wherein the step of evaluating whether the object has been correctly classified using the plurality of optimized parameter configurations comprises:
extracting foreground objects in one image of the video input from the background;
determining object properties of the foreground objects;
determining the (x,y) coordinate position of the foreground objects within the image;
determining a corresponding cluster for each object property of the extracted objects;
retrieving a corresponding optimized parameter configuration from the parameter configuration catalogue 203;
evaluating each object property of the at least one object that is in the form of a cluster by comparing, in a parallel manner, each of the object properties with a corresponding reference property value that resides in the optimized parameter configuration;
fusing all the hypothetical results from each of the parallel property comparison steps to derive a weighted mean, wherein the weighted mean accurately describes properties of the objects; and
identifying the object based on the obtained weighted mean.
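To make the frame-level steps of claim 6 concrete, the sketch below extracts foreground blobs with an OpenCV background subtractor (assuming OpenCV 4) and derives the object properties and (x, y) positions that would then be compared against the catalogued reference values, for instance with the evaluate_object function sketched after claim 2. The property set and all names are illustrative assumptions, not the claimed implementation.

```python
# Illustrative sketch of foreground extraction and property determination; names assumed.
import cv2
import numpy as np

subtractor = cv2.createBackgroundSubtractorMOG2()


def extract_foreground_objects(frame: np.ndarray, min_area: float = 100.0):
    """Return per-blob properties and (x, y) centroid positions for one frame."""
    mask = subtractor.apply(frame)
    mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)[1]
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    objects = []
    for c in contours:
        area = cv2.contourArea(c)
        if area < min_area:
            continue
        x, y, w, h = cv2.boundingRect(c)
        objects.append({
            "position": (x + w / 2.0, y + h / 2.0),   # (x, y) within the image
            "properties": {"area": area, "aspect_ratio": w / float(h)},
        })
    return objects


# Each entry in `objects` would then be matched against the corresponding cluster's
# optimized parameter configuration retrieved from the catalogue.
```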
PCT/MY2013/000248 2012-12-06 2013-12-04 A self-learning video analytic system and method thereof WO2014088407A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2012005288 2012-12-06
MYPI2012005288 2012-12-06

Publications (1)

Publication Number Publication Date
WO2014088407A1 true WO2014088407A1 (en) 2014-06-12

Family

ID=50179894

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2013/000248 WO2014088407A1 (en) 2012-12-06 2013-12-04 A self-learning video analytic system and method thereof

Country Status (1)

Country Link
WO (1) WO2014088407A1 (en)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CÉDRIC SIMON ET AL: "Visual event recognition using decision trees", MULTIMEDIA TOOLS AND APPLICATIONS, KLUWER ACADEMIC PUBLISHERS, BO, vol. 50, no. 1, 23 September 2009 (2009-09-23), pages 95 - 121, XP019826770, ISSN: 1573-7721 *
RACHID BENMOKHTAR: "Robust human action recognition scheme based on high-level feature fusion", MULTIMEDIA TOOLS AND APPLICATIONS, vol. 69, no. 2, 21 March 2012 (2012-03-21), pages 253 - 275, XP055111780, ISSN: 1380-7501, DOI: 10.1007/s11042-012-1022-3 *
SALIGRAMA V ET AL: "Video Anomaly Identification", IEEE SIGNAL PROCESSING MAGAZINE, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 27, no. 5, 1 September 2010 (2010-09-01), pages 18 - 33, XP011317656, ISSN: 1053-5888 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3360077A4 (en) * 2015-10-06 2019-06-26 Agent Video Intelligence Ltd. Method and system for classifying objects from a stream of images
CN108229572A (en) * 2018-01-17 2018-06-29 北京腾云天下科技有限公司 A kind of parameter optimization method and computing device
CN108229572B (en) * 2018-01-17 2021-03-02 北京腾云天下科技有限公司 Parameter optimization method and computing equipment

Similar Documents

Publication Publication Date Title
US10248860B2 (en) System and method for object re-identification
JP7317919B2 (en) Appearance search system and method
CN108009473B (en) Video structuralization processing method, system and storage device based on target behavior attribute
US11188783B2 (en) Reverse neural network for object re-identification
CN108052859B (en) Abnormal behavior detection method, system and device based on clustering optical flow characteristics
Topkaya et al. Counting people by clustering person detector outputs
US7848548B1 (en) Method and system for robust demographic classification using pose independent model from sequence of face images
US10140508B2 (en) Method and apparatus for annotating a video stream comprising a sequence of frames
US8374440B2 (en) Image processing method and apparatus
US8472668B2 (en) Image analyzing apparatus, image analyzing method, and computer readable medium
CN102007499A (en) Detecting facial expressions in digital images
CN103605969A (en) Method and device for face inputting
CN110827432B (en) Class attendance checking method and system based on face recognition
CN110728216A (en) Unsupervised pedestrian re-identification method based on pedestrian attribute adaptive learning
James et al. Student monitoring system for school bus using facial recognition
CN113065568A (en) Target detection, attribute identification and tracking method and system
WO2014088407A1 (en) A self-learning video analytic system and method thereof
CN112307453A (en) Personnel management method and system based on face recognition
CN111738059A (en) Non-sensory scene-oriented face recognition method
CN111079757A (en) Clothing attribute identification method and device and electronic equipment
CN111062294B (en) Passenger flow queuing time detection method, device and system
US11423248B2 (en) Hierarchical sampling for object identification
KR101766467B1 (en) Alarming apparatus and methd for event occurrence, and providing method of event occurrence determination model
CN114898287A (en) Method and device for dinner plate detection early warning, electronic equipment and storage medium
CN111291597B (en) Crowd situation analysis method, device, equipment and system based on image

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13831946

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13831946

Country of ref document: EP

Kind code of ref document: A1