EP3635628A2 - Neuromorphic system for real-time visual activity recognition - Google Patents

Neuromorphic system for real-time visual activity recognition

Info

Publication number
EP3635628A2
EP3635628A2 EP18823688.9A EP18823688A EP3635628A2 EP 3635628 A2 EP3635628 A2 EP 3635628A2 EP 18823688 A EP18823688 A EP 18823688A EP 3635628 A2 EP3635628 A2 EP 3635628A2
Authority
EP
European Patent Office
Prior art keywords
interest
activity
neural network
objects
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP18823688.9A
Other languages
German (de)
English (en)
Other versions
EP3635628A4 (fr
Inventor
Deepak Khosla
Ryan M. UHLENBROCK
Yang Chen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HRL Laboratories LLC
Original Assignee
HRL Laboratories LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US15/883,822 (US11055872B1)
Application filed by HRL Laboratories LLC
Publication of EP3635628A2
Publication of EP3635628A4

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24143 Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/285 Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Definitions

  • the present invention relates to visual activity recognition and, more specifically, to a neuromorphic system for real-time visual activity recognition.
  • Automated pattern recognition and more specifically, visual image and/or activity recognition, have applications in a wide array of fields including navigation, manufacturing, surveillance, medicine, and other areas.
  • Some conventional methods for attempting to recognize activity include those disclosed in Large-Scale Video Classification With Convolutional Neural Networks (See the List of Incorporated Literature References, Literature Reference No. 1), and
  • the system includes one or more processors and a memory.
  • the memory is a non-transitory computer-readable medium having executable instructions encoded thereon, such that upon execution of the instructions, the one or more processors perform operations that include detecting a set of objects of interest in video data and determining an object classification for each object in the set of objects of interest, the set comprising at least one object of interest; forming a corresponding activity track for each object in the set of objects of interest by tracking each object across frames; for each object of interest and using a feature extractor, determining a corresponding feature in the video data by performing feature extraction based on the corresponding activity track, the feature extractor comprising a convolutional neural network; and for each object of interest, based on the output of the feature extractor, determining a corresponding activity classification for each object of interest.
  • controlling the device includes using a machine to send at least one of a visual, audio, or electronic alert regarding the activity classification.
  • controlling the device includes causing a ground-based or aerial vehicle to initiate a physical action.
  • the feature extractor includes a recurrent neural network.
  • the one or more processors further perform operations of, for each object of interest and using the recurrent neural network, extracting a corresponding temporal sequence feature based on at least one of the
  • the recurrent neural network uses Long Short-Term Memory as a temporal component.
  • the convolutional neural network includes at least five layers of convolution-rectification-pooling.
  • the convolutional neural network further includes at least two fully-connected layers.
  • the activity classification includes at least one of a probability and a confidence score.
  • the set of objects of interest includes multiple objects of interest, and the convolutional neural network, the recurrent neural network, and the activity classifier operate in parallel on multiple corresponding activity tracks.
  • the activity classification includes at least one of a
  • the present invention also includes a computer program product and a computer implemented method.
  • the computer program product includes computer-readable instructions stored on a non-transitory computer-readable medium that are executable by a computer having one or more processors, such that upon execution of the instructions, the one or more processors perform the operations listed herein.
  • the computer implemented method includes an act of causing a computer to execute such instructions and perform the resulting operations.
  • FIG. 1 is a block diagram depicting the components of a system according to various embodiments of the present invention.
  • FIG. 2 is an illustration of a computer program product embodying an aspect of the present invention.
  • FIG. 3 is a block diagram for real-time activity recognition, according to some embodiments.
  • FIG. 4 is a block diagram illustrating additional details for real time activity recognition, according to some embodiments.
  • FIG. 5 is a table showing percentages of correct classifications for various methods, according to some embodiments.
  • FIG. 6 is an illustration of an example image for in/out facility activity
  • FIG. 7 includes illustrations of example images for open/close trunk activity classification, according to some embodiments.
  • FIG. 8 includes illustrations of example images for open/close trunk activity and in/out vehicle classification, according to some embodiments.
  • FIG. 9 is a table illustrating percentage accuracy of classification for various scenarios.
  • FIG. 10 is a table illustrating results from testing of a full activity recognition pipeline, according to some embodiments.
  • FIG. 11 is a graph illustrating results from testing of a full activity recognition pipeline, according to some embodiments.
  • FIG. 12 is a block diagram depicting control of a device, according to various embodiments.
  • FIG. 13 is a flowchart illustrating operations for predicting movement of an object, according to various embodiments.
  • Various embodiments of the invention include three "principal" aspects.
  • the first is a system for visual activity recognition and, more specifically, a neuromorphic system for real-time visual activity recognition.
  • the system is typically in the form of a computer system operating software or in the form of a "hard-coded" instruction set. This system may be incorporated into a wide variety of devices that provide different functionalities.
  • the second principal aspect is a method, typically in the form of software, operated using a data processing system (computer).
  • the third principal aspect is a computer program product.
  • the computer program product generally represents computer-readable instructions stored on a non-transitory computer-readable medium such as an optical storage device, e.g., a compact disc (CD) or digital versatile disc (DVD), or a magnetic storage device such as a floppy disk or magnetic tape.
  • the computer system 100 is configured to perform calculations, processes, operations, and/or functions associated with a program or algorithm.
  • certain processes and steps discussed herein are realized as a series of instructions (e.g., software program) that reside within computer readable memory units and are executed by one or more processors of the computer system 100. When executed, the instructions cause the computer system 100 to perform specific actions and exhibit specific behavior, such as described herein.
  • the computer system 100 may include an address/data bus 102 that is configured to communicate information.
  • one or more data processing units such as a processor 104 (or processors) are coupled with the address/data bus 102.
  • the processor 104 is configured to process information and instructions.
  • the processor 104 is a microprocessor.
  • the processor 104 may be a different type of processor such as a parallel processor, application-specific integrated circuit (ASIC), programmable logic array (PLA), complex programmable logic device (CPLD), or a field programmable gate array (FPGA).
  • the computer system 100 is configured to utilize one or more data storage units.
  • the computer system 100 may include a volatile memory unit 106 (e.g., random access memory (“RAM”), static RAM, dynamic RAM, etc.) coupled with the address/data bus 102, wherein the volatile memory unit 106 is configured to store information and instructions for the processor 104.
  • the computer system 100 further may include a non-volatile memory unit 108 (e.g., read-only memory (“ROM”), programmable ROM (“PROM”), erasable programmable ROM
  • the computer system 100 may execute instructions retrieved from an online data storage unit such as in "Cloud" computing.
  • the computer system 100 also may include one or more interfaces, such as an interface 110, coupled with the address/data bus 102. The one or more interfaces are configured to enable the computer system 100 to interface with other electronic devices and computer systems.
  • the communication interfaces implemented by the one or more interfaces may include wireline (e.g., serial cables, modems, network adaptors, etc.) and/or wireless (e.g., wireless modems, wireless network adaptors, etc.) communication technology.
  • the computer system 100 may include an input device 112
  • the input device 112 is coupled with the address/data bus 102, wherein the input device 112 is configured to communicate information and command selections to the processor 104.
  • the input device 112 is an alphanumeric input device, such as a keyboard, that may include alphanumeric and/or function keys.
  • the input device 112 may be an input device other than an alphanumeric input device.
  • the computer system 100 may include a cursor control device 114 coupled with the address/data bus 102, wherein the cursor control device 114 is configured to communicate user input information and/or command selections to the processor 104.
  • the cursor control device 114 is implemented using a device such as a mouse, a track-ball, a trackpad, an optical tracking device, or a touch screen.
  • the cursor control device 114 is directed and/or activated via input from the input device 112, such as in response to the use of special keys and key sequence commands associated with the input device 112.
  • the cursor control device 114 is configured to be directed or guided by voice commands.
  • the computer system 100 further may include one or more storage devices, such as a storage device 116, coupled with the address/data bus 102.
  • the storage device 116 is configured to store information and/or computer executable instructions.
  • the storage device 116 is a storage device such as a magnetic or optical disk drive (e.g., hard disk drive (“HDD”), floppy diskette, compact disk read only memory (“CD-ROM”), digital versatile disk (“DVD”)).
  • a display device 118 is coupled with the address/data bus 102, wherein the display device 118 is configured to display video and/or graphics.
  • the display device 118 may include a cathode ray tube ("CRT"), liquid crystal display (“LCD”), field emission display (“FED”), plasma display, or any other display device suitable for displaying video and/or graphic images and alphanumeric characters recognizable to a user.
  • the computer system 100 presented herein is an example computing environment in accordance with an aspect.
  • the non-limiting example of the computer system 100 is not strictly limited to being a computer system.
  • an aspect provides that the computer system 100 represents a type of data processing analysis that may be used in accordance with various aspects described herein.
  • other computing systems may also be implemented. Indeed, the spirit and scope of the present technology is not limited to any single data processing environment.
  • one or more operations of various aspects of the present technology are controlled or implemented using computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, components and/or data structures that are configured to perform particular tasks or implement particular abstract data types.
  • an aspect provides that one or more aspects of the present technology are implemented by utilizing one or more distributed computing environments, such as where tasks are performed by remote processing devices that are linked through a communications network, or such as where various program modules are located in both local and remote computer-storage media including memory-storage devices.
  • An illustrative diagram of a computer program product embodying the present invention is depicted in FIG. 2.
  • the computer program product is depicted as floppy disk 200 or an optical disk 202 such as a CD or DVD.
  • the computer program product generally represents computer-readable instructions stored on any compatible non-transitory computer-readable medium.
  • the term "instructions" as used with respect to this invention generally indicates a set of operations to be performed on a computer, and may represent pieces of a whole program or individual, separable, software modules.
  • Non-limiting examples of "instruction" include computer program code (source or object code) and "hard-coded" electronics (i.e., computer operations coded into a computer chip).
  • the "instruction" is stored on any non-transitory computer-readable medium, such as in the memory of a computer or on a floppy disk, a CD-ROM, or a flash drive. In either event, the instructions are encoded on a non-transitory computer-readable medium.
  • This disclosure describes a novel real-time neuromorphic method and system for activity recognition such as in streaming or recorded videos from static and/or moving platforms.
  • a novel aspect of some systems involves the specific use of, implementation of, and integrating of five modules: object detection, tracking, convolutional neural network image feature extractor, recurrent neural network sequence feature extractor, and an activity classifier.
  • the system and method provide real-time visual processing even on small, low power, low cost platforms such as Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs). This approach is also amenable for implementation on emerging spiking
  • Some other example applications may include navigation, manufacturing, medical technology, Intelligence, Surveillance, and Reconnaissance (ISR), border security, autonomous UAV and UGV, mission safety, human activity detection, threat detection, distributed mobile operations, etc. Further details regarding the system and various embodiments are provided below.
  • (4) Specific Details of Various Embodiments
  • FIG. 3 is a block diagram of a system 300 for real-time activity recognition according to some embodiments.
  • the system performs real-time activity recognition in streaming or recorded videos 302 from static or moving platforms.
  • the system integrates one or more of five modules: object detection 304, tracks formation 306 (e.g., tracking), a convolutional neural network image feature extractor 308, a recurrent neural network sequence feature extractor 310, and a final activity classifier 312.
  • systems and methods covered by this disclosure may use some, parts of, or all of the functions performed by the modules above.
  • the system identifies all objects of interest, tracks them, and handles processing for all activity tracks to give activity classification of all objects of interest. For example, in various embodiments, if 1, 5, 10 or more objects of interest are present, the system will detect, track, and give activity classification for all the objects of interest.
  • the object detection 304 module finds objects of interest in the input video 302 and outputs their bounding box location and class label.
  • the input video may be prerecorded, or may be acquired real time via a camera or other sensor.
  • the input video may include video data comprising multiple sequentially recorded image frames.
  • when the objective is human activity recognition, this module detects and classifies all human or "MAN" objects (e.g., persons) in the incoming video 302.
  • the objective is vehicle activity recognition
  • object detection 304 detects a set of objects of interest in video data and determines an object classification for each object of interest in the set of objects of interest.
  • the set of objects may include all objects of interest in the video, which may include quantities and/or ranges of quantities such as 1-3, 1-5, 1-10, or more objects of interest.
  • any suitable object detector may be implemented for this object detection module 304, non-limiting examples of which include those as described in Literature Reference Nos. 5, 6, 7, and 8 (see the List of Incorporated Literature References).
  • the system includes object detection 304, tracks formation 306, and CNN 308.
  • the system also includes RNN 310 in addition to the other modules. Other embodiments are possible.
  • the detected objects of interest serve as seeds for the next module (e.g., tracks formation 306), as explained in greater detail below.
  • activity tracks are formed by tracking each of object detection 304's detected objects across frames, and forming a corresponding activity track for each detected object.
  • the system uses a multi-target Kalman filter tracker.
  • alternate trackers may include OpenTLD or Mean Shift Tracking (see Literature Reference Nos. 15 and 16).
  • the system further performs customized non-maximum suppression (see Literature Reference No. 9), and uses heuristics to identify and eliminate false alarm tracks.
  • the Kalman filter is used to predict the centroid of each track in the current frame, and update a bounding box of a corresponding tracked object accordingly.
  • a track is a frame-number indexed list of bounding box positions (centered around detected object(s) whose position can change from frame to frame as the object moves) with a unique ID.
  • the current frame is the frame that is being processed whether it is a recorded video or a streaming live video.
  • "update" refers to determining where to draw the defining boundaries of the bounding box. Based on this update, in some embodiments, the whole bounding box should be moved to be centered on the predicted centroid.
  • the width and height of the bounding box in the previous frame are used as the current prediction of the size.
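  • The bounding box update described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes boxes are stored as (x, y, width, height) with (x, y) the top-left corner, the name `recenter_box` is hypothetical, and the predicted centroid is taken as already supplied by the Kalman filter.

```python
from typing import Tuple

# Assumed box layout: (x_top_left, y_top_left, width, height).
Box = Tuple[float, float, float, float]


def recenter_box(prev_box: Box, predicted_centroid: Tuple[float, float]) -> Box:
    """Move the previous frame's box so it is centered on the predicted centroid.

    Width and height are carried over unchanged from the previous frame,
    which serves as the current size prediction.
    """
    _, _, w, h = prev_box
    cx, cy = predicted_centroid
    return (cx - w / 2.0, cy - h / 2.0, w, h)


print(recenter_box((10.0, 20.0, 40.0, 80.0), (35.0, 62.0)))
```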
  • the cost is computed using the bounding box overlap ratio.
  • the cost is a ratio (e.g., a number between 0-1) computed by determining the area of overlap between two rectangles.
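  • One way to compute such an overlap ratio is sketched below. Two points are assumptions rather than statements from the source: the overlap area is normalized by the area of the union of the two rectangles (intersection over union), and the assignment cost is taken as one minus the overlap so that a better match costs less. The function names are hypothetical.

```python
from typing import Tuple

# Assumed box layout: (x_top_left, y_top_left, width, height).
Box = Tuple[float, float, float, float]


def overlap_ratio(a: Box, b: Box) -> float:
    """Area of overlap divided by area of union, a number between 0 and 1."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def assignment_cost(predicted: Box, detected: Box) -> float:
    """Assumed convention: cost = 1 - overlap, so identical boxes cost 0."""
    return 1.0 - overlap_ratio(predicted, detected)


print(overlap_ratio((0, 0, 2, 2), (1, 1, 2, 2)))
```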
  • the Munkres' version of the Hungarian algorithm is used to compute an assignment which minimizes the total cost (see Literature Reference Nos. 10 and 11).
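  • The assignment step can be illustrated with a small sketch. Note that this is not the Munkres algorithm: it is an exhaustive search that returns the same minimum-cost assignment and is only practical for a handful of tracks. The function name and the cost matrix values are hypothetical, and the sketch assumes there are at least as many detections as tracks.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def min_cost_assignment(
    cost: Sequence[Sequence[float]],
) -> Tuple[List[Tuple[int, int]], float]:
    """Exhaustively find the track-to-detection assignment with minimum total cost.

    cost[i][j] is the cost of assigning track i to detection j. This is a
    brute-force stand-in for the Munkres algorithm, for illustration only.
    """
    n_tracks = len(cost)
    n_dets = len(cost[0]) if n_tracks else 0
    best_pairs: List[Tuple[int, int]] = []
    best_total = float("inf")
    # Each permutation picks a distinct detection for every track.
    for perm in permutations(range(n_dets), n_tracks):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best_total = total
            best_pairs = list(enumerate(perm))
    return best_pairs, best_total


# Hypothetical costs for 2 tracks (rows) and 3 detections (columns).
costs = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
]
pairs, total = min_cost_assignment(costs)
print(pairs, total)
```

  • Detections left unassigned by the search (here, the third column) would be candidates for starting new tracks; that handling is outside this sketch.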
  • sporadic detections of moving trees, shadows, etc. are removed by only considering tracks with a minimum duration of T seconds (e.g., T is nominally 2 seconds).
  • the output of the tracks formation 306 module is a set of persistent object tracks that have a minimum duration of T seconds. For example, if a person is carrying a gun in the video and is visible for 5 seconds, tracks formation 306 will output a track of the tracked object (e.g., the gun, the person with the gun, part of the gun such as the gun barrel, etc.) with a unique track number during those 5 seconds.
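  • The minimum-duration filter can be sketched as below. The `Track` class, its field names, and the frame-rate parameter are hypothetical; the sketch assumes a track is a frame-number indexed collection of boxes with a unique ID, and measures duration from the first to the last frame in which the track appears.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]


@dataclass
class Track:
    track_id: int
    boxes: Dict[int, Box] = field(default_factory=dict)  # frame number -> box

    def duration_seconds(self, fps: float) -> float:
        """Time spanned from the first to the last frame, counting both ends."""
        if not self.boxes:
            return 0.0
        return (max(self.boxes) - min(self.boxes) + 1) / fps


def persistent_tracks(
    tracks: List[Track], fps: float, min_seconds: float = 2.0
) -> List[Track]:
    """Keep only tracks that last at least min_seconds (T, nominally 2 s)."""
    return [t for t in tracks if t.duration_seconds(fps) >= min_seconds]


box = (0.0, 0.0, 10.0, 10.0)
person = Track(1, {f: box for f in range(150)})      # 5 s at 30 frames/s
shadow = Track(2, {f: box for f in range(10, 25)})   # 0.5 s at 30 frames/s
kept = persistent_tracks([person, shadow], fps=30.0)
print([t.track_id for t in kept])
```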
  • the Feature Extractor 314, which includes a Convolutional Neural Network (CNN 308) module, receives the activity tracks (e.g., persistent tracks or other tracks) as an input from tracks formation 306, and based on each track, automatically learns what intermediate features are most useful (e.g., determines a corresponding feature for each object of interest based on the corresponding activity track) from raw image information within each track bounding box.
  • no explicit features are extracted.
  • lower layers of the CNN may learn edge or orientation features and upper layers of the CNN may learn higher-level shape or color information.
  • the values at the nodes of the various CNN layers are the features. For example, if the last layer of the CNN has 4096 nodes, the feature vector may be of size 4096.
  • Track bounding boxes may be enlarged by X% (typically 20%) before feature extraction to help with jitter in the underlying detection bounding boxes. In some embodiments, the bounding boxes may be enlarged by between 5% and 40%, although other ranges may be possible.
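  • A minimal sketch of the enlargement step follows. It assumes that X% is applied to the width and the height (not to the area), that the box grows symmetrically about its center, and that the result is clipped to the image when an image size is given. The function name and box layout are hypothetical.

```python
from typing import Optional, Tuple

# Assumed box layout: (x_top_left, y_top_left, width, height).
Box = Tuple[float, float, float, float]


def enlarge_box(
    box: Box,
    percent: float = 20.0,
    image_size: Optional[Tuple[float, float]] = None,
) -> Box:
    """Grow a box by `percent` in width and height about its center."""
    x, y, w, h = box
    scale = 1.0 + percent / 100.0
    new_w, new_h = w * scale, h * scale
    new_x = x - (new_w - w) / 2.0
    new_y = y - (new_h - h) / 2.0
    if image_size is not None:
        # Clip the enlarged box to the image bounds (width, height).
        img_w, img_h = image_size
        x0, y0 = max(0.0, new_x), max(0.0, new_y)
        x1, y1 = min(img_w, new_x + new_w), min(img_h, new_y + new_h)
        return (x0, y0, x1 - x0, y1 - y0)
    return (new_x, new_y, new_w, new_h)


print(enlarge_box((100.0, 100.0, 50.0, 100.0)))
```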
  • the structure of the CNN 308 in the model is based on AlexNet (see Literature Reference No. 12) and has 5 layers of convolution-rectification-pooling followed by 2 fully-connected layers. In an embodiment, the dimensionality of the CNN 308 output is 4096 features for each frame of the track.
  • the system uses the 5-layer custom-designed and trained CNN 308 of Literature Reference No. 8.
  • the Feature Extractor 314 not only includes CNN 308, but also a Recurrent Neural Network (RNN 310) that extracts temporal sequence features based on the outputs from CNN 308 (e.g., a CNN feature).
  • CNN 308 encodes features per frame
  • the RNN 310 concatenates features from multiple frames (i.e., a temporal sequence).
  • the RNN 310 is not part of the system.
  • the Long Short-Term Memory (LSTM) network was used as the temporal component for the RNN 310 (see Literature Reference No. 13).
  • a final layer classifier (e.g., activity classifier 312).
  • the system includes the activity classifier 312, which receives the output from Feature Extractor 314, which may be the output from tracks formation 306 (not shown), from CNN 308 (e.g., when RNN 310 is not part of the system, also not shown), or from RNN 310. Based on the output of one or more of tracks formation 306 (e.g., an activity track), CNN 308 (e.g., a feature), and RNN 310 (e.g. a temporal feature), the activity classifier 312 determines an activity classification for the object of interest. In various embodiments, the activity classifier 312 receives inputs from RNN 310 if used, and otherwise from the CNN 308 if the RNN 310 was not used. In some embodiments, the activity classifier 312 is configured to send alerts and tweets comprising the activity classification, time, and image or video to a user's cell phone or a central monitoring station.
  • a final fully-connected layer (e.g., activity classifier 312) with K outputs gives the final class probability (e.g., the last layer values are the activity classification results).
  • values are typically between 0 and 1, and a high score for an activity type indicates a high confidence for that activity type.
  • the activity classifier 312 may be a Support Vector Machine (SVM) (e.g., a support vector network) classifier with K outputs, and the RNN features from RNN 310 can be sent to the SVM (see Literature Reference No. 14).
  • the SVM is a supervised learning model with one or more associated learning algorithms that analyze data used for classification and/or regression analysis.
  • Some algorithms for finding the SVM classifier include sub-gradient descent and coordinate descent.
  • the final output is a probability or confidence score (e.g., 75%, or a range such as from 0 to 1) for each of the K classes.
  • no softmax may be used, and instead a threshold is placed on the output response of the K output nodes to determine when an activity of interest is detected. Other activities, e.g. a person simply walking, should have no output above the threshold and receive effectively a label of "no relevant activity.”
  • softmax refers to normalizing the node values so they sum to 1, and the highest value then becomes the declared activity. In winner-take-all embodiments, the activity with the highest confidence is the activity label of that track.
  • each node in the final layer may represent an activity, and the methods described above are used to determine the final output based on those node values (e.g., 80% person digging, 15% person standing, 5% person aiming a gun).
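  • The two decision rules described above (softmax with winner-take-all, and a threshold without softmax) can be sketched as follows. The function names, the example scores, and the threshold value are hypothetical; the softmax shown is the standard exponential normalization, which is one common way to make node values sum to 1.

```python
import math
from typing import Dict, Tuple


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Normalize node values so they are positive and sum to 1."""
    peak = max(scores.values())  # subtract the peak for numerical stability
    exps = {k: math.exp(v - peak) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def winner_take_all(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the activity with the highest softmax probability."""
    probs = softmax(scores)
    label = max(probs, key=probs.get)
    return label, probs[label]


def thresholded_label(outputs: Dict[str, float], threshold: float) -> str:
    """No softmax: report the strongest node only if it exceeds the threshold."""
    label = max(outputs, key=outputs.get)
    return label if outputs[label] > threshold else "no relevant activity"


node_values = {"digging": 2.0, "standing": 0.5, "aiming a gun": -1.0}
print(winner_take_all(node_values))
print(thresholded_label({"digging": 0.1, "aiming a gun": 0.2}, threshold=0.5))
```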
  • the Feature Extractor 314 (e.g., the CNN 308 and/or RNN 310) and activity classifier 312 modules are run in parallel for each track from the tracks formation 306 module.
  • the Feature Extractor 314 (e.g., which includes the CNN 308 and/or RNN 310) and activity classifier 312 may operate sequentially based on the activity tracks and the output of the previously operating modules.
  • every track from tracks formation 306 goes through its own 308-310-312 or 308-312 processing that is always sequential (per track). Since there can be several tracks in a video, they all have their own independent processing pipeline 308-310-312 or 308-312 and generate independent activity classification results.
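  • The per-track processing structure can be sketched as follows: stages run sequentially within a track, and tracks are processed independently of one another. Everything here is illustrative: the function names are hypothetical, the `toy_cnn` and `toy_classifier` callables are placeholders for trained networks, and a thread pool is only one of several ways to run tracks concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Optional, Sequence

Frames = Sequence[Sequence[float]]  # stand-in for the image crops of one track


def process_track(
    frames: Frames,
    cnn: Callable[[Sequence[float]], Any],
    classifier: Callable[[Any], str],
    rnn: Optional[Callable[[Any], Any]] = None,
) -> str:
    """Run one track through 308 -> (310) -> 312, always in that order."""
    per_frame = [cnn(f) for f in frames]               # 308: features per frame
    features = rnn(per_frame) if rnn else per_frame    # 310: optional
    return classifier(features)                        # 312: activity label


def process_all_tracks(
    tracks: Dict[int, Frames],
    cnn: Callable[[Sequence[float]], Any],
    classifier: Callable[[Any], str],
    rnn: Optional[Callable[[Any], Any]] = None,
) -> Dict[int, str]:
    """Give every track its own independent pipeline and collect the labels."""
    with ThreadPoolExecutor() as pool:
        futures = {
            tid: pool.submit(process_track, frames, cnn, classifier, rnn)
            for tid, frames in tracks.items()
        }
        return {tid: fut.result() for tid, fut in futures.items()}


# Toy placeholders so the sketch runs; real modules would be trained networks.
def toy_cnn(frame: Sequence[float]) -> list:
    return [sum(frame) / len(frame)]


def toy_classifier(features: list) -> str:
    mean = sum(v[0] for v in features) / len(features)
    return "activity" if mean > 0.5 else "no relevant activity"


tracks = {1: [[0.9, 0.8], [0.7, 0.9]], 2: [[0.1, 0.2], [0.0, 0.1]]}
results = process_all_tracks(tracks, toy_cnn, toy_classifier)
print(results)
```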
  • FIG. 4 shows more details of the Feature Extractor 314, CNN 308, RNN 310, and activity classifier 312 modules.
  • FIG. 4 includes representations of an activity classification subsystem 400 that includes CNN 308, RNN 310, and activity classifier 312.
  • the subsystem 400 receives the object tracks 402 and outputs an activity label 416 (e.g., an activity classification).
  • Layer 410 represents the last layer of CNN 308, and in some embodiments, has 4096 nodes (e.g., features).
  • Embodiments were evaluated on the downloaded Video and Image Retrieval and Analysis Tool (VIRAT) dataset.
  • the implemented evaluation focused on activity classification only (i.e., CNN 308, RNN 310, and activity classifier 312 modules).
  • Four different methods were evaluated using ground-truth based video clips (16 evenly spaced frames from each activity and rescaling the images to 360x360 pixels).
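  • Selecting 16 evenly spaced frames from a clip can be sketched as below. The function name is hypothetical, and the choice to always include the first and last frame is an assumption; the source only states that the frames are evenly spaced.

```python
from typing import List


def evenly_spaced_indices(n_frames: int, n_samples: int = 16) -> List[int]:
    """Pick n_samples frame indices spread evenly over a clip of n_frames.

    Assumes the first and last frames are both included. If the clip has
    fewer frames than n_samples, some indices repeat.
    """
    if n_frames <= 0 or n_samples <= 0:
        raise ValueError("n_frames and n_samples must be positive")
    if n_samples == 1:
        return [0]
    step = (n_frames - 1) / (n_samples - 1)
    return [round(i * step) for i in range(n_samples)]


print(evenly_spaced_indices(31))
```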
  • the SVMs were trained on either the CNN features averaged across the 16 frames, RNN features averaged across the 16 frames, RNN features concatenated across the 16 frames, or RNN features selected from the last frame.
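  • The three ways of turning per-frame features into a single SVM input (averaging, concatenation, last frame) can be sketched as follows. The function names are hypothetical and plain Python lists stand in for feature vectors. For D-dimensional per-frame features over 16 frames, averaging and last-frame selection both yield D values, while concatenation yields 16 × D values.

```python
from typing import List, Sequence

Vector = List[float]


def average_features(frames: Sequence[Vector]) -> Vector:
    """Element-wise mean of the per-frame feature vectors."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def concatenate_features(frames: Sequence[Vector]) -> Vector:
    """All per-frame feature vectors joined end to end, in frame order."""
    return [value for frame in frames for value in frame]


def last_frame_features(frames: Sequence[Vector]) -> Vector:
    """Feature vector of the final frame only."""
    return list(frames[-1])


clip = [[1.0, 2.0], [3.0, 4.0]]  # 2 frames of 2-dimensional toy features
print(average_features(clip))
print(concatenate_features(clip))
print(last_frame_features(clip))
```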
  • the performance was evaluated with cross-validation using a split of 80% training and 20% testing.
  • the table of FIG. 5 shows the percentages of correct classifications with these four methods.
  • FIG. 6 shows an example in/out facility classification image.
  • FIG. 6 includes image 602 of a person getting in or out of a facility, and a representation of a bounding box 604.
  • the data was collected from two pan-tilt-zoom surveillance cameras mounted to buildings looking down to a parking lot. About 45 minutes of video were recorded from each camera while people went through the parking lot, specifically performing the activities of opening/closing a trunk and getting in/out of a vehicle (see FIGS. 7 and 8 for example video images).
  • FIG. 7 includes images 702, 704, and 706 representing a person next to an open trunk, a person between two vehicles, and a person near a closed trunk, respectively.
  • FIG. 7 further includes images 708 and 710 representing a person next to a closed trunk and a person not next to a vehicle trunk, respectively.
  • FIG. 8 includes an image 802 of a person next to an open trunk, and an image 804 of a person getting in or out of a vehicle.
  • End-to-end testing and evaluation have also been completed for an embodiment of a full activity recognition pipeline (e.g., object detection 304, tracks formation 306, CNN 308, RNN 310, activity classifier 312) on visible-domain color videos from a moving ground vehicle.
  • additional or different modules may be used to constitute a full activity recognition pipeline.
  • the second evaluation covered the full system with five integrated modules (object detection 304, tracks formation 306, CNN 308, RNN 310, activity classifier 312) on 30 videos, and results are shown in FIG. 11. These videos were sequestered and not used or seen during any training steps.
  • the object detection module for this evaluation uses information from Literature Reference No. 8. Tracks were classified every 16 frames, and the most frequently occurring label for each track was then applied to all frames in that track. The output was compared against ground truth, and ROC evaluations were done (FIG. 6). These evaluations did not involve filtering by the number of pixels on a target (e.g., total pixels making up an object of interest). The results show an overall accuracy of 29% at FPPI < 0.02. The chance accuracy for this dataset is 8%.
  • a processor 104 may be used to control a device 1204 (e.g., a mobile device display, a virtual reality display, an augmented reality display, a computer monitor, a motor, a machine, a drone, a camera, etc.) based on an output of one or more of the modules described above.
  • the device 1204 may be controlled based on an activity classification determined by activity classifier 312.
  • the control of the device 1204 may be used to send at least one of a visual, audio, or electronic alert, such as regarding the activity classification of an object of interest (e.g., a pedestrian may be given an activity classification of moving into the path of a vehicle).
  • a visual alert may be a warning light, a message provided on a display, or an image of the detected object.
  • An audible alert may be a tone or other sound.
  • An electronic alert may be an email, text message, or social media message.
  • the device 1204 may be controlled to cause the device 1204 to move or otherwise initiate a physical action (e.g., a maneuver) based on the prediction.
  • an aerial or ground based vehicle e.g., a drone
  • the device 1204 may be an actuator or motor that is used to cause a camera (or sensor) or other machine to move.
  • the device 1204 may be a camera or vehicle or other machine.
  • the device 1204 can receive or send alerts and/or tweets comprising an activity classification, time, and image or video to a user's cell phone or a central monitoring station.
  • Example operations may include changing a field of view (e.g., orientation) of a camera to encompass or otherwise be directed towards the location where a classified activity was detected, which may allow the video image to be centered and/or zoomed in on an object of interest that is performing a classified activity of interest.
  • Basic motor commands are known in the art, as are systems and algorithms for keeping or changing position, speed, acceleration, and orientation.
  • FIG. 13 is a flowchart illustrating operations for predicting movement of one or more objects of interest, according to an embodiment.
  • a set of objects of interest in video data is detected and an object classification is determined for each object in the set of objects of interest, the set comprising at least one object of interest.
  • a corresponding activity track is formed for each object in the set of objects of interest by tracking each object across frames.
  • a corresponding feature in the video data is determined by performing feature extraction based on the corresponding activity track, the feature extractor comprising a convolutional neural network.
  • a corresponding activity classification is determined for each object of interest.
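The per-track flow described above (16 evenly spaced frames per clip, per-frame features combined by averaging, concatenation, or last-frame selection, and a most-frequent label applied to each track) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the `classify_window` callable standing in for CNN 308, RNN 310, and activity classifier 312, and the choice to return `None` for tracks shorter than one window are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Sequence

WINDOW = 16  # frames per classification window, as in the evaluation text


def evenly_spaced_indices(num_frames: int, count: int = WINDOW) -> List[int]:
    """Pick `count` frame indices spread evenly over a clip of `num_frames` frames."""
    if num_frames <= 0:
        return []
    if count == 1:
        return [0]
    step = (num_frames - 1) / (count - 1)
    return [round(i * step) for i in range(count)]


def pool_features(per_frame: Sequence[Sequence[float]], mode: str) -> List[float]:
    """Combine per-frame feature vectors by 'mean', 'concat', or 'last'."""
    if mode == "mean":
        n = len(per_frame)
        return [sum(column) / n for column in zip(*per_frame)]
    if mode == "concat":
        return [value for frame in per_frame for value in frame]
    if mode == "last":
        return list(per_frame[-1])
    raise ValueError(f"unknown pooling mode: {mode}")


def label_track(
    frames: Sequence,
    classify_window: Callable[[Sequence], str],
    window: int = WINDOW,
) -> Optional[str]:
    """Classify each full window of a track and return the most frequent label.

    `classify_window` is a hypothetical stand-in for the CNN/RNN/classifier chain.
    Returns None when the track is shorter than one window (an assumption).
    """
    labels = [
        classify_window(frames[start:start + window])
        for start in range(0, len(frames) - window + 1, window)
    ]
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


def label_all_tracks(
    tracks: Dict[int, Sequence],
    classify_window: Callable[[Sequence], str],
) -> Dict[int, Optional[str]]:
    """Process every track independently, one label per track."""
    return {
        track_id: label_track(frames, classify_window)
        for track_id, frames in tracks.items()
    }
```

Each track is handled on its own in `label_all_tracks`, which mirrors the independent per-track pipelines described earlier; a real system would replace `classify_window` with trained models and might run the tracks in parallel.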

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Neurology (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Closed-Circuit Television Systems (AREA)

Abstract

The invention relates to a visual activity recognition system that includes one or more processors and a memory, the memory being a non-transitory computer-readable medium having executable instructions encoded thereon such that, upon execution of the instructions, the one or more processors perform operations including detecting a set of objects of interest in video data and determining an object classification for each object in the set of objects of interest, the set comprising at least one object of interest. The one or more processors also perform operations including forming a corresponding activity track for each object in the set of objects of interest by tracking each object across frames. The one or more processors also perform operations including, for each object of interest and using a feature extractor, determining a corresponding feature in the video data. The system may provide a report to a user's cell phone or to a central monitoring facility.
EP18823688.9A 2017-06-07 2018-04-06 Neuromorphic system for real-time visual activity recognition Pending EP3635628A4 (fr)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201762516217P 2017-06-07 2017-06-07
US15/883,822 US11055872B1 (en) 2017-03-30 2018-01-30 Real-time object recognition using cascaded features, deep learning and multi-target tracking
PCT/US2018/026432 WO2019005257A2 (fr) 2017-06-07 2018-04-06 Neuromorphic system for real-time visual activity recognition

Publications (2)

Publication Number Publication Date
EP3635628A2 (fr) 2020-04-15
EP3635628A4 (fr) 2021-03-10

Family

ID=64741843

Family Applications (1)

Application Number Title Priority Date Filing Date
EP18823688.9A Pending EP3635628A4 (fr) 2017-06-07 2018-04-06 Neuromorphic system for real-time visual activity recognition

Country Status (3)

Country Link
EP (1) EP3635628A4 (fr)
CN (1) CN110603542B (fr)
WO (1) WO2019005257A2 (fr)

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ES2522589T3 (es) * 2007-02-08 2014-11-17 Behavioral Recognition Systems, Inc. Behavioral recognition system
US8175333B2 (en) * 2007-09-27 2012-05-08 Behavioral Recognition Systems, Inc. Estimator identifier component for behavioral recognition system
US8345984B2 (en) * 2010-01-28 2013-01-01 Nec Laboratories America, Inc. 3D convolutional neural networks for automatic human action recognition
TWI430212B (zh) * 2010-06-08 2014-03-11 Gorilla Technology Inc System and method for detecting abnormal behavior using automatic multi-feature clustering
US9008366B1 (en) * 2012-01-23 2015-04-14 Hrl Laboratories, Llc Bio-inspired method of ground object cueing in airborne motion imagery
US9576214B1 (en) * 2012-01-23 2017-02-21 Hrl Laboratories, Llc Robust object recognition from moving platforms by combining form and motion detection with bio-inspired classification
US8700251B1 (en) * 2012-04-13 2014-04-15 Google Inc. System and method for automatically detecting key behaviors by vehicles
MX346218B (es) * 2012-09-05 2017-03-09 Element Inc System and method for biometric authentication in connection with camera-equipped devices
EP2720172A1 (fr) * 2012-10-12 2014-04-16 Nederlandse Organisatie voor toegepast -natuurwetenschappelijk onderzoek TNO System and method for video access based on action type detection
WO2014098783A1 (fr) * 2012-12-21 2014-06-26 Echostar Ukraine, LLC On-screen identification and tracking
US9449230B2 (en) * 2014-11-26 2016-09-20 Zepp Labs, Inc. Fast object tracking framework for sports video recognition
GB201512283D0 (en) * 2015-07-14 2015-08-19 Apical Ltd Track behaviour events

Also Published As

Publication number Publication date
CN110603542A (zh) 2019-12-20
WO2019005257A3 (fr) 2019-05-02
WO2019005257A2 (fr) 2019-01-03
EP3635628A4 (fr) 2021-03-10
CN110603542B (zh) 2023-04-25

Similar Documents

Publication Publication Date Title
US10997421B2 (en) Neuromorphic system for real-time visual activity recognition
Tripathi et al. Convolutional neural networks for crowd behaviour analysis: a survey
US10891488B2 (en) System and method for neuromorphic visual activity classification based on foveated detection and contextual filtering
Tripathi et al. Suspicious human activity recognition: a review
Sharma et al. Fisher’s linear discriminant ratio based threshold for moving human detection in thermal video
Arroyo et al. Expert video-surveillance system for real-time detection of suspicious behaviors in shopping malls
US11055872B1 (en) Real-time object recognition using cascaded features, deep learning and multi-target tracking
Yu et al. Multiple target tracking using spatio-temporal markov chain monte carlo data association
CN111985385B (zh) Behavior detection method, apparatus, and device
US20180232904A1 (en) Detection of Risky Objects in Image Frames
CN111566661B (zh) System, method, and computer-readable medium for visual activity classification
Duque et al. Prediction of abnormal behaviors for intelligent video surveillance systems
Ferryman et al. Robust abandoned object detection integrating wide area visual surveillance and social context
Roy et al. Suspicious and violent activity detection of humans using HOG features and SVM classifier in surveillance videos
Campo et al. Static force field representation of environments based on agents’ nonlinear motions
López-Rubio et al. Anomalous object detection by active search with PTZ cameras
Narayanan et al. Real-time video surveillance system for detecting malicious actions and weapons in public spaces
CN110603542B (zh) System, method, and computer-readable medium for visual activity recognition
Yani et al. An efficient activity recognition for homecare robots from multi-modal communication dataset.
Karishma et al. Artificial Intelligence in Video Surveillance
Becker et al. Detecting abandoned objects using interacting multiple models
Castillo et al. A review on intelligent monitoring and activity interpretation
Mahajan et al. An Introduction to Deep Learning‐Based Object Recognition and Tracking for Enabling Defense Applications
Madrigal et al. Improving multiple pedestrians tracking with semantic information
Srivastava et al. Anomaly Detection Approach for Human Detection in Crowd Based Locations

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20191127

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
A4 Supplementary search report drawn up and despatched

Effective date: 20210205

RIC1 Information provided on ipc code assigned before grant

Ipc: G06K 9/62 20060101ALI20210201BHEP

Ipc: G06K 9/00 20060101ALI20210201BHEP

Ipc: G06K 9/32 20060101AFI20210201BHEP

Ipc: G06N 3/02 20060101ALI20210201BHEP

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20230315

P01 Opt-out of the competence of the unified patent court (upc) registered

Effective date: 20230525