WO2020075147A1 - Intelligent vision system and methods - Google Patents

Intelligent vision system and methods

Info

Publication number
WO2020075147A1
Authority
WO
WIPO (PCT)
Prior art keywords
camera
image data
data stream
neural network
intelligent vision
Prior art date
Application number
PCT/IB2019/058721
Other languages
French (fr)
Inventor
Mohamed SOMER
Abdul Rahman ABDUL GHANI
Original Assignee
Fovea Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fovea Technology filed Critical Fovea Technology
Publication of WO2020075147A1 publication Critical patent/WO2020075147A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/18Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • H04N7/183Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast for receiving images from a single remote source
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/45Cameras or camera modules comprising electronic image sensors; Control thereof for generating image signals from two or more image sensors being of different type or operating in different modes, e.g. with a CMOS sensor for moving images in combination with a charge-coupled device [CCD] for still images
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/50Constructional details
    • H04N23/51Housings
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/58Means for changing the camera field of view without moving the camera body, e.g. nutating or panning of optics or image sensors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Studio Devices (AREA)

Abstract

Intelligent vision systems and methods which are capable of focusing on subtle visual cues at high resolution, using principles of a foveal view coupled with an attention mechanism that functions like the human eye, are described. The intelligent vision systems and methods are based on artificial intelligence-deep learning neural network models that rely on a hierarchy of visual streams towards achieving human and greater levels of sharp vision. The intelligent vision systems and methods process multiple high-resolution field of view visual streams to create a foveal view representation of the environment, which is fed forward to the neural network, which controls the pan/tilt positioning of the cameras based on an attention mechanism to achieve focus on a Subject of Interest (SOI).

Description

Intelligent Vision System and Methods
Background
Visual acuity is defined as sharpness or clarity of vision. Humans visually perceive their environment with very high levels of detail at very high resolutions (60 to 120 pixels/degree) in a 360-degree environment. This visual acuity allows humans to perform tasks such as driving and engaging in social interactions that require attention to subtle visual cues such as the intention of a pedestrian or facial expression.
Visual acuity is a function of the fovea, the portion of the eye capable of the greatest acuity. At the center of the fovea, visual acuity is at its greatest. Moving away from the fovea, visual acuity progressively diminishes, establishing a visual representation model that is a hierarchical layering of varying degrees of acuity, as shown in Figure 1, which depicts the fovea of the human eye.
This hierarchy plays a crucial role in the human attention mechanism, as the wider representations contain low-resolution information about the environment that guides the more highly detailed foveal view towards the Subject-of-Interest (SOI) through the hierarchical processing of progressive views with ever-increasing acuity.
An example of this occurs when something in the corner of the eye catches the person’s attention and they move their eyes to achieve a more highly detailed view of the object. The eyes move to align and focus the fovea onto the SOI so that it can be seen with a high degree of visual acuity. While in the corner of the eye, the person may not be able to identify the object, or discern details, but once the object is within the foveal view of the eye, the object and details are easily identifiable due to the high degree of acuity of the fovea.
Disclosed herein is a method for an Intelligent Vision System that is able to function as well as, or better than, human vision and its attention mechanism. This Intelligent Vision System is based on a unique combination of Artificial Intelligence (AI) and commercially available cameras and servos.
To replace human vision with an Intelligent Vision System, an AI capability is desired to perceive the environment in a manner that is similar to humans. This must be done at resolutions with very high levels of detail, 60 to 120 pixels/degree or greater, and it must also support visual hierarchical layering in varying degrees of visual acuity, just like human vision.
The improvement that this Intelligent Vision System offers over existing cameras and camera systems, both with and without AI support, is that it is based on the human vision model of hierarchical layering in varying degrees of visual acuity, coupled with the human attention mechanism and a unique coupling of AI, cameras and servos.
For any other vision system to achieve this performance, it would require an AI system with approximately 1,000 directional full-HD cameras to observe the 360° view of the environment. An AI system that can handle the input of 1,000 cameras in real time is not practical for real-life applications due to extreme cost, high bandwidth requirements and implementation complexity.
Practical applications of the Intelligent Vision Systems would include autonomous driving, surveillance systems, law enforcement, security systems, improving the performance of identifying faces in crowded environments such as airports, crowded cities, metro stations and events, reading auto and truck details and license plates from increased distances, Automatic Plate Number Recognition (ANPR) for vehicle on-the-move systems, free movement expressway systems and more.
Summary
Accordingly, intelligent vision systems and methods for use in security/surveillance camera systems are desired. Intelligent vision systems and methods for use in security/surveillance camera systems that emulate the human retina and the attention mechanism of the human brain, replicating the central region of human vision ("Fovea") and its surrounding hierarchical vision and attention system, result in a significant performance increase over existing security/surveillance camera systems and any other camera system.
As disclosed and described herein, such intelligent vision systems and methods can be achieved through the innovative use of AI coupled with hardware, software, firmware, a combination of these, or the like.
To achieve the equivalent of human-level all-around high perception and vision acuity, deep learning AI is used to control one or more cameras with a field of view (FOV) that generates a set of Progressive Centrist Hierarchical Views (PCHV) of the visual environment. Each of the cameras can be equipped with a servo-operated rapid pan/tilt mechanism that is controlled by a state-of-the-art attention mechanism to achieve the enhanced resolution, accurate movement and rapid focusing necessary to perceive the environment in high detail.
The Intelligent Vision System's state-of-the-art attention mechanism uses a controller mechanism with deep learning capabilities and algorithms. This achieves enhanced resolutions and improved attention-driven focus, with resolutions that are greater than can be achieved by the human eye, greater distance viewing capabilities and 360° surround vision. Such intelligent vision systems and methods may consist of an all-in-one camera system that thinks and acts intelligently like a human and has the accuracy, efficiency, and speed for real-time applications with better-than-human sight.
Controlling a hierarchical view camera system is achieved by the disclosed attention mechanisms, which include deep learning, reinforcement learning, and human behavioral cloning imitation. The disclosed intelligent vision systems and methods provide a wider area of vision coverage, higher resolution, and further distance acuity than previously achieved by known systems. This includes 360° field of view coverage, with greater than human eye resolution (beyond 120 pixels/degree), providing up to 10 times (10x) the focal distance of other known solutions.
When monitoring a designated area, the disclosed intelligent vision systems and methods reduce the number of cameras required to cover the area. This is accomplished through the use of an advanced artificial intelligence system which drives the camera's rapid movement to the SOI utilizing an attention system.
Detailed Description
Referring to Figures 1 through 10 an embodiment of the disclosed intelligent vision system and method is illustrated. The intelligent vision systems and methods herein are based on a foveal view with an attention model that utilizes a hierarchical set of video image streams. Each video image stream includes an appropriate FOV to bring high resolution attention to the SOI.
Development of the disclosed intelligent vision systems and methods included seven Human-Robot Interaction (HRI) attention game experiments, which were performed using one, two or three cameras with varying FOVs. The subjects were nine balls with simulated human faces. HRI is a challenging task that requires an AI Deep Learning Neural Network to detect subtle visual cues such as eye gaze and the like.
The varying FOVs were achieved by changing the lenses between experiments. The balls had two circles drawn on them to represent eyes, and a small cone was glued onto each ball below the eyes to represent a nose, as depicted in Figure 2.
Referring to Figure 3, the basic principles of the HRI attention game experiments are depicted. The camera(s) is situated on the bottom left of the HRI Attention game table and the nine simulated human faces (in this case annotated balls) are positioned at various locations within the FOV of the camera(s).
For each experiment there is a series of episodes in which a randomly selected face looks toward the camera(s) while the other faces look in random directions. The task for the camera(s) AI is to locate the face that is looking at it. At each step within an episode, the AI rotates the camera(s) by one degree in any direction. The results include two components: for each time step the AI receives a reward of +1 if it can locate the correct face or a punishment of -0.1 if it cannot. Each episode has a maximum of 50 steps.
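This reward scheme maps directly onto a reinforcement-learning environment. The following is a minimal, single-axis sketch of that setup; the face placement, the gaze-success test and the centring threshold are illustrative assumptions, while the +1/-0.1 rewards, the one-degree actions and the 50-step episode limit come from the description above.

```python
import random


class HRIAttentionGame:
    """Single-axis sketch of the HRI attention game reward scheme."""

    STEP_LIMIT = 50   # each episode has a maximum of 50 steps
    STEP_DEG = 1.0    # the AI rotates the camera by one degree per step

    def __init__(self, num_faces=9, fov_deg=60.0):
        self.num_faces = num_faces
        self.fov_deg = fov_deg
        self.reset()

    def reset(self):
        # Faces are placed at assumed bearings on the game table; one randomly
        # selected face is the one looking toward the camera.
        self.face_bearings = [random.uniform(-60.0, 60.0) for _ in range(self.num_faces)]
        self.target = random.randrange(self.num_faces)
        self.pan = 0.0
        self.steps = 0
        return self.pan

    def _target_centred(self):
        # Assumed success test: the gazing face sits near the centre of the view.
        return abs(self.face_bearings[self.target] - self.pan) < self.fov_deg / 10.0

    def step(self, direction):
        # direction is +1 or -1: rotate the camera one degree in either direction.
        self.pan += direction * self.STEP_DEG
        self.steps += 1
        reward = 1.0 if self._target_centred() else -0.1   # reward scheme from the text
        done = self.steps >= self.STEP_LIMIT
        return self.pan, reward, done
```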
During each experiment, the position of the human face balls was adjusted by moving the faces either towards or away from the camera, and the time required for the AI Deep Learning Neural Network to focus its attention on the ball with the simulated human face best facing the camera was measured.
The goal of the experiments was to measure the time to achieve the attention focus for different numbers of cameras and varying FOVs.
Prior to starting the experiments, a baseline for a successful attention focus was established to set the maximum score, which is the best score that can be achieved.
Table 1 summarizes the camera and FOV configurations for each experiment and the results of the experiments in the last two columns. Figure 4 provides a graphical representation of experiment results.
One camera is used for experiments 1, 2 and 3. For each of these experiments a single lens is used, but with a progressively wider field of view ranging from narrow to wide. The HRI Game results ranged from -0.73 to -0.60 and are depicted graphically in Figure 4a; the outcome is that a single camera, regardless of FOV, yields the lowest performance for achieving SOI attention focus.
Two cameras are used for experiments 4, 5 and 6. For each of these experiments a narrow-angle lens is used on one camera and a wide-angle lens on the other, with a different combination of lenses in each experiment. The HRI Game results ranged from -0.59 to -0.16 and are depicted graphically in Figure 4b; the outcome is that two cameras, regardless of FOV, improve performance for achieving SOI attention focus compared to experiments 1, 2 and 3. The use of two cameras improved the convergence of the performance towards the baseline, with experiment 6 achieving the best performance.
Three cameras are used for experiment 7. For this experiment, three lenses are used: a narrow-angle lens on one camera, a midrange lens on another camera and a wide-angle lens on the third camera. The outcome was an HRI Game result of 0.04, depicted graphically in Figure 4c; three cameras improve performance for achieving SOI attention focus compared to experiments 1 through 6. The use of three cameras with appropriately selected lenses further improved the convergence of the performance towards the baseline, with experiment 7 achieving the best performance of the seven experiments.
As illustrated in Figure 4, each task setting was trained through an AI agent with each of the foveal representations shown until convergence. Using the three sets of parameters and the learning curves shown in Figure 4, the performance of all the experiments is summarized in Table 1. The values were computed as the average score over the last 100 trials. It can be seen that there are large variations in performance depending on the foveal representation used; the foveal representation is a crucial hyper-parameter for the attention model.
In Figure 4 it can also be seen that the camera with a FOV of 30 degrees is a preferred representation for this set of parameters as a single observation, while cameras with FOVs of 10 degrees and 60 degrees fail. This indicates that with a single observation, the FOV can greatly affect performance. Experiments 4 and 5, depicted in Figure 4b and Table 1, achieve low performance because the combination of two observations has large variations in FOV and the wider camera is unable to guide the narrower camera towards the SOI. On the other hand, experiment 6 has better performance as both cameras are able to work together efficiently by forming a hierarchy that improves the ability to zero in on the SOI. Finally, from Figure 4b it can be seen that increasing the number of cameras increases performance, as the cameras capture information from the environment at an increasing number of hierarchical views with improved resolutions.
Similar performance is achieved with the image classification task, as shown in Figure 4c, with experiment 7 having the best performance in both cases. However, there were slight variations in comparison to the HRI task, such as experiments with three cameras being better than experiments with one or two cameras. From these experiments it can be seen that the AI Deep Learning neural network needs to focus on global image features, and operation is improved with a wider FOV in general.
To train the attention model, Proximal Policy Optimization (PPO) was used. PPO is an efficient policy search algorithm capable of handling high-dimensional observations. The policy function is represented by a Dynamic Neural Network (DNN) as shown in Figure 4. The input to the network includes both the environment state and the observations.
The state representation is given by s_t ∈ R^(2·n_c), with n_c as the number of cameras and the state for each camera as its current orientation along two axes of rotation. Each camera also provides an observation in the form of a 64 × 64 RGB image, o_t.
There are two outputs generated by the AI Deep Learning neural network: the policy and a scalar state-value estimate. The policy output is a "softmax" probability distribution π(a_t | s_t) ∈ R^9, where the output actions drive the motion of the cameras along one of the eight cardinal directions.
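As an illustration only, the actor-critic structure just described (an orientation state of dimension 2·n_c, one RGB observation per camera, a 9-way softmax policy head and a scalar value head) could be expressed in TensorFlow/Keras roughly as follows; the layer sizes, the 64 × 64 observation resolution and the choice of three cameras are assumptions rather than the patented architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CAMERAS = 3   # assumed; the disclosure allows one or more cameras
NUM_ACTIONS = 9   # 9-way softmax as stated; actions move the cameras along the
                  # eight cardinal directions


def build_policy_network():
    # Environment state: current orientation of each camera about two axes.
    state_in = layers.Input(shape=(2 * NUM_CAMERAS,), name="camera_orientations")
    # One RGB observation per camera (64 x 64 resolution is an assumption).
    obs_ins = [layers.Input(shape=(64, 64, 3), name=f"camera_{i}_rgb")
               for i in range(NUM_CAMERAS)]

    # Small convolutional encoder applied to each camera observation.
    features = []
    for obs in obs_ins:
        x = layers.Conv2D(16, 5, strides=2, activation="relu")(obs)
        x = layers.Conv2D(32, 3, strides=2, activation="relu")(x)
        features.append(layers.Flatten()(x))

    h = layers.Concatenate()(features + [state_in])
    h = layers.Dense(256, activation="relu")(h)

    policy = layers.Dense(NUM_ACTIONS, activation="softmax", name="policy")(h)
    value = layers.Dense(1, name="value")(h)   # scalar state-value estimate
    return tf.keras.Model([state_in] + obs_ins, [policy, value])
```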
Figures 5 through 10 illustrate one preferred embodiment of the Intelligent Vision Systems and Methods described herein. The preferred embodiment in Figure 5 includes a head (10) to which cameras (12), (13) and (14) are mounted. Head (10) includes a pan/tilt end-effector (15) which controls the direction of the cameras (12), (13) and (14) with their corresponding lenses (18), (19) and (20). A separate pan mechanism (15) and tilt mechanism (17) provide additional control of the camera positioning. Head (10) has a fast real-time actuator (22) with a movement frequency of 10 to over 60 actions/second. Fast real-time actuator (22) is controlled by processor (23). This provides for rapid movement and rotation of the cameras (12), (13), (14) to desired specific pan/tilt angles and locations.
While three cameras were used for the prototype, it will be appreciated by those skilled in the art that the Intelligent Vision Systems and Methods can include any number of cameras, camera combinations, camera proximity, camera networks (IP or otherwise) or integrated cameras (i.e. cameras with multiple sensors), which may be used to realize the desired and improved Intelligent Vision Systems and Methods.
The images shown in Figure 6 and Figure 7 are photographs of the actual prototype.
The Intelligent Vision System utilizes a controller for performing the real-time control of the servos and cameras. The controller is hardware with embedded firmware. The controller utilizes the output from the artificial intelligence-deep learning neural network algorithm to manage the positioning of the pan/tilt cameras by issuing commands to the servos.
Servo-driven mechanical movement is achieved using gimbals, gearing, pivots and similar mechanical movement devices. The real-time rotation of the end-effector pan/tilt mechanisms is achieved with fast servos.
The interaction between cameras, image processing, image streams, neural networks, controllers, pan/tilt functionality and the repeating of one or more of these functionalities is achieved by the controller firmware. In each process module or functionality, it is necessary to execute a process flow which includes one or more of the following steps which are shown in the flowchart depicted in Figure 8:
1. capturing images from one or more cameras;
2. processing the images to generate the required streams;
3. feeding forward or otherwise making the images available to the neural network;
4. reading the output from the neural network to control the pan/tilt position or other positioning of one or more cameras for the next time step; and
5. starting again from step 1 to train the system and further the deep learning process.
All of these steps happen simultaneously and rapidly; a schematic version of the loop is sketched below.
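As a schematic illustration of the flow in Figure 8, the five steps can be read as the loop below; the camera, stream-builder, network and servo interfaces are hypothetical placeholders (in the disclosed system these roles are filled by OpenCV, TensorFlow and ROS, as described next).

```python
def run_intelligent_vision_loop(cameras, make_streams, network, servo_controller):
    """Schematic control loop; all interfaces are hypothetical placeholders."""
    while True:
        # 1. capture images from one or more cameras
        frames = [camera.read() for camera in cameras]
        # 2. process the images to generate the required streams
        streams = make_streams(frames)
        # 3. feed the streams forward to the neural network
        outputs = network.predict(streams)
        # 4. read the network output and command the pan/tilt position
        #    ("pan"/"tilt" keys are illustrative only)
        servo_controller.move_to(outputs["pan"], outputs["tilt"])
        # 5. repeat from step 1; in the real system these stages run
        #    simultaneously rather than strictly in sequence
```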
The Intelligent Vision Systems and Methods leverage Open Source software for image capture and for preprocessing the images.
1. OpenCV™ (Open Source Computer Vision Library) is used for image capture and processing. OpenCV provides a common infrastructure for computer vision applications and machine perception. This software library can be used to detect and recognize faces, identify objects, classify human actions in videos, track camera movements, track moving objects, extract 3D models of objects, produce 3D point clouds from stereo cameras, stitch images together to produce a high resolution image of an entire scene, find similar images from an image database, remove red eyes from images taken using flash, follow eye movements, recognize scenery and establish markers to overlay it with augmented reality, and more.
2. TensorFlow™ is used for control and to construct and train the neural network to produce the desired behavior, thus achieving the desired classification outputs.
TensorFlow is an end-to-end open source platform for machine learning that supports the numerical computations that make machine learning faster and easier. TensorFlow provides an abstraction that includes machine learning and deep learning (i.e. neural networking) models and algorithms and manages the various options for directing the output of one function to the input of another.
3. Robot Operating System (ROS™) is used to provide the servo control of the pan/tilt mechanism in real time.
ROS is a flexible robot software framework. It is a collection of tools, libraries, and conventions that aim to simplify the task of creating complex and robust robot behavior across a wide variety of robotic platforms.
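By way of example, a pan/tilt command might be issued from a ROS node along the following lines; the topic names, the Float64 message type and the use of radians are assumptions, since the disclosure only states that ROS provides real-time servo control of the pan/tilt mechanism.

```python
import rospy
from std_msgs.msg import Float64


def publish_pan_tilt(pan_rad, tilt_rad):
    # Topic names follow a common ros_control convention but are assumptions here.
    pan_pub = rospy.Publisher("/head/pan_position_controller/command",
                              Float64, queue_size=1)
    tilt_pub = rospy.Publisher("/head/tilt_position_controller/command",
                               Float64, queue_size=1)
    rospy.sleep(0.5)   # give subscribers a moment to connect
    pan_pub.publish(Float64(pan_rad))
    tilt_pub.publish(Float64(tilt_rad))


if __name__ == "__main__":
    rospy.init_node("pan_tilt_commander")
    publish_pan_tilt(0.1, -0.05)   # example: small pan right, slight tilt down
```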
The Intelligent Vision Systems and Methods disclosed include software with computer vision code to read the images from the cameras and process them in real time.
The image shown in Figure 9 depicts all of these elements coming together to realize the intelligent vision system.
The input images from the cameras are divided into multiple streams. For example, a camera with a 120° FOV captures the full frame, the center part of the image representing a 60° FOV is extracted, and both images are then resized to a specific resolution to feed a deep controller neural network. Such a configuration reduces the computational demand significantly without losing fine detail in the 60° FOV.
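A minimal OpenCV sketch of this splitting step is shown below; the proportional crop and the 64 × 64 target resolution are simplifying assumptions (a real system would map viewing angles to pixels through the actual lens projection model).

```python
import cv2


def split_streams(frame, wide_fov_deg=120.0, narrow_fov_deg=60.0, out_size=(64, 64)):
    """Crop the central narrow-FOV region of a wide-FOV frame and resize both."""
    h, w = frame.shape[:2]
    # Simple proportional crop; a real system would use the lens projection model.
    frac = narrow_fov_deg / wide_fov_deg
    ch, cw = int(h * frac), int(w * frac)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    centre = frame[y0:y0 + ch, x0:x0 + cw]

    wide_stream = cv2.resize(frame, out_size)      # full 120-degree view, downscaled
    narrow_stream = cv2.resize(centre, out_size)   # central 60-degree view, detail preserved
    return wide_stream, narrow_stream
```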
As illustrated in Figure 10, each camera sensor may provide a signal that generates a stream of data as detected by the sensor. Such sensors can be one or more of the following:
• Zoom camera sensor (12) develops the zoom camera stream (24);
• Wide camera sensor (13) develops wide camera stream (26);
• Surround camera sensor (14) develops surround camera stream (28).
The zoom camera (24), wide camera (26) and surround camera (28) streams may be fed into software level streams at level (0) to level (4). Such feeds may include:
• Zoom camera stream (24) is fed into software level streams 0 and 1 (28), (30)
• Wide camera stream (26) is fed into software level streams 2 and 3 (32), (34)
• Surround camera stream (28) is fed into software level stream 4 (36).
Software level stream 0 to software level stream 4 are processed by deep learning layers (38, 40, 42, 44 and 46). These deep learning layers (38, 40, 42, 44 and 46) are concatenated (48) and fed to long short-term memory (LSTM) modules T1, T2, T3 ... Tn (50, 52, 54, 56). The outputs (58, 60, 62, 64) of LSTM modules T1 to Tn (50, 52, 54, 56) can be classified and placed into memory (66, 68, 70 and 72) for future use.
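For illustration, the stream/concatenation/LSTM pipeline described above could be sketched in TensorFlow/Keras as follows; the layer sizes, the per-stream resolution, the sequence length and the number of output classes are assumptions, not values from the disclosure.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_STREAMS = 5    # software level streams 0 to 4
NUM_CLASSES = 10   # placeholder for the classification output


def build_stream_lstm_model(timesteps=4):
    # One input per software level stream: a short sequence of frames.
    stream_ins = [layers.Input(shape=(timesteps, 64, 64, 3), name=f"level_{i}")
                  for i in range(NUM_STREAMS)]

    # Deep learning layers applied to each stream, frame by frame.
    encoded = []
    for stream in stream_ins:
        x = layers.TimeDistributed(layers.Conv2D(16, 5, strides=2, activation="relu"))(stream)
        x = layers.TimeDistributed(layers.Conv2D(32, 3, strides=2, activation="relu"))(x)
        encoded.append(layers.TimeDistributed(layers.Flatten())(x))

    merged = layers.Concatenate()(encoded)   # concatenation step
    memory = layers.LSTM(128)(merged)        # LSTM modules T1 ... Tn
    classified = layers.Dense(NUM_CLASSES, activation="softmax")(memory)
    return tf.keras.Model(stream_ins, classified)
```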
LSTM outputs (66, 68, 70, 72) provide feedback through feedback loop (74) to adjust the pan and tilt mechanism controlled by computational processor (23). A configuration of lenses (18), (19) and (20) with multiple progressive centrist FOV parameters is used to provide the desired focal lengths and properties for the Intelligent Vision System and Methods. Such FOV parameters include, but are not limited to, 360°, 120° and 10°, which provides for human-level resolution of approximately 120 pixels/degree.
Another embodiment may include, but is not limited to, FOV parameters of 360°, 120°, 60°, 10° and 2°, which provides beyond-human-level resolution of more than 120 pixels/degree.
Reference throughout this specification to "the embodiment," "this embodiment," "the previous embodiment," "one embodiment," "an embodiment," "a preferred embodiment," "another preferred embodiment" or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases "in the embodiment," "in this embodiment," "in the previous embodiment," "in one embodiment," "in an embodiment," "in a preferred embodiment," "in another preferred embodiment," and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.
The described features, advantages, and characteristics of the invention may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art will recognize that the invention may be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments of the invention. While the present invention has been described in connection with certain exemplary or specific embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, but, on the contrary, is intended to cover various modifications, alternatives and equivalent arrangements as will be apparent to those skilled in the art. Any such changes, modifications, alternatives, equivalents and the like may be made without departing from the spirit and scope of the disclosure.

Claims

1. An intelligent vision system comprising:
a. A camera that can be positioned for creating an image data stream output for the camera and capturing images from the camera;
b. a processor for processing the image data stream output of the camera to generate one or more image data streams for the camera;
c. a neural network for receiving and further processing the image data stream output from the camera;
d. a reader for reading the output from the neural network to control the position of the camera; and
e. a feedback network for further processing the image data stream output of the camera to ultimately focus in on the Subject of Interest (SOI).
2. The intelligent vision system of claim 1, further including an artificial intelligence neural network and a second (or more) camera, wherein the artificial neural network manages a hierarchy of different view angles generated by the first camera and the second (or more) camera.
3. The intelligent vision system of claim 2, wherein the first and second (or more) cameras function as a single camera system.
4. The intelligent vision system of claim 1, wherein the artificial intelligence neural network includes a deep learning module controlling rapid movement of the cameras to find, detect or track one or more SOI targets.
5. The intelligent vision system of claim 1, wherein the camera produces a normal resolution image data stream and the processor provides a high- resolution image data stream of one or more targets, wherein the normal resolution image data stream and the high-resolution image data stream are displayed together.
6. A method of providing intelligent vision images including:
a. positioning at least one camera for creating an image data stream output for each camera;
b. capturing image data stream output from the at least one camera;
c. processing the image data stream output from the at least one camera to generate one or more data streams;
d. further processing the image data stream output from the at least one camera through a neural network;
e. reading the output from the neural network to control the position of the at least one camera; and
f. further processing the image data stream of the at least one camera through a feedback network.
7. The method of claim 6, further positioning a second (or more) camera for creating a second image data stream output, and an artificial neural network managing a hierarchy of different view angles generated by the first camera and the second (or more) camera.
8. The method of claim 7, wherein the first and second (or more) cameras function as a single camera system.
9. The method of claim 6, wherein further processing the image data stream of the at least one camera through a feedback network includes processing the image data stream of the at least one camera through a deep learning module controlling rapid movement of the camera to find, detect or track one or more SOI targets.
10. The method of claim 6, wherein processing the image data stream output of the at least one camera to generate one or more data streams produces a normal resolution image data stream and further processing the image data stream output from the at least one camera through a neural network provides a high-resolution image data stream of one or more SOI targets, wherein the normal resolution image data stream and the high-resolution image data stream are displayed together.
PCT/IB2019/058721 2018-10-13 2019-10-13 Intelligent vision system and methods WO2020075147A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862745346P 2018-10-13 2018-10-13
US62/745,346 2018-10-13

Publications (1)

Publication Number Publication Date
WO2020075147A1 true WO2020075147A1 (en) 2020-04-16

Family

ID=70164611

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2019/058721 WO2020075147A1 (en) 2018-10-13 2019-10-13 Intelligent vision system and methods

Country Status (1)

Country Link
WO (1) WO2020075147A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090324010A1 (en) * 2008-06-26 2009-12-31 Billy Hou Neural network-controlled automatic tracking and recognizing system and method
US9760806B1 (en) * 2016-05-11 2017-09-12 TCL Research America Inc. Method and system for vision-centric deep-learning-based road situation analysis
US20180060675A1 (en) * 2016-09-01 2018-03-01 Samsung Electronics Co., Ltd. Method and apparatus for controlling vision sensor for autonomous vehicle
US20180285648A1 (en) * 2017-03-30 2018-10-04 The Boeing Company Automated object tracking in a video feed using machine learning
JP6404527B1 (en) * 2016-11-30 2018-10-10 株式会社オプティム Camera control system, camera control method, and program

Similar Documents

Publication Publication Date Title
US20220343138A1 (en) Analysis of objects of interest in sensor data using deep neural networks
CN105744163B (en) A kind of video camera and image capture method based on depth information tracking focusing
EP3740860A1 (en) Gesture-and gaze-based visual data acquisition system
TW201520633A (en) Automatic focusing method, and automatic focusing device, image capturing device using the same
KR19980071007A (en) Image information input device and method
JP2007074033A (en) Imaging apparatus and control method thereof, computer program, and storage medium
Hoffmann et al. Action selection and mental transformation based on a chain of forward models
CN114930393A (en) System for unmanned aerial vehicle of a plurality of groups catches
TWI683575B (en) Method and apparatus for gaze recognition and interaction
Barreto et al. Active Stereo Tracking of $ N\le 3$ Targets Using Line Scan Cameras
JP2006329747A (en) Imaging device
WO2021184341A1 (en) Autofocus method and camera system thereof
WO2020075147A1 (en) Intelligent vision system and methods
US20180065247A1 (en) Configuring a robotic camera to mimic cinematographic styles
EP4354853A1 (en) Thermal-image-monitoring system using plurality of cameras
US20120019620A1 (en) Image capture device and control method
US20220138965A1 (en) Focus tracking system
US20210362342A1 (en) Robotic camera software and controller
Kim et al. Simulation of face pose tracking system using adaptive vision switching
Kundur et al. A vision-based pragmatic strategy for autonomous navigation
Meneses et al. Low-cost Autonomous Navigation System Based on Optical Flow Classification
CN112540676B (en) Projection system-based variable information display device
Daulatabad et al. Learning Strategies For Successful Crowd Navigation
RU2811574C1 (en) Video cameras control system
US11463674B1 (en) Imaging system and display apparatus incorporating super resolution using fixed focus cameras

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19871255

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19871255

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 09.08.2023)