WO2020075147A1 - Intelligent vision system and methods - Google Patents

Intelligent vision system and methods

Info

Publication number
WO2020075147A1
Authority
WO
WIPO (PCT)
Prior art keywords
camera
image data
data stream
neural network
intelligent vision
Prior art date
Application number
PCT/IB2019/058721
Other languages
French (fr)
Inventor
Mohamed SOMER
Abdul Rahman ABDUL GHANI
Original Assignee
Fovea Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fovea Technology filed Critical Fovea Technology
Publication of WO2020075147A1 publication Critical patent/WO2020075147A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/18Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • H04N7/183Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast for receiving images from a single remote source
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/45Cameras or camera modules comprising electronic image sensors; Control thereof for generating image signals from two or more image sensors being of different type or operating in different modes, e.g. with a CMOS sensor for moving images in combination with a charge-coupled device [CCD] for still images
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/50Constructional details
    • H04N23/51Housings
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/58Means for changing the camera field of view without moving the camera body, e.g. nutating or panning of optics or image sensors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Studio Devices (AREA)

Abstract

Intelligent vision systems and methods which are capable of focusing on subtle visual cues at high resolution, using principles of a foveal view coupled with an attention mechanism that functions like the human eye, are described. The intelligent vision systems and methods are based on artificial intelligence-deep learning neural network models that rely on a hierarchy of visual streams towards achieving human and greater levels of sharp vision. The intelligent vision systems and methods process multiple high-resolution field of view visual streams to create a foveal view representation of the environment, which is fed forward to the neural network, which controls the pan/tilt positioning of the cameras based on an attention mechanism to achieve focus on a Subject of Interest (SOI).

Description

Intelligent Vision System and Methods
Background
Visual acuity is defined as sharpness or clarity of vision. Humans visually perceive their environment with very high levels of detail at very high resolutions (60 to 120 pixels/degree) in a 360-degree environment. This visual acuity allows humans to perform tasks such as driving and engaging in social interactions that require attention to subtle visual cues such as the intention of a pedestrian or facial expression.
Visual acuity is a function of the fovea, the portion of the eye capable of the greatest acuity. At the center of the fovea, visual acuity is at its greatest. Moving away from the fovea, visual acuity progressively diminishes, establishing a visual representation model that is a hierarchical layering of varying degrees of acuity, as shown in Figure 1, which depicts the fovea of the human eye.
This hierarchy plays a crucial role in the human attention mechanism, as the wider representations contain low-resolution information about the environment that guides the more highly detailed foveal view towards the Subject-of-Interest (SOI) through the hierarchical processing of progressive views with ever-increasing acuity.
An example of this occurs when something in the corner of the eye catches the person’s attention and they move their eyes to achieve a more highly detailed view of the object. The eyes move to align and focus the fovea onto the SOI so that it can be seen with a high degree of visual acuity. While in the corner of the eye, the person may not be able to identify the object, or discern details, but once the object is within the foveal view of the eye, the object and details are easily identifiable due to the high degree of acuity of the fovea.
Disclosed herein is a method for an Intelligent Vision System that is able to function as well as, or better than, human vision and its attention mechanism. This Intelligent Vision System is based on a unique combination of Artificial Intelligence (AI) and commercially available cameras and servos.
To replace human vision with an Intelligent Vision System, an AI capability is desired to perceive the environment in a manner that is similar to humans. This must be done at resolutions with very high levels of detail, 60 to 120 pixels/degree or greater, and it must also support visual hierarchical layering in varying degrees of visual acuity, just like human vision.
The improvement that this Intelligent Vision System offers over existing cameras and camera systems, both with and without AI support, is that it is based on the human vision model of hierarchical layering in varying degrees of visual acuity, coupled with the human attention mechanism and a unique coupling of AI, cameras and servos.
For any other vision system to achieve this performance, it would require an AI system with approximately 1,000 directional full-HD cameras to observe the 360° view of the environment. An AI system that can handle the input of 1,000 cameras in real time is not practical for real-life applications due to extreme cost, high bandwidth requirements and implementation complexity.
Practical applications of the Intelligent Vision Systems would include autonomous driving, surveillance systems, law enforcement, security systems, improving the performance of identifying faces in crowded environments such as airports, crowded cities, metro stations and events, reading auto and truck details and license plates from increased distances, Automatic Plate Number Recognition (ANPR) for vehicle on-the-move systems, free movement expressway systems and more.
Summary
Accordingly, intelligent vision systems and methods for use in security/surveillance camera systems are desired. Intelligent vision systems and methods for use in security/surveillance camera systems that emulate the human retina and the attention mechanism of the human brain, replicating the central region of human vision ("Fovea") and its surrounding hierarchical vision and attention system, result in a significant performance increase over existing security/surveillance camera systems and any other camera system.
As disclosed and described herein, such intelligent vision systems and methods can be achieved through the innovative use of AI coupled with hardware, software, firmware, a combination of these, or the like.
To achieve the equivalent of human-level all-around high perception and vision acuity, deep learning AI is used to control one or more cameras with a field of view (FOV) that generates a set of Progressive Centrist Hierarchical Views (PCHV) of the visual environment. Each of the cameras can be equipped with a servo-operated rapid pan/tilt mechanism that is controlled by a state-of-the-art attention mechanism to achieve the enhanced resolution, accurate movement and rapid focusing necessary to perceive the environment in high detail.
The Intelligent Vision System's state-of-the-art attention mechanism uses a controller mechanism with deep learning capabilities and algorithms. This achieves enhanced resolutions and improved attention-driven focus, with resolutions that are greater than can be achieved by the human eye, greater distance viewing capabilities and 360° surround vision. Such intelligent vision systems and methods may consist of an all-in-one camera system that thinks and acts intelligently like a human and has the accuracy, efficiency, and speed for real-time applications with better-than-human sight.
Controlling a hierarchical view camera system is achieved by the disclosed attention mechanisms, which include deep learning, reinforcement learning, and human behavioral cloning imitation. The disclosed intelligent vision systems and methods provide a wider area of vision coverage, higher resolution, and further distance acuity than previously achieved by known systems. This includes 360° field of view coverage, with greater than human eye resolution (beyond 120 pixels/degree), providing up to 10 times (10x) the focal distance of other known solutions.
When monitoring a designated area, the disclosed intelligent vision systems and methods reduce the number of cameras required to cover the area. This is accomplished through the use of an advanced artificial intelligence system which drives the camera's rapid movement to the SOI utilizing an attention system.
Detailed Description
Referring to Figures 1 through 10 an embodiment of the disclosed intelligent vision system and method is illustrated. The intelligent vision systems and methods herein are based on a foveal view with an attention model that utilizes a hierarchical set of video image streams. Each video image stream includes an appropriate FOV to bring high resolution attention to the SOI.
Development of the disclosed intelligent vision systems and methods included seven Human-Robot Interaction (HRI) attention game experiments, which were performed using one, two or three cameras with varying FOVs. The subjects were nine balls with simulated human faces. HRI is a challenging task that requires an AI Deep Learning Neural Network to detect subtle visual cues such as eye gaze and the like.
The varying FOVs were achieved by changing the lenses between experiments. The balls had two circles drawn on them to represent eyes, and a small cone was glued onto each ball below the eyes to represent a nose, as depicted in Figure 2.
Referring to Figure 3, the basic principles of the HRI attention game experiments are depicted. The camera(s) is situated on the bottom left of the HRI Attention game table and the nine simulated human faces (in this case annotated balls) are positioned at various locations within the FOV of the camera(s).
For each experiment there is a series of episodes in which a randomly selected face looks toward the camera(s) while the other faces look in random directions. The task for the camera(s) AI is to locate the face that is looking at it. At each step within an episode, the AI rotates the camera(s) by one degree in any direction. The results include two components: for each time step the AI receives a reward of +1 if it can locate the correct face or a punishment of -0.1 if it cannot. Each episode has a maximum of 50 steps.
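This reward scheme maps directly onto a reinforcement-learning environment. The following is a minimal, single-axis sketch of that setup; the face placement, the gaze-success test and the centring threshold are illustrative assumptions, while the +1/-0.1 rewards, the one-degree actions and the 50-step episode limit come from the description above.

```python
import random


class HRIAttentionGame:
    """Single-axis sketch of the HRI attention game reward scheme."""

    STEP_LIMIT = 50   # each episode has a maximum of 50 steps
    STEP_DEG = 1.0    # the AI rotates the camera by one degree per step

    def __init__(self, num_faces=9, fov_deg=60.0):
        self.num_faces = num_faces
        self.fov_deg = fov_deg
        self.reset()

    def reset(self):
        # Faces are placed at assumed bearings on the game table; one randomly
        # selected face is the one looking toward the camera.
        self.face_bearings = [random.uniform(-60.0, 60.0) for _ in range(self.num_faces)]
        self.target = random.randrange(self.num_faces)
        self.pan = 0.0
        self.steps = 0
        return self.pan

    def _target_centred(self):
        # Assumed success test: the gazing face sits near the centre of the view.
        return abs(self.face_bearings[self.target] - self.pan) < self.fov_deg / 10.0

    def step(self, direction):
        # direction is +1 or -1: rotate the camera one degree in either direction.
        self.pan += direction * self.STEP_DEG
        self.steps += 1
        reward = 1.0 if self._target_centred() else -0.1   # reward scheme from the text
        done = self.steps >= self.STEP_LIMIT
        return self.pan, reward, done
```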
During each experiment, the position of the human face balls was adjusted by moving the faces either towards or away from the camera, and the time required for the AI Deep Learning Neural Network to focus its attention on the ball with the simulated human face best facing the camera was measured.
The goal of the experiments was to measure the time to achieve the attention focus for different numbers of cameras and varying FOVs.
Prior to starting the experiments, a baseline for a successful attention focus was established to set the maximum score, which is the best score that can be achieved.
Table 1 summarizes the camera and FOV configurations for each experiment and the results of the experiments in the last two columns. Figure 4 provides a graphical representation of experiment results.
One camera is used for experiments 1, 2 and 3. For each of these experiments a single lens is used, but with a progressively wider field of view ranging from narrow to wide. The HRI Game results ranged from -0.73 to -0.60 and are depicted graphically in Figure 4a; the outcome is that a single camera, regardless of FOV, yields the lowest performance for achieving SOI attention focus.
Two cameras are used for experiments 4, 5 and 6. For each of these experiments a narrow-angle lens is used on one camera and a wide-angle lens on the other, with a different combination of lenses in each experiment. The HRI Game results ranged from -0.59 to -0.16 and are depicted graphically in Figure 4b; the outcome is that two cameras, regardless of FOV, improve performance for achieving SOI attention focus compared to experiments 1, 2 and 3. The use of two cameras improved the convergence of the performance towards the baseline, with experiment 6 achieving the best performance.
Three cameras are used for experiment 7. For this experiment, three lenses are used: a narrow-angle lens on one camera, a midrange lens on another camera and a wide-angle lens on the third camera. The outcome was an HRI Game result of 0.04, depicted graphically in Figure 4c; three cameras improve performance for achieving SOI attention focus compared to experiments 1 through 6. The use of three cameras with appropriately selected lenses further improved the convergence of the performance towards the baseline, with experiment 7 achieving the best performance of the seven experiments.
As illustrated in Figure 4, each task setting was trained through an AI agent with each of the foveal representations shown until convergence. Using the three sets of parameters and the learning curves shown in Figure 4, the performance of all the experiments is summarized in Table 1. The values were computed as the average score over the last 100 trials. It can be seen that there are large variations in performance depending on the foveal representation used; the foveal representation is a crucial hyper-parameter for the attention model.
In Figure 4 it can also be seen that the camera with a FOV of 30 degrees is a preferred representation for this set of parameters as a single observation, while cameras with FOVs of 10 degrees and 60 degrees fail. This indicates that with a single observation, the FOV can greatly affect performance. Experiments 4 and 5, depicted in Figure 4b and Table 1, achieve low performance because the combination of two observations has large variations in FOV and the wider camera is unable to guide the narrower camera towards the SOI. On the other hand, experiment 6 has better performance as both cameras are able to work together efficiently by forming a hierarchy that improves the ability to zero in on the SOI. Finally, from Figure 4b it can be seen that increasing the number of cameras increases performance, as the cameras capture information from the environment at an increasing number of hierarchical views with improved resolutions.
Similar performance is achieved with the image classification task, as shown in Figure 4c, with experiment 7 having the best performance in both cases. However, there were slight variations in comparison to the HRI task, such as experiments with three cameras being better than experiments with one or two cameras. From these experiments it can be seen that the AI Deep Learning neural network needs to focus on global image features, and operation is improved with a wider FOV in general.
To train the attention model, Proximal Policy Optimization (PPO) was used. PPO is an efficient policy search algorithm capable of handling high-dimensional observations. The policy function is represented by a Dynamic Neural Network (DNN) as shown in Figure 4. The input to the network includes both the environment state and the observations.
The state representation is given by s_t ∈ R^(2·n_c), with n_c as the number of cameras and the state for each camera as its current orientation along two axes of rotation. Each camera also provides an observation in the form of a 64 × 64 RGB image, o_t.
There are two outputs generated by the AI Deep Learning neural network: the policy and a scalar state-value estimate. The policy output is a "softmax" probability distribution π(a_t | s_t) ∈ R^9, where the output actions drive the motion of the cameras along one of the eight cardinal directions.
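As an illustration only, the actor-critic structure just described (an orientation state of dimension 2·n_c, one RGB observation per camera, a 9-way softmax policy head and a scalar value head) could be expressed in TensorFlow/Keras roughly as follows; the layer sizes, the 64 × 64 observation resolution and the choice of three cameras are assumptions rather than the patented architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CAMERAS = 3   # assumed; the disclosure allows one or more cameras
NUM_ACTIONS = 9   # 9-way softmax as stated; actions move the cameras along the
                  # eight cardinal directions


def build_policy_network():
    # Environment state: current orientation of each camera about two axes.
    state_in = layers.Input(shape=(2 * NUM_CAMERAS,), name="camera_orientations")
    # One RGB observation per camera (64 x 64 resolution is an assumption).
    obs_ins = [layers.Input(shape=(64, 64, 3), name=f"camera_{i}_rgb")
               for i in range(NUM_CAMERAS)]

    # Small convolutional encoder applied to each camera observation.
    features = []
    for obs in obs_ins:
        x = layers.Conv2D(16, 5, strides=2, activation="relu")(obs)
        x = layers.Conv2D(32, 3, strides=2, activation="relu")(x)
        features.append(layers.Flatten()(x))

    h = layers.Concatenate()(features + [state_in])
    h = layers.Dense(256, activation="relu")(h)

    policy = layers.Dense(NUM_ACTIONS, activation="softmax", name="policy")(h)
    value = layers.Dense(1, name="value")(h)   # scalar state-value estimate
    return tf.keras.Model([state_in] + obs_ins, [policy, value])
```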
Figures 5 through 10 illustrate one preferred embodiment of the Intelligent Vision Systems and Methods described herein. The preferred embodiment in Figure 5 includes a head (10) to which cameras (12), (13) and (14) are mounted. Head (10) includes a pan/tilt end-effector (15) which controls the direction of the cameras (12), (13) and (14) with their corresponding lenses (18), (19) and (20). A separate pan mechanism (15) and tilt mechanism (17) provide additional control of the camera positioning. Head (10) has a fast real-time actuator (22) with a movement frequency of 10 to over 60 actions/second. Fast real-time actuator (22) is controlled by processor (23). This provides for rapid movement and rotation of the cameras (12), (13), (14) to desired specific pan/tilt angles and locations.
While three cameras were used for the prototype, it will be appreciated by those skilled in the art that the Intelligent Vision Systems and Methods can include any number of cameras, camera combinations, camera proximity, camera networks (IP or otherwise) or integrated cameras (i.e. cameras with multiple sensors), which may be used to realize the desired and improved Intelligent Vision Systems and Methods.
The images shown in Figure 6 and Figure 7 are photographs of the actual prototype.
The Intelligent Vision System utilizes a controller for performing the real-time control of the servos and cameras. The controller is hardware with embedded firmware. The controller utilizes the output from the artificial intelligence-deep learning neural network algorithm to manage the positioning of the pan/tilt cameras by issuing commands to the servos.
Servo-driven mechanical movement is achieved using gimbals, gearing, pivots and similar mechanical movement devices. The real-time rotation of the end-effector pan/tilt mechanisms is achieved with fast servos.
The interaction between cameras, image processing, image streams, neural networks, controllers, pan/tilt functionality and the repeating of one or more of these functionalities is achieved by the controller firmware. In each process module or functionality, it is necessary to execute a process flow which includes one or more of the following steps which are shown in the flowchart depicted in Figure 8:
1. capturing images from one or more cameras;
2. processing the images to generate the required streams;
3. feeding forward or otherwise making the images available to the neural network;
4. reading the output from the neural network to control the pan/tilt position or other positioning of one or more cameras for the next time step; and
5. starting again from step 1 to train the system and further the deep learning process.
All of these steps happen simultaneously and rapidly; a schematic version of the loop is sketched below.
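As a schematic illustration of the flow in Figure 8, the five steps can be read as the loop below; the camera, stream-builder, network and servo interfaces are hypothetical placeholders (in the disclosed system these roles are filled by OpenCV, TensorFlow and ROS, as described next).

```python
def run_intelligent_vision_loop(cameras, make_streams, network, servo_controller):
    """Schematic control loop; all interfaces are hypothetical placeholders."""
    while True:
        # 1. capture images from one or more cameras
        frames = [camera.read() for camera in cameras]
        # 2. process the images to generate the required streams
        streams = make_streams(frames)
        # 3. feed the streams forward to the neural network
        outputs = network.predict(streams)
        # 4. read the network output and command the pan/tilt position
        #    ("pan"/"tilt" keys are illustrative only)
        servo_controller.move_to(outputs["pan"], outputs["tilt"])
        # 5. repeat from step 1; in the real system these stages run
        #    simultaneously rather than strictly in sequence
```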
The Intelligent Vision Systems and Methods leverage Open Source software for image capture and for preprocessing the images.
1. OpenCV™ (Open Source Computer Vision Library) is used for image capture and processing. OpenCV provides a common infrastructure for computer vision applications and machine perception. This software library can be used to detect and recognize faces, identify objects, classify human actions in videos, track camera movements, track moving objects, extract 3D models of objects, produce 3D point clouds from stereo cameras, stitch images together to produce a high resolution image of an entire scene, find similar images from an image database, remove red eyes from images taken using flash, follow eye movements, recognize scenery and establish markers to overlay it with augmented reality, and more.
2. TensorFlow™ is used for control and to construct and train the neural network to produce the desired behavior, thus achieving the desired classification outputs.
TensorFlow is an end-to-end open source platform for machine learning that supports the numerical computations that make machine learning faster and easier. TensorFlow provides an abstraction that includes machine learning and deep learning (i.e. neural networking) models and algorithms and manages the various options for directing the output of one function to the input of another.
3. Robot Operating System (ROS™) is used to provide the servo control of the pan/tilt mechanism in real time.
ROS is a flexible robot software framework. It is a collection of tools, libraries, and conventions that aim to simplify the task of creating complex and robust robot behavior across a wide variety of robotic platforms.
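By way of example, a pan/tilt command might be issued from a ROS node along the following lines; the topic names, the Float64 message type and the use of radians are assumptions, since the disclosure only states that ROS provides real-time servo control of the pan/tilt mechanism.

```python
import rospy
from std_msgs.msg import Float64


def publish_pan_tilt(pan_rad, tilt_rad):
    # Topic names follow a common ros_control convention but are assumptions here.
    pan_pub = rospy.Publisher("/head/pan_position_controller/command",
                              Float64, queue_size=1)
    tilt_pub = rospy.Publisher("/head/tilt_position_controller/command",
                               Float64, queue_size=1)
    rospy.sleep(0.5)   # give subscribers a moment to connect
    pan_pub.publish(Float64(pan_rad))
    tilt_pub.publish(Float64(tilt_rad))


if __name__ == "__main__":
    rospy.init_node("pan_tilt_commander")
    publish_pan_tilt(0.1, -0.05)   # example: small pan right, slight tilt down
```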
The Intelligent Vision Systems and Methods disclosed include software with computer vision code to read the images from the cameras and process them in real time.
The image shown in Figure 9 depicts all of these elements coming together to realize the intelligent vision system.
The input images from the cameras are divided into multiple streams. For example, a camera with a 120° FOV captures the full frame, the center part of the image representing a 60° FOV is extracted, and both images are then resized to a specific resolution to feed a deep controller neural network. Such a configuration reduces the computational demand significantly without losing fine detail in the 60° FOV.
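A minimal OpenCV sketch of this splitting step is shown below; the proportional crop and the 64 × 64 target resolution are simplifying assumptions (a real system would map viewing angles to pixels through the actual lens projection model).

```python
import cv2


def split_streams(frame, wide_fov_deg=120.0, narrow_fov_deg=60.0, out_size=(64, 64)):
    """Crop the central narrow-FOV region of a wide-FOV frame and resize both."""
    h, w = frame.shape[:2]
    # Simple proportional crop; a real system would use the lens projection model.
    frac = narrow_fov_deg / wide_fov_deg
    ch, cw = int(h * frac), int(w * frac)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    centre = frame[y0:y0 + ch, x0:x0 + cw]

    wide_stream = cv2.resize(frame, out_size)      # full 120-degree view, downscaled
    narrow_stream = cv2.resize(centre, out_size)   # central 60-degree view, detail preserved
    return wide_stream, narrow_stream
```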
As illustrated in Figure 10, each camera sensor may provide a signal that generates a stream of data as detected by the sensor. Such sensors can be one or more of the following:
• Zoom camera sensor (12) develops the zoom camera stream (24);
• Wide camera sensor (13) develops wide camera stream (26);
• Surround camera sensor (14) develops surround camera stream (28).
The zoom camera (24), wide camera (26) and surround camera (28) streams may be fed into software level streams at level (0) to level (4). Such feeds may include:
• Zoom camera stream (24) is fed into software level streams 0 and 1 (28), (30)
• Wide camera stream (26) is fed into software level streams 2 and 3 (32), (34)
• Surround camera stream (28) is fed into software level stream 4 (36).
Software level stream 0 to software level stream 4 are processed by deep learning layers (38, 40, 42, 44 and 46). These deep learning layers (38, 40, 42, 44 and 46) are concatenated (48) and fed to long short-term memory (LSTM) modules T1, T2, T3 ... Tn (50, 52, 54, 56). The outputs (58, 60, 62, 64) of LSTM modules T1 to Tn (50, 52, 54, 56) can be classified and placed into memory (66, 68, 70 and 72) for future use.
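For illustration, the stream/concatenation/LSTM pipeline described above could be sketched in TensorFlow/Keras as follows; the layer sizes, the per-stream resolution, the sequence length and the number of output classes are assumptions, not values from the disclosure.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_STREAMS = 5    # software level streams 0 to 4
NUM_CLASSES = 10   # placeholder for the classification output


def build_stream_lstm_model(timesteps=4):
    # One input per software level stream: a short sequence of frames.
    stream_ins = [layers.Input(shape=(timesteps, 64, 64, 3), name=f"level_{i}")
                  for i in range(NUM_STREAMS)]

    # Deep learning layers applied to each stream, frame by frame.
    encoded = []
    for stream in stream_ins:
        x = layers.TimeDistributed(layers.Conv2D(16, 5, strides=2, activation="relu"))(stream)
        x = layers.TimeDistributed(layers.Conv2D(32, 3, strides=2, activation="relu"))(x)
        encoded.append(layers.TimeDistributed(layers.Flatten())(x))

    merged = layers.Concatenate()(encoded)   # concatenation step
    memory = layers.LSTM(128)(merged)        # LSTM modules T1 ... Tn
    classified = layers.Dense(NUM_CLASSES, activation="softmax")(memory)
    return tf.keras.Model(stream_ins, classified)
```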
LSTM outputs (66, 68, 70, 72) provide feedback through feedback loop (74) to adjust the pan and tilt mechanism controlled by computational processor (23). A configuration of lenses (18), (19) and (20) with multiple progressive centrist FOV parameters is used to provide the desired focal lengths and properties for the Intelligent Vision System and Methods. Such FOV parameters include, but are not limited to, 360°, 120° and 10°, which provides for human-level resolution of approximately 120 pixels/degree.
Another embodiment may include, but is not limited to, FOV parameters of 360°, 120°, 60°, 10° and 2°, which provides beyond-human-level resolution of more than 120 pixels/degree.
Reference throughout this specification to "the embodiment," "this embodiment," "the previous embodiment," "one embodiment," "an embodiment," "a preferred embodiment," "another preferred embodiment" or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases "in the embodiment," "in this embodiment," "in the previous embodiment," "in one embodiment," "in an embodiment," "in a preferred embodiment," "in another preferred embodiment," and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.
The described features, advantages, and characteristics of the invention may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art will recognize that the invention may be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments of the invention. While the present invention has been described in connection with certain exemplary or specific embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, but, on the contrary, is intended to cover various modifications, alternatives and equivalent arrangements as will be apparent to those skilled in the art. Any such changes, modifications, alternatives, equivalents and the like may be made without departing from the spirit and scope of the disclosure.

Claims

1. An intelligent vision system comprising:
a. A camera that can be positioned for creating an image data stream output for the camera and capturing images from the camera;
b. a processor for processing the image data stream output of the camera to generate one or more image data streams for the camera;
c. a neural network for receiving and further processing the image data stream output from the camera;
d. a reader for reading the output from the neural network to control the position of the camera; and
e. a feedback network for further processing the image data stream output of the camera to ultimately focus in on the Subject of Interest (SOI).
2. The intelligent vision system of claim 1, further including an artificial intelligence neural network and a second (or more) camera, wherein the artificial neural network manages a hierarchy of different view angles generated by the first camera and the second (or more) camera.
3. The intelligent vision system of claim 2, wherein the first and second (or more) cameras function as a single camera system.
4. The intelligent vision system of claim 1, wherein the artificial intelligence neural network includes a deep learning module controlling rapid movement of the cameras to find, detect or track one or more SOI targets.
5. The intelligent vision system of claim 1, wherein the camera produces a normal resolution image data stream and the processor provides a high- resolution image data stream of one or more targets, wherein the normal resolution image data stream and the high-resolution image data stream are displayed together.
6. A method of providing intelligent vision images including:
a. positioning at least one camera for creating an image data stream output for each camera;
b. capturing image data stream output from the at least one camera;
c. processing the image data stream output from the at least one camera to generate one or more data streams;
d. further processing the image data stream output from the at least one camera through a neural network;
e. reading the output from the neural network to control the position of the at least one camera; and
f. further processing the image data stream of the at least one camera through a feedback network.
7. The method of claim 6, further positioning a second (or more) camera for creating a second image data stream output, and an artificial neural network managing a hierarchy of different view angles generated by the first camera and the second (or more) camera.
8. The method of claim 7, wherein the first and second (or more) cameras function as a single camera system.
9. The method of claim 6, wherein further processing the image data stream of the at least one camera through a feedback network includes processing the image data stream of the at least one camera through a deep learning module controlling rapid movement of the camera to find, detect or track one or more SOI targets.
10. The method of claim 6, wherein processing the image data stream output of the at least one camera to generate one or more data streams produces a normal resolution image data stream and further processing the image data stream output from the at least one camera through a neural network provides a high-resolution image data stream of one or more SOI targets, wherein the normal resolution image data stream and the high-resolution image data stream are displayed together.
PCT/IB2019/058721 2018-10-13 2019-10-13 Intelligent vision system and methods WO2020075147A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862745346P 2018-10-13 2018-10-13
US62/745,346 2018-10-13

Publications (1)

Publication Number Publication Date
WO2020075147A1 true WO2020075147A1 (en) 2020-04-16

Family

ID=70164611

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2019/058721 WO2020075147A1 (en) 2018-10-13 2019-10-13 Intelligent vision system and methods

Country Status (1)

Country Link
WO (1) WO2020075147A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090324010A1 (en) * 2008-06-26 2009-12-31 Billy Hou Neural network-controlled automatic tracking and recognizing system and method
US9760806B1 (en) * 2016-05-11 2017-09-12 TCL Research America Inc. Method and system for vision-centric deep-learning-based road situation analysis
US20180060675A1 (en) * 2016-09-01 2018-03-01 Samsung Electronics Co., Ltd. Method and apparatus for controlling vision sensor for autonomous vehicle
US20180285648A1 (en) * 2017-03-30 2018-10-04 The Boeing Company Automated object tracking in a video feed using machine learning
JP6404527B1 (en) * 2016-11-30 2018-10-10 株式会社オプティム Camera control system, camera control method, and program

Similar Documents

Publication Publication Date Title
US20220343138A1 (en) Analysis of objects of interest in sensor data using deep neural networks
CN105744163B (en) A kind of video camera and image capture method based on depth information tracking focusing
EP3740860A1 (en) Gesture-and gaze-based visual data acquisition system
TW201520633A (en) Automatic focusing method, and automatic focusing device, image capturing device using the same
KR19980071007A (en) Image information input device and method
JP2007074033A (en) Imaging apparatus and control method thereof, computer program, and storage medium
Hoffmann et al. Action selection and mental transformation based on a chain of forward models
CN114930393A (en) System for unmanned aerial vehicle of a plurality of groups catches
TWI683575B (en) Method and apparatus for gaze recognition and interaction
Barreto et al. Active Stereo Tracking of $ N\le 3$ Targets Using Line Scan Cameras
JP2006329747A (en) Imaging device
WO2021184341A1 (en) Autofocus method and camera system thereof
WO2020075147A1 (en) Intelligent vision system and methods
US20180065247A1 (en) Configuring a robotic camera to mimic cinematographic styles
EP4354853A1 (en) Thermal-image-monitoring system using plurality of cameras
US20120019620A1 (en) Image capture device and control method
US20220138965A1 (en) Focus tracking system
US20210362342A1 (en) Robotic camera software and controller
Kim et al. Simulation of face pose tracking system using adaptive vision switching
Kundur et al. A vision-based pragmatic strategy for autonomous navigation
Meneses et al. Low-cost Autonomous Navigation System Based on Optical Flow Classification
CN112540676B (en) Projection system-based variable information display device
Daulatabad et al. Learning Strategies For Successful Crowd Navigation
RU2811574C1 (en) Video cameras control system
US11463674B1 (en) Imaging system and display apparatus incorporating super resolution using fixed focus cameras

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19871255

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19871255

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 09.08.2023)