WO2020075147A1 - Intelligent vision system and methods - Google Patents
- Publication number
- WO2020075147A1 (PCT/IB2019/058721)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- camera
- image data
- data stream
- neural network
- intelligent vision
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/18—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
- H04N7/183—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast for receiving images from a single remote source
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/45—Cameras or camera modules comprising electronic image sensors; Control thereof for generating image signals from two or more image sensors being of different type or operating in different modes, e.g. with a CMOS sensor for moving images in combination with a charge-coupled device [CCD] for still images
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/50—Constructional details
- H04N23/51—Housings
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/58—Means for changing the camera field of view without moving the camera body, e.g. nutating or panning of optics or image sensors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
Definitions
- Visual acuity is defined as sharpness or clarity of vision. Humans visually perceive their environment with very high levels of detail at very high resolutions (60 to 120 pixels/degree) in a 360-degree environment. This visual acuity allows humans to perform tasks such as driving and engaging in social interactions that require attention to subtle visual cues such as the intention of a pedestrian or facial expression.
- Visual acuity is a function of the fovea, the portion of the eye with the capability to achieve the greatest acuity. At the center of the fovea, visual acuity is at its greatest. Moving away from the fovea, visual acuity progressively diminishes, thus establishing a visual representation model that is a hierarchical layering of varying degrees of acuity, as depicted in Figure 1, which shows the fovea of the human eye.
- This hierarchy plays a crucial role in the human attention mechanism, as the wider representations contain low-resolution information about the environment that guides the more highly detailed foveal view towards the Subject-of-Interest (SOI) through the hierarchical processing of progressive views with ever-increasing acuity.
- SOI Subject-of-Interest
- This Intelligent Vision System is based on a unique combination of Artificial Intelligence (AI) and commercially available cameras and servos.
- AI Artificial Intelligence
- An AI capability is desired to perceive the environment in a manner similar to humans. This must be done at very high levels of detail (60 to 120 pixels/degree or greater), and it must also support visual hierarchical layering in varying degrees of visual acuity, just like human vision.
- The advantage this Intelligent Vision System commands over existing cameras and camera systems, both with and without AI support, is that it is based on the human vision model of hierarchical layering in varying degrees of visual acuity, coupled with the human attention mechanism and a unique coupling of AI, cameras, and servos.
- Applications for Intelligent Vision Systems would include autonomous driving, surveillance systems, law enforcement, security systems, improving the performance of identifying faces in crowded environments such as airports, crowded cities, metro stations, and events, reading auto and truck details and license plates from increased distances, Automatic Plate Number Recognition (ANPR) for vehicle on-the-move systems, free-movement expressway systems, and more.
- ANPR Automatic Plate Number Recognition
- Intelligent vision systems and methods for use in security/surveillance camera systems are desired.
- Intelligent vision systems and methods for use in security/surveillance camera systems that emulate the human retina and the attention mechanism of the human brain, replicating the central region of human vision (the "Fovea") and its surrounding hierarchical vision and attention system, result in a significant performance increase over existing security/surveillance camera systems and any other camera system.
- Fovea central region of human vision
- Deep learning AI is used to control one or more cameras with a field of view (FOV) that generates a set of Progressive Centrist Hierarchical Views (PCHV) of the visual environment.
- FOV field of view
- PCHV Progressive Centrist Hierarchical Views
- Each of the cameras can be equipped with a servo-operated rapid pan/tilt mechanism that is controlled by a state-of-the-art attention mechanism to achieve the enhanced resolution, accurate movement, and rapid focusing necessary to perceive the environment in high detail.
- The Intelligent Vision System's state-of-the-art attention mechanism uses a controller with deep learning capabilities and algorithms. This achieves enhanced resolutions and improved attention-driven focus, with resolutions greater than the human eye can achieve, greater distance viewing capabilities, and 360° surround vision.
- Such intelligent vision systems and methods may consist of an all-in-one camera system that thinks and acts intelligently like a human and has the accuracy, efficiency, and speed for real-time applications with better-than-human sight.
- Controlling a hierarchical-view camera system is achieved by the disclosed attention mechanisms, which include deep learning, reinforcement learning, and imitation of humans via behavioral cloning.
- The disclosed intelligent vision systems and methods provide a wider area of vision coverage, higher resolution, and further distance acuity than previously achieved by known systems. This includes 360° field-of-view coverage with greater-than-human-eye resolution (beyond 120 pixels/degree), providing up to 10 times (10x) the focal distance of other known solutions.
- When monitoring a designated area, the disclosed intelligent vision systems and methods reduce the number of cameras required to cover the area. This is accomplished through the use of an advanced artificial intelligence system which drives the camera's rapid movement to the SOI utilizing an attention system.
- The intelligent vision systems and methods herein are based on a foveal view with an attention model that utilizes a hierarchical set of video image streams.
- Each video image stream includes an appropriate FOV to bring high resolution attention to the SOI.
- HRI Human-Robot Interaction
- The camera(s) is situated on the bottom left of the HRI Attention game table, and the nine simulated human faces (in this case, annotated balls) are positioned at various locations within the FOV of the camera(s).
- The task for the camera(s) AI is to locate the face that is looking at it. For each step within the episode, the AI rotates the camera(s) by one degree in any direction.
- The reward structure has two components: for each time step, the AI receives a reward of +1 if it can locate the correct face or a punishment of -0.1 if it cannot.
- Each episode has a maximum of 50 steps, as in the sketch below.
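A minimal sketch of this game as a reinforcement-learning environment follows. The class name, angular coordinates, on-target tolerance, and the nine-way action table (eight directions plus an assumed hold action) are illustrative assumptions, not details taken from the patent.

```python
import random

# Hypothetical action table: the eight cardinal/diagonal one-degree moves
# plus a "hold" action (assumed; the text mentions eight directions).
DIRECTIONS = [(0, 1), (1, 1), (1, 0), (1, -1),
              (0, -1), (-1, -1), (-1, 0), (-1, 1), (0, 0)]

class AttentionGameEnv:
    """Toy HRI Attention game: nine simulated faces at random angular
    positions; one target face 'looks at' the camera."""

    MAX_STEPS = 50   # each episode has a maximum of 50 steps
    STEP_DEG = 1.0   # the AI rotates the camera by one degree per step

    def __init__(self, n_faces=9):
        self.faces = [(random.uniform(-60, 60), random.uniform(-30, 30))
                      for _ in range(n_faces)]
        self.target = random.randrange(n_faces)  # the face facing the camera
        self.pan = self.tilt = 0.0
        self.steps = 0

    def step(self, action: int):
        d_pan, d_tilt = DIRECTIONS[action]
        self.pan += d_pan * self.STEP_DEG
        self.tilt += d_tilt * self.STEP_DEG
        self.steps += 1
        tx, ty = self.faces[self.target]
        on_target = abs(self.pan - tx) < 1.0 and abs(self.tilt - ty) < 1.0
        reward = 1.0 if on_target else -0.1  # +1 on the correct face, -0.1 otherwise
        done = on_target or self.steps >= self.MAX_STEPS
        return (self.pan, self.tilt), reward, done
```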
- The position of the simulated human faces was adjusted by moving the faces towards or away from the camera and measuring the time required for the AI deep learning neural network to focus its attention on the ball with the simulated human face best facing the camera.
- The goal of the experiments was to measure the time needed to achieve attention focus for different numbers of cameras and varying FOVs.
- Table 1 summarizes the camera and FOV configurations for each experiment and the results of the experiments in the last two columns.
- Figure 4 provides a graphical representation of experiment results.
- Each task setting was trained with an AI agent using each of the foveal representations shown, until convergence.
- The performance of all the experiments is summarized in Table 1. The values were computed as the average score over the last 100 trials. There are large variations in performance depending on the foveal representation used; it is a crucial hyper-parameter for the attention model.
- PPO Proximal Policy Optimization
- DNN Dynamic Neural Network
- The state representation is given by s_t ∈ ℝ^(2·n_c), with n_c the number of cameras and the state for each camera being its current orientation along two axes of rotation.
- Each camera also provides an observation in the form of an RGB image o_t^(i) ∈ ℝ^(64×64×3).
- The policy output is a "softmax" probability distribution π(a_t | s_t) ∈ ℝ^9, where the output actions drive the motion of the cameras along one of the eight cardinal directions.
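A minimal actor network consistent with these shapes is sketched below using TensorFlow/Keras (named later in the description). All layer sizes are illustrative guesses rather than the patent's architecture, and the ninth action is presumed to be a hold/stay action.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_policy(n_cameras: int) -> tf.keras.Model:
    """Actor taking one 64x64x3 observation per camera plus the 2*n_c
    orientation state, emitting a softmax over 9 actions."""
    obs_in = [layers.Input(shape=(64, 64, 3), name=f"cam_{i}")
              for i in range(n_cameras)]
    state_in = layers.Input(shape=(2 * n_cameras,), name="orientations")

    feats = []
    for o in obs_in:  # small CNN per camera stream (sizes assumed)
        x = layers.Conv2D(16, 5, strides=2, activation="relu")(o)
        x = layers.Conv2D(32, 3, strides=2, activation="relu")(x)
        feats.append(layers.Flatten()(x))

    h = layers.Concatenate()(feats + [state_in])
    h = layers.Dense(128, activation="relu")(h)
    pi = layers.Dense(9, activation="softmax", name="pi")(h)  # pi(a_t | s_t) in R^9
    return tf.keras.Model(inputs=obs_in + [state_in], outputs=pi)
```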
- FIGS 5 through 10 illustrate one preferred embodiment of the Intelligent Vision Systems and Methods described herein.
- This preferred embodiment in Figure 5 includes a head (10) to which are mounted cameras (12), (13) and (14).
- Head (10) includes a pan/tilt end-effector (15) which controls the direction of the cameras (12), (13) and (14) with their corresponding lenses (18), (19) and (20).
- A separate pan mechanism (15) and tilt mechanism (17) provide additional control of the camera positioning.
- Head (10) has a fast real-time actuator (22) with a movement frequency of 10 to over 60 actions/second.
- Fast real-time actuator (22) is controlled by processor (23). This provides for rapid movement and rotation of the cameras (12), (13), (14) to desired specific pan/tilt angles and locations.
- The Intelligent Vision Systems and Methods can include any number of cameras, camera combinations, camera proximities, camera networks (IP or otherwise), or integrated cameras (i.e., cameras with multiple sensors), which may be used to realize the desired and improved Intelligent Vision Systems and Methods.
- IP camera network
- integrated cameras i.e. cameras with multiple sensors
- The Intelligent Vision System utilizes a controller for performing the real-time control of the servos and cameras.
- The controller is hardware with embedded firmware.
- The controller utilizes the output from the artificial intelligence deep learning neural network algorithm to manage the positioning of the pan/tilt cameras by issuing commands to the servos.
- Servo-driven mechanical movement is achieved using gimbals, gearing, pivots, and other mechanical movement devices.
- The real-time rotation of the end-effector pan/tilt mechanisms is achieved with fast servos.
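As a rough illustration, the sketch below maps a nine-way action output onto incremental pan/tilt commands for such servos. The serial port, baud rate, and ASCII wire protocol are hypothetical placeholders; real servo hardware dictates its own interface.

```python
import serial  # pyserial; the wire protocol below is a made-up example

class PanTiltController:
    """Turns the network's discrete action into absolute pan/tilt targets
    and sends them to the servo end-effector over a serial link."""

    # action index -> (d_pan, d_tilt); index 8 holds position (assumption)
    DIRECTIONS = [(0, 1), (1, 1), (1, 0), (1, -1),
                  (0, -1), (-1, -1), (-1, 0), (-1, 1), (0, 0)]

    def __init__(self, port: str = "/dev/ttyUSB0", step_deg: float = 1.0):
        self.link = serial.Serial(port, baudrate=115200)
        self.step = step_deg
        self.pan = self.tilt = 0.0

    def apply(self, action: int) -> None:
        d_pan, d_tilt = self.DIRECTIONS[action]
        self.pan += d_pan * self.step
        self.tilt += d_tilt * self.step
        # hypothetical ASCII protocol: "P<pan> T<tilt>\n"
        self.link.write(f"P{self.pan:.1f} T{self.tilt:.1f}\n".encode())
```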
- Start again from step 1 to train the system and further the deep learning process. All of these steps happen simultaneously and rapidly.
- The Intelligent Vision Systems and Methods leverage open-source software for image capture and for preprocessing the images.
- OpenCV™ is used for image capture and processing. OpenCV provides a common infrastructure for computer vision applications and machine perception. This software library can be used to detect and recognize faces, identify objects, classify human actions in videos, track camera movements, track moving objects, extract 3D models of objects, produce 3D point clouds from stereo cameras, stitch images together to produce a high-resolution image of an entire scene, find similar images in an image database, remove red eyes from images taken using flash, follow eye movements, recognize scenery and establish markers to overlay it with augmented reality, and more.
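For instance, a minimal OpenCV capture-and-detect step might look like the sketch below; the camera index and the stock Haar cascade are assumptions standing in for the unspecified capture pipeline.

```python
import cv2  # OpenCV

# Grab one frame from the first camera and run the bundled Haar-cascade
# face detector as a stand-in for the system's face-finding step.
cap = cv2.VideoCapture(0)
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

ok, frame = cap.read()
if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    print(f"detected {len(faces)} face(s)")
cap.release()
```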
- TensorFlow™ is used for control and to construct and train the neural network to produce the desired behavior, thus achieving the desired classification outputs.
- TensorFlow is an end-to-end open-source platform for machine learning that supports numerical computation, making machine learning faster and easier.
- TensorFlow provides an abstraction that includes machine learning and deep learning (i.e., neural network) models and algorithms, and manages the various options for directing the output of one function to the input of another.
- ROS™ Robot Operating System
- ROS is a flexible robot software framework. It is a collection of tools, libraries, and conventions that aim to simplify the task of creating complex and robust robot behavior across a wide variety of robotic platforms.
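One plausible way ROS could carry the attention model's output to a servo driver is sketched below; the node name, topic, message type, and 60 Hz rate are illustrative assumptions (the rate matches the actuator's upper bound mentioned above).

```python
import rospy
from geometry_msgs.msg import Vector3  # pan in x, tilt in y (assumed packing)

# Publish pan/tilt targets for a downstream servo-driver node to consume.
rospy.init_node("intelligent_vision_head")
pub = rospy.Publisher("/head/pan_tilt_target", Vector3, queue_size=1)

rate = rospy.Rate(60)  # up to 60 actions/second, per the actuator spec
while not rospy.is_shutdown():
    pan_deg, tilt_deg = 12.5, -3.0  # placeholder: would come from the network
    pub.publish(Vector3(x=pan_deg, y=tilt_deg, z=0.0))
    rate.sleep()
```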
- The Intelligent Vision Systems and Methods disclosed include software with computer vision code to read the images from the cameras and process them in real time.
- The input images from the cameras are divided into multiple streams: a 120° (for example) FOV camera captures the full image, the center part of the image representing a 60° (for example) FOV is extracted, and both images are then resized to a specific resolution to feed a deep controller neural network.
- Such a configuration will reduce the computational demand significantly without losing fine detail in the 60° FOV.
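A sketch of this two-stream split is shown below, assuming a simple linear center crop (a flat-projection approximation of the 60°-in-120° relationship) and an arbitrary 64×64 output resolution.

```python
import cv2

def foveal_streams(frame, full_fov_deg=120.0, center_fov_deg=60.0,
                   out_size=(64, 64)):
    """Split one wide-FOV frame into a hierarchical pair: the full view
    plus a centre crop approximating the narrower FOV, both resized to
    the same resolution for the controller network."""
    h, w = frame.shape[:2]
    ratio = center_fov_deg / full_fov_deg        # 0.5 for 60 deg of 120 deg
    cw, ch = int(w * ratio), int(h * ratio)
    x0, y0 = (w - cw) // 2, (h - ch) // 2
    center = frame[y0:y0 + ch, x0:x0 + cw]

    wide_stream = cv2.resize(frame, out_size)    # low detail, wide context
    fovea_stream = cv2.resize(center, out_size)  # same pixel budget, 2x acuity
    return wide_stream, fovea_stream
```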
- Each camera sensor may provide a signal that generates a stream of data as detected by the sensor.
- Sensors can be one or more of the following:
- Zoom camera sensor (12) develops the zoom camera stream (24);
- Wide camera sensor (13) develops the wide camera stream (26);
- Surround camera sensor (14) develops the surround camera stream (28).
- The zoom camera (24), wide camera (26), and surround camera (28) streams may be fed into software level streams at level (0) to level (4).
- Such feeds may include:
- Zoom camera stream (24) is fed into software level streams 0 and 1 (28), (30)
- Software level streams 0 to 4 are processed by deep learning layers (38, 40, 42, 44, and 46). These deep learning layers (38, 40, 42, 44, and 46) are concatenated (48) and fed to long short-term memory (LSTM) modules T1, T2, T3 ... Tn (50, 52, 54, 56). The outputs (58, 60, 62, 64) of LSTM modules T1 to Tn (50, 52, 54, 56) can be classified and placed into memory (66, 68, 70, and 72) for future use.
- LSTM long short-term memory
- LSTM outputs (66, 68, 70, 72) provide feedback through feedback loop (74) to adjust the pan and tilt mechanism controlled by computational processor (23).
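The sketch below is one plausible Keras rendering of that pipeline (per-stream deep learning layers, concatenation, then an LSTM over time steps T1 ... Tn). The number of streams, sequence length, filter counts, and class count are all illustrative assumptions, not values from the patent.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_stream_lstm(n_streams=5, seq_len=8, size=64, n_classes=10):
    """Per-stream CNN features, concatenated and fed through an LSTM,
    with a per-time-step classification head."""
    inputs = [layers.Input(shape=(seq_len, size, size, 3), name=f"level_{i}")
              for i in range(n_streams)]
    feats = []
    for s in inputs:  # one small CNN per software-level stream
        x = layers.TimeDistributed(
            layers.Conv2D(16, 5, strides=2, activation="relu"))(s)
        x = layers.TimeDistributed(layers.Flatten())(x)
        feats.append(x)

    merged = layers.Concatenate()(feats)                 # concatenation step
    h = layers.LSTM(128, return_sequences=True)(merged)  # modules T1 ... Tn
    out = layers.TimeDistributed(
        layers.Dense(n_classes, activation="softmax"))(h)
    return tf.keras.Model(inputs, out)
```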
- A configuration of lenses (18), (19), and (20) with multiple progressive centrist FOV parameters is used to provide the desired focal lengths and properties for the Intelligent Vision System and Methods.
- FOV parameters include, but are not limited to, 360°, 120°, and 10°, which provides for human-level resolution of approximately 120 pixels/degree.
- Another embodiment may include, but is not limited to, FOV parameters of 360°, 120°, 60°, 10°, and 2°, which provides beyond-human-level resolution of more than 120 pixels/degree.
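As a back-of-envelope check of these figures, the snippet below computes pixels/degree for each FOV under an assumed 3840-pixel-wide (4K) sensor; the patent does not specify sensor resolutions.

```python
# Pixels/degree for each lens FOV, assuming a 4K-wide sensor per camera.
sensor_width_px = 3840
for fov_deg in (360, 120, 60, 10, 2):
    print(f"{fov_deg:>3} deg FOV -> {sensor_width_px / fov_deg:7.1f} px/degree")

# 10 deg -> 384 px/degree and 2 deg -> 1920 px/degree, well beyond the
# human fovea's ~120 px/degree; 120 deg -> 32 px/degree for wide context.
```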
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Human Computer Interaction (AREA)
- Studio Devices (AREA)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201862745346P | 2018-10-13 | 2018-10-13 | |
US62/745,346 | 2018-10-13 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020075147A1 (en) | 2020-04-16 |
Family
ID=70164611
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2019/058721 WO2020075147A1 (en) | 2018-10-13 | 2019-10-13 | Intelligent vision system and methods |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2020075147A1 (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090324010A1 (en) * | 2008-06-26 | 2009-12-31 | Billy Hou | Neural network-controlled automatic tracking and recognizing system and method |
US9760806B1 (en) * | 2016-05-11 | 2017-09-12 | TCL Research America Inc. | Method and system for vision-centric deep-learning-based road situation analysis |
US20180060675A1 (en) * | 2016-09-01 | 2018-03-01 | Samsung Electronics Co., Ltd. | Method and apparatus for controlling vision sensor for autonomous vehicle |
US20180285648A1 (en) * | 2017-03-30 | 2018-10-04 | The Boeing Company | Automated object tracking in a video feed using machine learning |
JP6404527B1 (en) * | 2016-11-30 | 2018-10-10 | 株式会社オプティム | Camera control system, camera control method, and program |
- 2019-10-13: WO PCT/IB2019/058721 patent/WO2020075147A1/en (active: Application Filing)
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19871255; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 19871255; Country of ref document: EP; Kind code of ref document: A1 |
| 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 09.08.2023) |