WO2018052714A2 - Video to radar conversion - Google Patents

Video to radar conversion

Info

Publication number
WO2018052714A2
Authority
WO
WIPO (PCT)
Prior art keywords
user
cnn
objects
processor
image
Prior art date
Application number
PCT/US2017/049327
Other languages
English (en)
Other versions
WO2018052714A3 (fr)
Inventor
Iain Melvin
Eric Cosatto
Igor Durdanovic
Hans Peter Graf
Original Assignee
Nec Laboratories America, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US15/689,755 external-priority patent/US10330787B2/en
Application filed by Nec Laboratories America, Inc. filed Critical Nec Laboratories America, Inc.
Publication of WO2018052714A2 publication Critical patent/WO2018052714A2/fr
Publication of WO2018052714A3 publication Critical patent/WO2018052714A3/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24143 Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/251 Fusion techniques of input or preprocessed data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle

Definitions

  • the present invention relates to Advanced Driver-Assistance Systems (ADAS) and more particularly to ADAS involving video to radar.
  • a key step is to obtain the relative positions and velocities of these surrounding objects from various sensors and create a top-view map representation of the surrounding driving scene.
  • current ADAS are not without deficiency. Accordingly, there is a need for an improved approach for ADAS.
  • a system includes an image capture device configured to capture image data relative to an ambient environment of a user.
  • the system further includes a processor configured to detect and localize objects, in a real-world map space, from the image data using a trainable object localization Convolutional Neural Network (CNN).
  • the CNN is trained to detect and localize the objects from image and radar pairs that include the image data and radar data for different scenes of a natural environment.
  • the processor is further configured to perform a user-perceptible action responsive to a detection and a localization of an object in an intended path of the user.
  • a computer-implemented method includes capturing, by an image capture device, image data relative to an ambient environment of a user.
  • the method further includes detecting and localizing, by a processor, objects, in a real-world map space, from the image data using a trainable object localization Convolutional Neural Network (CNN).
  • the method also includes performing, by the processor, a user-perceptible action responsive to a detection and a localization of an object in an intended path of the user.
  • the CNN is trained to detect and localize the objects from image and radar pairs that include the image data and radar data for different scenes of a natural environment.
  • a computer program product includes a non-transitory computer readable storage medium having program instructions embodied therewith.
  • the program instructions are executable by a computer to cause the computer to perform a method.
  • the method includes capturing, by an image capture device, image data relative to an ambient environment of a user.
  • the method further includes detecting and localizing, by a processor, objects, in a real-world map space, from the image data using a trainable object localization Convolutional Neural Network (CNN).
  • the method also includes performing, by the processor, a user-perceptible action responsive to a detection and a localization of an object in an intended path of the user.
  • the CNN is trained to detect and localize the objects from image and radar pairs that include the image data and radar data for different scenes of a natural environment.
  • FIG. 1 shows the present invention in an exemplary environment to which the present invention can be applied, in accordance with an embodiment of the present invention
  • FIG. 2 shows the present invention in another exemplary environment to which the present invention can be applied, in accordance with an embodiment of the present invention
  • FIG. 3 shows the present invention in yet another exemplary environment to which the present invention can be applied, in accordance with an embodiment of the present invention
  • FIG. 4 shows an exemplary processing system to which the present principles may be applied, in accordance with an embodiment of the present principles
  • FIG. 5 shows an exemplary system for object tracking, in accordance with an embodiment of the present invention
  • FIG. 6 shows an exemplary method for training the system of FIG. 5, in accordance with an embodiment of the present invention
  • FIG. 7 shows an exemplary Convolutional Neural Network (CNN), in accordance with an embodiment of the present invention.
  • FIG. 8-10 show an exemplary method for assisted driving, in accordance with an embodiment of the present invention.
  • the present invention is directed to Advanced Driver-Assistance Systems (ADAS) involving video to radar.
  • the present invention provides a trained Convolutional Neural Network (CNN) model (hereinafter interchangeably referred to as "CNN" in short) able to output such a top-view map representation of the surrounding driving scene directly from a monocular video input stream.
  • the present invention provides a system having a "trainable object localization convolutional neural network” which enables the simultaneous detection and localization in a "2-dimensional map view" of one or multiple objects from a camera image (or series of images).
  • the system is trained using pairs of: (a) a camera image, or series of video images; and (b) a list of objects and their real world positions acquired from some other source (e.g., radar).
  • while embodiments may use only image data (and possibly velocity data) for detection, the detection will exploit correlations in the training data between the image data and the radar data, so as to essentially use radar data (from training) during the detection and thus involve multiple information sources.
  • the present invention predicts the positions of the objects in "real world x,y map space" and not in the space of the input image as per the prior art.
  • the complete system simultaneously solves object detection, depth estimation and projection into real world coordinates, thus overcoming at least one significant deficiency of the prior art. It is to be appreciated that other advantages/features of the present invention over the prior art are described in further detail herein below.
  • the present invention can be enabled by one or more of the following three ideas: (1) transformation of positional labelled data to the internal geometry of a CNN architecture; (2) the use of internal CNN layers as "depth” layers; and (3) the use of a MAX function to vertically collapse the internal layers of the CNN into the map view.
  • FIG. 1 shows the present invention in an exemplary environment 100 to which the present invention can be applied, in accordance with an embodiment of the present invention.
  • a user 188 is located in a scene with multiple objects 199, each having their own locations and trajectories.
  • the user 188 may or may not be in a vehicle.
  • user 188 is shown walking in the scene, while another user 189 is shown within a vehicle.
  • the system of the present invention (e.g., system 500) may interface with the user through a smart phone 171 or other device of the user.
  • the system of the present invention (e.g., system 500) may interface with the user through a vehicle 172 that the user is operating.
  • FIGs. 2 and 3 are specifically directed to embodiments where the user is operating an emergency vehicle and a non-emergency vehicle, respectively.
  • the system of the present invention can interface with the user in order to make the user aware of any objects in the user's trajectory. That is, in an embodiment, the user can be provided with a list of objects and their respective locations (e.g., through smart phone 171 and/or vehicle 172). The list of objects can be provided visually, audibly, and/or so forth. In this way, the user can navigate around these objects 199 to avoid potential collisions.
  • the present invention can detect and locate objects such as poles, garbage cans, vehicles, persons, and so forth in the path of a user (or, e.g., in the path of an object (e.g., vehicle) in which the user is traveling).
  • detection results will likely be more accurate for larger objects.
  • such a user may be texting and walking without being cognizant of their surroundings. People have often walked into bodies of water or other objects while distracted by another activity such as, but not limited to, texting.
  • the present invention can provide an audible alert to indicate to the user 188 that an object is in their path so that they can avoid colliding with the object.
  • FIG. 2 shows the present invention in another exemplary environment 200 to which the present invention can be applied, in accordance with an embodiment of the present invention.
  • in the environment 200, a user 288 is located in a scene with multiple objects 299, each having their own locations and trajectories.
  • the user 288 is operating an emergency vehicle 272 (e.g., an ambulance, a police car, a fire truck, and so forth).
  • the emergency vehicle 272 is a police car.
  • the system of the present invention may interface with the user through one or more systems of the emergency vehicle 272 that the user is operating.
  • the system of the present invention can provide the user information through a system 272A (e.g., a display system, a speaker system, and/or some other system) of the emergency vehicle 272.
  • the system of the present invention (e.g., system 500) may interface with the emergency vehicle 272 itself (e.g., through one or more systems of the emergency vehicle 272 including, but not limited to, a steering system, a braking system, an acceleration system, etc.) in order to control the vehicle or cause the emergency vehicle 272 to perform one or more actions.
  • the user or the emergency vehicle 272 itself can navigate around these objects 299 to avoid potential collisions.
  • FIG. 3 shows the present invention in yet another exemplary environment 300 to which the present invention can be applied, in accordance with an embodiment of the present invention.
  • a user 388 is located in a scene with multiple objects 399, each having their own locations and trajectories.
  • the user 388 is operating a non-emergency vehicle 372 (e.g., a car, a truck, a motorcycle, etc., that is not operated specifically for emergencies).
  • the system of the present invention may interface with the user through one or more systems of the non-emergency vehicle 372 that the user is operating.
  • the system of the present invention can provide the user information through a system 372A (e.g., a display system, a speaker system, and/or some other system) of the non-emergency vehicle 372.
  • system 500 may interface with the non-emergency vehicle 372 itself (e.g., through one or more systems of the non-emergency vehicle 372 including, but not limited to, a steering system, a braking system, an acceleration system, etc.) in order to control the vehicle or cause the non-emergency vehicle 372 to perform one or more actions.
  • the user or the non-emergency vehicle 372 itself can navigate around these objects 399 to avoid potential collisions.
  • FIG. 4 shows an exemplary processing system 400 to which the present principles may be applied, in accordance with an embodiment of the present principles.
  • the processing system 400 includes at least one processor (CPU) 404 operatively coupled to other components via a system bus 402.
  • a first storage device 422 and a second storage device 424 are operatively coupled to system bus 402 by the I/O adapter 420.
  • the storage devices 422 and 424 can be any of a disk storage device (e.g., a magnetic or optical disk storage device), a solid state magnetic device, and so forth.
  • the storage devices 422 and 424 can be the same type of storage device or different types of storage devices.
  • a speaker 432 is operatively coupled to system bus 402 by the sound adapter 430.
  • a transceiver 442 is operatively coupled to system bus 402 by network adapter 440.
  • a display device 462 is operatively coupled to system bus 402 by display adapter 460.
  • a first user input device 452, a second user input device 454, and a third user input device 456 are operatively coupled to system bus 402 by user interface adapter 450.
  • the user input devices 452, 454, and 456 can be any of a keyboard, a mouse, a keypad, an image capture device, a motion sensing device, a microphone, a device incorporating the functionality of at least two of the preceding devices, and so forth. Of course, other types of input devices can also be used, while maintaining the spirit of the present principles.
  • the user input devices 452, 454, and 456 can be the same type of user input device or different types of user input devices.
  • the user input devices 452, 454, and 456 are used to input and output information to and from system 400.
  • processing system 400 may also include other elements (not shown), as readily contemplated by one of skill in the art, as well as omit certain elements.
  • various other input devices and/or output devices can be included in processing system 400, depending upon the particular implementation of the same, as readily understood by one of ordinary skill in the art.
  • various types of wireless and/or wired input and/or output devices can be used.
  • additional processors, controllers, memories, and so forth, in various configurations can also be utilized as readily appreciated by one of ordinary skill in the art.
  • system 500 described below with respect to FIG. 5 is a system for implementing respective embodiments of the present principles. Part or all of processing system 400 may be implemented in one or more of the elements of system 500.
  • processing system 400 may perform at least part of the method described herein including, for example, at least part of method 600 of FIG. 6 and/or at least part of method 800 of FIGs. 8-10.
  • part or all of system 500 may be used to perform at least part of method 600 of FIG. 6 and/or at least part of method 800 of FIGs. 8-10.
  • FIG. 5 shows an exemplary system 500 for object tracking, in accordance with an embodiment of the present invention.
  • System 500 is also referred to herein as "object detection and locating system" 500.
  • the system 500 includes a camera 510 and a processor 511.
  • the system 500 can receive one or more images from the camera 510 (or some other source), and process the images using the processor 511 to output a list of object (e.g., car, person, pole, tree, garbage can, etc.) positions 512.
  • video can correspond to two or more of the images (which can be used to show motion of objects across the frames).
  • the system 500 can include a radar system 501 for generating radar data.
  • the system can omit the radar system 501, and obtain the radar data from an external source.
  • the system 500 can access one or more remote radar systems and/or remote repositories of radar data.
  • the images from the camera 510 and the radar data from the radar system 501 can be used to form and/or otherwise derive image and radar pairs that can be processed by the processor 511 to train the CNN.
  • the processor 511 performs processing including image preprocessing 520, Convolutional Neural Network (CNN) processing (also referred to as "CNN” in short) 521, and post-processing 522.
  • the processor 511 is capable of interfacing with systems of an emergency motor vehicle in order to control the functioning of such systems, for example, as described in further detail with respect to method 800.
  • the image preprocessing 520 can involve, for example, extracting 530 N (where N is an integer, e.g., but not limited to, 3) RGB frames from the video input, correcting 531 for barrel distortion (the "fish-eye" lens effect one gets from a wide angle camera lens such as those that are installed in cars), and cropping and/or scaling 532 the images.
  • the images are cropped to the lower region, which is where the road ahead and the cars are (thus cutting off the sky).
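  • the preprocessing steps above can be illustrated with a minimal numpy sketch. The crop fraction, frame sizes, and nearest-neighbor rescaling below are assumptions for illustration only; a real implementation would also correct barrel distortion using a calibrated lens model, which is omitted here.

```python
import numpy as np

def preprocess_frames(frames, crop_top_frac=0.4, out_hw=(120, 320)):
    """Crop each RGB frame to its lower (road) region and rescale.

    frames: list of HxWx3 uint8 arrays (N sequential video frames).
    crop_top_frac: illustrative fraction of the height to discard from
                   the top (cuts off the sky, keeps the road and cars).
    out_hw: target (height, width) after a simple nearest-neighbor rescale.
    """
    out = []
    for f in frames:
        h, w, _ = f.shape
        cropped = f[int(h * crop_top_frac):, :, :]  # keep the lower region
        ch, cw, _ = cropped.shape
        # nearest-neighbor index maps for the rescale
        rows = (np.arange(out_hw[0]) * ch // out_hw[0]).astype(int)
        cols = (np.arange(out_hw[1]) * cw // out_hw[1]).astype(int)
        out.append(cropped[rows][:, cols])
    return np.stack(out)  # shape (N, out_h, out_w, 3)
```

The stacked result can then be fed to the CNN as N input frames.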
  • the post-processing 522 processes the output of the CNN 521.
  • one pixel in the output image of the CNN corresponds to a single possible detection of an object in the input image. It is therefore typical to use Non-maximal suppression directly on the output image pixels to cull detections with overlapping corresponding input windows so that only a set of non-overlapping "maximal" detections remain, and those are reported.
  • individual object detections are composed of a "filled rectangle" of pixels.
  • for any given position of a car in real-world coordinates, there exists a corresponding rectangle in the (distorted) output space of the CNN 521. These rectangles are also different sizes in the output space of the CNN 521 according to how distant the car is.
  • the output of the CNN 521 should have high output values covering the area of a proposed rectangle, that is, the output of the CNN 521 should "paint" the car as a rectangle of appropriate size at the appropriate location in the output.
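  • the "painting" of target rectangles can be sketched as follows; the output-grid size and the rectangle parameters are illustrative assumptions, and the projection of real-world positions into output coordinates (described elsewhere herein) is assumed to have been done already.

```python
import numpy as np

def paint_targets(detections, out_h=45, out_w=64):
    """Create a target label image by painting one filled rectangle per
    object in the (distorted) output space of the CNN.

    detections: list of (row, col, half_h, half_w) rectangles already
                projected into output coordinates; more distant objects
                would be given smaller half-sizes.
    """
    y = np.zeros((out_h, out_w), dtype=np.float32)
    for r, c, hh, hw in detections:
        r0, r1 = max(0, r - hh), min(out_h, r + hh + 1)
        c0, c1 = max(0, c - hw), min(out_w, c + hw + 1)
        y[r0:r1, c0:c1] = 1.0  # high target values cover the rectangle
    return y
```

The CNN is then trained so that its output "paints" each car in the same way.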
  • FIG. 6 shows an exemplary method 600 for training the system 500 of FIG. 5, in accordance with an embodiment of the present invention.
  • step 610 input video and radar pairs 671 corresponding to driving trips.
  • the system 500 is trained with approximately 660 video and radar pairs corresponding to driving trips.
  • the video and radar pairs 671 can include video 671B taken from inside the subject car and radar 671A which is recorded from a device attached to the front bumper of the car.
  • step 620 extract N sequential video image frames 672 from the dataset at random.
  • step 630 preprocess the N sequential video image frames 672 to obtain N preprocessed sequential video image frames 673.
  • step 640 input the N preprocessed sequential video frames 673 to the modified CNN 674.
  • step 640 can include step 640A.
  • Step 640A: input the subject car's velocity 671C from the CAN (Controller Area Network) data for that frame.
  • Step 640A can be performed, for example, when velocities are to be estimated for the cars in the scene.
  • step 650 extract the radar data 671A from the dataset corresponding to the last image frame and create a target label image "Y" for the CNN 674 to produce from the radar data.
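  • steps 620 and 650 above can be sketched as follows; the array shapes and the helper name are illustrative assumptions, and the preprocessing and target-painting steps described elsewhere herein would then be applied to the returned pair.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_pair(video, radar, n_frames=3):
    """Assemble one training pair per steps 620 and 650.

    video: array of shape (T, H, W, 3) - frames from one driving trip.
    radar: list of length T - per-frame lists of radar-detected objects.
    Returns N sequential frames ending at a random frame, together with
    the radar objects for that last frame (used to paint the target "Y").
    """
    t = int(rng.integers(n_frames - 1, len(video)))  # random last frame
    frames = video[t - n_frames + 1 : t + 1]         # N sequential frames
    target_radar = radar[t]                          # radar for last frame
    return frames, target_radar
```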
  • FIG. 7 shows an exemplary Convolutional Neural Network (CNN) 700, in accordance with an embodiment of the present invention.
  • the CNN 700 can be, for example, CNN 521 of FIG. 5.
  • CNN 700 The description of CNN 700 is made relative to a camera 781 and a radar device 782.
  • the camera 781 has a video Field of View (FOV) 781A, and the radar device 782 has a radar FOV 782A.
  • the CNN 700 includes an input 701 and "N" feature planes 702.
  • the input 701 can include N x RGB image frames. In an embodiment, 3 image frames are used that span approximately 0.5 seconds. Of course, other numbers of image frames, involving different spans, can also be used, while maintaining the spirit of the present invention.
  • CAN data can be input in the case when output velocities are to be estimated/used.
  • components 701 and 702 can be traditional CNN components.
  • a traditional CNN would output class predictions corresponding to every window position at the input (an example window position is the square 799 shown on the input image in FIG. 7), by using the N features for each window position at the "N" feature planes 702 to output "M" probabilities (for example) of the class of detected object being framed by the input window.
  • the present invention takes the "N" features for every input window and trains "NxM" 1x1 kernels 703 to map those features into a "depth ray", a vector of length "max depth" 706.
  • Each such column in the output of the CNN 700 at this point 706 corresponds to a ray in space from the camera origin through the center of the CNN window in the input image.
  • the present invention uses 1x1 kernels to map from "N" traditional CNN feature planes to "Max Depth" output planes.
  • a MAX operation 704 is used to completely remove the Z dimension in the output of the CNN 700. This leaves a 2-dimensional output in the top-down (or map-view) perspective.
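  • the 1x1 kernel mapping and the MAX collapse can be sketched in plain numpy, since a 1x1 convolution is just a per-position linear map over the feature planes; the plane counts and kernel values below are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

def depth_rays_and_collapse(features, kernels):
    """Map N feature planes to "max depth" planes with 1x1 kernels, then
    collapse the vertical dimension with a MAX operation.

    features: (N, H, W) - N feature planes from the convolutional stack.
    kernels:  (D, N)    - one 1x1 kernel per output depth plane
                          (D = max depth, e.g. 45).
    Returns a (D, W) map-view output: each column is a depth ray through
    one input window position, with the vertical extent max-pooled away.
    """
    # a 1x1 convolution == a matrix multiply over the channel axis
    depth_planes = np.einsum("dn,nhw->dhw", kernels, features)  # (D, H, W)
    return depth_planes.max(axis=1)                             # (D, W)
```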
  • the input data from real-world coordinates is projected (and distorted) 705 into the output space of the CNN 700. Objects are "painted" in this distorted space to reflect the positions and dimensions of the cars.
  • the distance from the base of the view cone is converted to a distance on a log scale to map a range of approximately 0 to 150 meters to a range from 0 to 45 pixels (max depth) in the output.
  • other distances, numbers of pixels, and so forth can also be used, while maintaining the spirit of the present invention.
  • This log-scale conversion gives more focus to learning to predict the positions of cars nearer the camera.
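  • a log-scale distance-to-pixel mapping of the kind described could look like the following; the exact formula and its scale constant are assumptions for illustration, since the disclosure specifies only the approximately 0 to 150 meter range, the 0 to 45 pixel range, and the use of a log scale.

```python
import numpy as np

MAX_RANGE_M = 150.0   # far end of the view cone, in meters
MAX_DEPTH_PX = 45     # depth resolution (max depth) of the CNN output

def meters_to_depth_pixel(d, scale=10.0):
    """Map a distance in meters to an output depth pixel on a log scale,
    so that nearby cars receive more depth resolution than distant ones."""
    norm = np.log1p(d / scale) / np.log1p(MAX_RANGE_M / scale)
    return int(round(norm * MAX_DEPTH_PX))
```

With these assumed constants, the 10 meters nearest the camera span several output pixels, while 10 meters near the far end of the range collapse to about one pixel.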
  • the width of the object is used to determine the dimensions of the painted object in the output space at the projected location. In this way, the "painted" size of an object in the output space of the CNN changes as it moves from close to the subject car (large) to the distance (small).
  • FIGs. 8-10 show an exemplary method 800, in accordance with an embodiment of the present invention.
  • step 810 corresponds to a training stage and steps 820-850 correspond to an inference stage. Further detail regarding the training stage is described relative to FIG. 6.
  • step 810 train the CNN to detect and localize objects.
  • step 810 can include one or more of steps 810A and 810B.
  • step 810A train the CNN to detect and localize objects from image and radar pairs that include (i) image data and (ii) radar data.
  • the image data and/or the radar data can preferably be for different scenes of a natural environment.
  • the image data and/or radar data can preferably be for different driving scenes of a natural driving environment.
  • step 810B train the CNN to detect and localize objects from image and object location list pairs that include (i) image data and (ii) object location data for objects including and other than vehicles.
  • step 820 capture image data relative to an ambient environment of the user. For example, capture image data relative to an outward view from a vehicle operated by the user.
  • the image pre-processing can involve, for example, one or more of the following operations: (1) extract RGB frames from an input image sequence; (2) correct for barrel lens distortion; and (3) crop and/or scale.
  • step 840 detect and localize objects, in a real-world map space, from the image data using a trainable object localization Convolutional Neural Network (CNN).
  • CNN trainable object localization Convolutional Neural Network
  • step 840 can be performed to detect and localize objects, at all scales, in a single pass.
  • step 840 can include one or more of steps 840A, 840B, and 840C. Other aspects of the operation of the CNN are shown relative to FIG. 7.
  • step 840A collapse a Z-dimension of an output of the CNN using a max function.
  • step 840B use a detection window that is sized larger than an expected object (e.g., vehicle, pedestrian, etc.) size, to enhance detection context for object detection.
  • step 840C directly output locations for all of the objects into a map-view space that has a direct one-to-one projection onto the real-world map space.
  • the post processing can involve, for example, one or more of the following operations: (1) forming an integral image; (2) integrating over projected bounding boxes; (3) performing non-maximal suppression; and (4) projecting to real-world coordinates.
  • the non-maximal suppression can be performed on a set of probabilities that a particular object is at a particular location in order to obtain a list of car detections in an output space of the CNN.
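  • the integral-image scoring and non-maximal suppression steps can be sketched as follows; the box format, grid size, and overlap threshold are illustrative assumptions.

```python
import numpy as np

def box_scores(prob_map, boxes):
    """Score candidate boxes by integrating the CNN output over each
    projected bounding box, using an integral image for O(1) box sums."""
    ii = np.zeros((prob_map.shape[0] + 1, prob_map.shape[1] + 1))
    ii[1:, 1:] = prob_map.cumsum(0).cumsum(1)  # integral image, zero-padded
    # boxes given as (r0, c0, r1, c1) with exclusive end indices
    return [ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]
            for r0, c0, r1, c1 in boxes]

def non_max_suppression(boxes, scores, iou_thresh=0.2):
    """Greedy non-maximal suppression: keep the best-scoring box, drop
    boxes that overlap it too much, and repeat, so that only a set of
    non-overlapping "maximal" detections remains."""
    keep = []
    for i in np.argsort(scores)[::-1]:
        r0, c0, r1, c1 = boxes[i]
        suppressed = False
        for j in keep:
            R0, C0, R1, C1 = boxes[j]
            inter = (max(0, min(r1, R1) - max(r0, R0))
                     * max(0, min(c1, C1) - max(c0, C0)))
            union = (r1 - r0) * (c1 - c0) + (R1 - R0) * (C1 - C0) - inter
            if inter / union > iou_thresh:
                suppressed = True
                break
        if not suppressed:
            keep.append(i)
    return [boxes[i] for i in keep]
```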
  • step 850 can include one or more of steps 850A-850C.
  • step 850A generate an image showing positions of the objects in a top-down map-view perspective.
  • the image can be a bitmap.
  • the top-down map-view perspective can be intentionally distorted in a pre-processing stage to correct for image capture related distortions.
  • the user-perceptible object detection result can be in the form of a list of detected objects and their (real-world) positions which is displayed and/or provided through a speaker.
  • the user-perceptible object detection result can be in the form of one or more recommendations to the vehicle operator (e.g., brake now, brake hard, steer right, accelerate, etc.).
  • such recommendation can be directed to avoiding objects in a path of the motor vehicle (such as, e.g., an emergency vehicle or a non-emergency vehicle), where such objects can be inanimate or animate objects.
  • step 850C automatically perform one or more actions responsive to the detection results (e.g., responsive to the locations of detected objects). For example, automatically control one or more driving functions responsive to the detection results.
  • the present invention is integrated with and/or otherwise coupled to an Advanced Driver-Assistance System (ADAS).
  • the ADAS could apply a decision making process to, e.g., a list of object positions determined by step 850B, in order to determine whether a dangerous condition(s) exists or not (with respect to the motor vehicle) and to further determine a proper corrective action to take to avoid or at least mitigate any potential harm that can result from the dangerous condition.
  • the decision making process can be any known type of decision making process including, but not limited to, preprogrammed rules, a neural network, a decision tree, and so forth.
  • the CNN described herein could be further used for this purpose. It is to be appreciated that the preceding decision making processes are merely illustrative and, thus, other decision making processes can also be used in accordance with the teachings of the present invention, while maintaining the spirit of the present invention.
  • the control that can be imposed by step 850C can involve, for example, but is not limited to, steering, braking, and accelerating functions.
  • the processor may initiate a control signal to the braking system to apply the brakes in order to avoid hitting the object with the motor vehicle.
  • the vehicle can be automatically steered by the processor initiating a control signal to the steering system.
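  • a preprogrammed-rule decision process of the kind mentioned above can be sketched as follows; the distance thresholds, lane half-width, and action names are illustrative assumptions, not values from the disclosure.

```python
def decide_action(object_positions, lane_half_width=1.5,
                  brake_hard_m=10.0, brake_m=25.0):
    """Map a list of detected object positions to a corrective action.

    object_positions: (x, y) pairs in meters in real-world map space,
    with the subject vehicle at the origin, x lateral and y ahead.
    """
    # objects within the assumed lane width and ahead of the vehicle
    in_path = [y for x, y in object_positions
               if abs(x) <= lane_half_width and y > 0]
    if not in_path:
        return "continue"
    nearest = min(in_path)
    if nearest < brake_hard_m:
        return "brake hard"   # e.g., signal the braking system directly
    if nearest < brake_m:
        return "brake"
    return "monitor"
```

The returned action could then be presented as a recommendation to the operator or used to initiate a control signal to the braking or steering system.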
  • object detection performed directly on video as per the present invention is cheaper than systems involving several sensors such as LIDAR and RADAR.
  • approaches involving LIDAR and/or RADAR acquire depth information on a per pixel basis in the space of an input image and not in real-world x,y map space.
  • the present invention provides significantly faster detection of objects at multiple scales than prior art approaches. This is because prior art approaches using video that include CNNs are typically trained to recognize an object at a particular scale in image space, usually such that the object fills the frame of the input window to the CNN. Hence, it is common practice in the prior art to scale the input image to many different scales and scan the CNN over each scale of the input image, thus having to perform object detection at each of the scales, resulting in N output maps that represent the object detections per pixel at each of the N scales (noting that these are scales, and not real world coordinates as per the present invention).
  • the present invention advantageously uses direct training and runtime output of real world positions for objects (not positions in image space as per the prior art).
  • the present invention does not require human data labelling; video and radar labels are acquired simply by "driving around".
  • the present invention can use a dataset that is a large natural driving dataset that includes many real-life scenarios.
  • identification, by the present invention, of distant cars in an image is enhanced because the detection window is much larger than the car. That is, the window takes in more context than simply the expected car (a small black dot on the road ahead in the distance is likely to be a car). As a further example, a small black dot in the middle of the input window is likely to be a car if it is surrounded by road features that would place the dot in the middle of a lane.
  • Embodiments described herein may be entirely hardware, entirely software or including both hardware and software elements.
  • the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
  • Embodiments may include a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system.
  • a computer-usable or computer readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device.
  • the medium can be magnetic, optical, electronic, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium.
  • the medium may include a computer-readable medium such as a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk, etc.
  • such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C).
  • This may be extended, as readily apparent by one of ordinary skill in this and related arts, for as many items listed.
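As a minimal illustration of the rule-based flavor of decision making process described above, the following sketch applies preprogrammed rules to a list of object positions in real-world x,y map space to select a corrective action. The function name, thresholds, and coordinate convention (metres, vehicle at the origin, y pointing forward) are assumptions for illustration, not the patent's implementation.

```python
# Hypothetical ADAS decision step: given object positions in real-world
# map space, decide whether to brake, steer, or do nothing.

def decide_action(objects, lane_half_width=1.5, brake_distance=30.0):
    """Return 'brake', 'steer', or 'none' for a list of (x, y) positions.

    objects: iterable of (x, y) in metres; vehicle at origin, y forward.
    """
    for x, y in objects:
        # An object is "in path" if it lies ahead within the lane width.
        in_path = abs(x) <= lane_half_width and y > 0
        if in_path and y <= brake_distance:
            return "brake"   # object directly ahead and close: apply brakes
        if in_path:
            return "steer"   # object ahead but still distant: steer around it
    return "none"            # no object in the desired path

print(decide_action([(0.5, 20.0)]))   # object in lane, 20 m ahead
print(decide_action([(5.0, 20.0)]))   # object well outside the lane
```

A production ADAS would of course combine such rules with the neural-network or decision-tree processes the text also mentions, and gate any control signal on vehicle dynamics.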
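The single-pass, real-world output described above (one output map in map space, rather than N per-scale maps in image space) can be sketched as follows: the localization CNN is assumed to emit one score grid whose rows are forward-distance bins and whose columns are lateral bins, so object positions fall out of one thresholding pass. The grid layout, cell size, and threshold are illustrative assumptions.

```python
# Hypothetical decoding of a CNN output grid defined over real-world
# x,y map space into object positions in metres.

def grid_to_positions(grid, cell_m=1.0, x_offset_m=-10.0, threshold=0.5):
    """Convert a 2-D score grid into a list of (x, y) positions in metres.

    grid: rows index forward distance, columns index lateral offset.
    """
    positions = []
    for row, scores in enumerate(grid):
        for col, score in enumerate(scores):
            if score >= threshold:
                x = x_offset_m + col * cell_m   # lateral position
                y = row * cell_m                # forward distance
                positions.append((x, y))
    return positions

# 40 m ahead x 21 m across; one strong detection 20 m ahead, centred.
grid = [[0.0] * 21 for _ in range(40)]
grid[20][10] = 0.9
print(grid_to_positions(grid))
```

Contrast this with the prior-art pipeline, which would rescale the image N times and run the CNN over each scale before any positions could be inferred.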
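The label-free training data collection described above (image/radar pairs acquired by "driving around") amounts to aligning each captured video frame with the radar detections nearest to it in time. The sketch below is an assumed data layout, with timestamps in seconds and a nearest-timestamp match within a tolerance; it is not the patent's recording pipeline.

```python
# Hypothetical pairing of video frames with radar detections by
# timestamp, producing (frame_index, detections) training pairs
# without any human labelling.

def pair_frames_with_radar(frame_times, radar, tolerance=0.05):
    """radar: list of (timestamp, detections). Returns (frame_idx, detections)
    pairs for frames with a radar sweep within `tolerance` seconds."""
    pairs = []
    for i, t in enumerate(frame_times):
        best = min(radar, key=lambda r: abs(r[0] - t))  # nearest sweep in time
        if abs(best[0] - t) <= tolerance:
            pairs.append((i, best[1]))
    return pairs

frames = [0.00, 0.04, 0.08]                      # ~25 fps camera
radar = [(0.01, ["car@12m"]), (0.05, ["car@11m"]), (0.30, ["car@9m"])]
print(pair_frames_with_radar(frames, radar))
```

Each resulting pair supplies the CNN with an input image and a real-world position target, which is what allows training without manual annotation.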


Abstract

A computer-implemented system and method are provided. The system includes an image capture device (510) configured to capture image data relating to an ambient environment of a user. The system further includes a processor (511) configured to detect and localize objects, in real-world map space, from the image data using a trainable object localization convolutional neural network (CNN). The CNN is trained to detect and localize the objects from image/radar pairs that include the image data and radar data for different scenes of a natural environment. The processor (511) is further configured to perform a user-perceptible action responsive to a detection and localization of an object in an intended path of the user.
PCT/US2017/049327 2016-09-19 2017-08-30 Video to radar conversion WO2018052714A2 (fr)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US201662396280P 2016-09-19 2016-09-19
US62/396,280 2016-09-19
US15/689,755 US10330787B2 (en) 2016-09-19 2017-08-29 Advanced driver-assistance system
US15/689,755 2017-08-29
US15/689,656 US10495753B2 (en) 2016-09-19 2017-08-29 Video to radar
US15/689,656 2017-08-29

Publications (2)

Publication Number Publication Date
WO2018052714A2 true WO2018052714A2 (fr) 2018-03-22
WO2018052714A3 WO2018052714A3 (fr) 2019-05-09

Family

ID=61619724

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2017/049327 WO2018052714A2 (fr) 2017-08-30 Video to radar conversion

Country Status (1)

Country Link
WO (1) WO2018052714A2 (fr)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109720275A (zh) * 2018-12-29 2019-05-07 重庆集诚汽车电子有限责任公司 Neural-network-based multi-sensor fusion vehicle environment perception system
WO2020072193A1 (fr) * 2018-10-04 2020-04-09 Waymo Llc Machine learning using object localization
EP3624002A3 (fr) * 2018-09-12 2020-06-10 Samsung Electronics Co., Ltd. Method for generating training data for image processing, image processing method, and associated devices
CN111458721A (zh) * 2020-03-31 2020-07-28 江苏集萃华科智能装备科技有限公司 Method, apparatus, and system for identifying and locating exposed garbage
EP4102403A1 (fr) * 2021-06-11 2022-12-14 Zenseact AB Perception system development platform for an automated driving system
US11706546B2 (en) * 2021-06-01 2023-07-18 Sony Semiconductor Solutions Corporation Image sensor with integrated single object class detection deep neural network (DNN)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9460354B2 (en) * 2012-11-09 2016-10-04 Analog Devices Global Object detection
US20150339589A1 (en) * 2014-05-21 2015-11-26 Brain Corporation Apparatus and methods for training robots utilizing gaze-based saliency maps
JP2016006626A (ja) * 2014-05-28 2016-01-14 株式会社デンソーアイティーラボラトリ Detection device, detection program, detection method, vehicle, parameter calculation device, parameter calculation program, and parameter calculation method
US9978013B2 (en) * 2014-07-16 2018-05-22 Deep Learning Analytics, LLC Systems and methods for recognizing objects in radar imagery
US9286524B1 (en) * 2015-04-15 2016-03-15 Toyota Motor Engineering & Manufacturing North America, Inc. Multi-task deep convolutional neural networks for efficient and robust traffic lane detection

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3624002A3 (fr) * 2018-09-12 2020-06-10 Samsung Electronics Co., Ltd. Method for generating training data for image processing, image processing method, and associated devices
US11670087B2 (en) 2018-09-12 2023-06-06 Samsung Electronics Co., Ltd. Training data generating method for image processing, image processing method, and devices thereof
WO2020072193A1 (fr) * 2018-10-04 2020-04-09 Waymo Llc Machine learning using object localization
US11105924B2 (en) 2018-10-04 2021-08-31 Waymo Llc Object localization using machine learning
CN109720275A (zh) * 2018-12-29 2019-05-07 重庆集诚汽车电子有限责任公司 Neural-network-based multi-sensor fusion vehicle environment perception system
CN111458721A (zh) * 2020-03-31 2020-07-28 江苏集萃华科智能装备科技有限公司 Method, apparatus, and system for identifying and locating exposed garbage
US11706546B2 (en) * 2021-06-01 2023-07-18 Sony Semiconductor Solutions Corporation Image sensor with integrated single object class detection deep neural network (DNN)
EP4102403A1 (fr) * 2021-06-11 2022-12-14 Zenseact AB Perception system development platform for an automated driving system

Also Published As

Publication number Publication date
WO2018052714A3 (fr) 2019-05-09

Similar Documents

Publication Publication Date Title
US10495753B2 (en) Video to radar
WO2018052714A2 (fr) Video to radar conversion
US11488392B2 (en) Vehicle system and method for detecting objects and object distance
US11010622B2 (en) Infrastructure-free NLoS obstacle detection for autonomous cars
CN106952303B (zh) Vehicle distance detection method, device, and system
US11527077B2 (en) Advanced driver assist system, method of calibrating the same, and method of detecting object in the same
CN110796692A (zh) End-to-end deep generative model for simultaneous localization and mapping
KR20190026116A (ko) Object recognition method and apparatus
CN111660934A (zh) Positioning system and method
JP7147420B2 (ja) Object detection device, object detection method, and computer program for object detection
US11042999B2 (en) Advanced driver assist systems and methods of detecting objects in the same
US11030723B2 (en) Image processing apparatus, image processing method, and program
JP2023126642A (ja) Information processing device, information processing method, and information processing system
EP4054913A1 (fr) Prediction of probabilities of dangerous lane changes by surrounding agents
KR102669061B1 (ko) Artificial-intelligence-based system for detecting collision risk between a vehicle and an object
KR20140074219A (ko) Object localization using vertical symmetry
JP2020046882A (ja) Information processing device, vehicle control device, and moving body control method
US20220057992A1 (en) Information processing system, information processing method, computer program product, and vehicle control system
US20200125111A1 (en) Moving body control apparatus
US20230343228A1 (en) Information processing apparatus, information processing system, and information processing method, and program
Jheng et al. A symmetry-based forward vehicle detection and collision warning system on Android smartphone
CN116710971A (zh) Object recognition method and time-of-flight object recognition circuit
Memon et al. Self-driving car using lidar sensing and image processing
CN113780050A (zh) Advanced driver assistance system and method of detecting objects therein
KR20220021125A (ko) Artificial-intelligence-based collision recognition method and apparatus

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17851305

Country of ref document: EP

Kind code of ref document: A2