WO2024015019A1 - Driver attention modelling system - Google Patents

Driver attention modelling system

Info

Publication number
WO2024015019A1
Authority
WO
WIPO (PCT)
Prior art keywords: map, image, attention, driver, node
Prior art date
Application number
PCT/SG2023/050491
Other languages
English (en)
Inventor
Chen LYU
Zhongxu HU
Original Assignee
Nanyang Technological University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanyang Technological University
Publication of WO2024015019A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/28 Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/809 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V 20/59 Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V 20/597 Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations

Definitions

  • the present invention relates, in general terms, to a system for estimating a driver attention map, and a method implemented by such a system.
  • the human attention mechanism is an essential and indispensable capability allowing humans to allocate limited cognitive processing resources to simultaneously deal with voluminous visual information.
  • the human attention mechanism allows a human driver to quickly identify important visual cues or potential risks and make corresponding decisions in highly dynamic and complex driving events.
  • driver attention models aim to replicate human driver attention behaviour during a driving situation. This is crucial in interpreting and designing a human-like driving system. This can further benefit development of human-centric advanced driver assistance systems for supporting human drivers and enhancing driving safety.
  • Some modern datasets are built around driving events - e.g. lane changes and collisions. Such datasets enable development of deep learning-based approaches for end-to-end driver attention estimation. These methods typically use a pixel-level representation of the visual stimuli, which can replace the classic handcrafted features and descriptors.
  • the human visual cognitive system is complex, relating to both low-level visual texture and high-level semantic information.
  • the present system and method build a driver attention model that can indicate where the driver would look. This avoids the need to obtain the facial features and gestures of a human driver and to augment those features and gestures with the driver's visual field.
  • the present methods leverage multi-level visual content, including low-level texture features, middle-level optical flows, and high-level semantic information, as the model input. Subsequently, a heterogeneous model is proposed to handle the multi-level input, which integrates graph and convolutional neural networks. These methods leverage object detection information in an interpretable manner.
  • the system includes one or more processors (processor(s)) and a memory.
  • the memory stores instructions that the processor(s) execute to perform the methods described herein.
  • the processor(s) receive a stream of images captured from a camera directed towards a field of view of a driver of a vehicle, and process at least one image of the stream of images using a first neural network to determine an image feature map.
  • the processor(s) also process those image(s) using an object detection model to detect objects in the image(s).
  • a graph representation of the detected objects is then defined, with each detected object being represented as a node of the graph and with connections between the nodes representing relationships between the detected objects.
  • the image feature map and the graph representation of the detected objects can then be integrated to obtain a driver attention map corresponding to the at least one image.
  • the driver attention map estimates regions in the at least one image corresponding to regions of attention of a driver's visual perception.
  • a computer implemented method for estimating a driver attention map involves receiving a stream of images captured from a camera directed towards a field of view of a driver of a vehicle and processing at least one image of the stream of images using a first neural network to determine an image feature map.
  • the image(s) are also processed using an object detection model to detect objects in the at least one image.
  • a graph representation of the detected objects is then defined, where each detected object is represented as a node of the graph and connections between the nodes represent relationships between the detected objects. The image feature map and the graph representation of the detected objects are then integrated to obtain a driver attention map corresponding to the at least one image.
  • the driver attention map estimates regions in the at least one image corresponding to regions of attention of a driver's visual perception.
  • Non-transitory computer-readable storage media storing instructions that when executed by one or more processors cause the one or more processors to perform the methods described herein.
  • a driver attention map, also referred to as a driver saliency map, is a representation of estimated regions in an image of the driver's field of view that are focused on by the driver's visual perception or visual attention.
  • the first and second neural network may comprise a convolutional neural network or a convolutional encoder network for processing the raw image data and the optical flow map respectively.
  • Semantic attention map is a representation of results of object detection superimposed on the source image that served as the basis for object detection.
  • the semantic attention map reflects the relative importance of each detected object for the estimation of the driver attention/saliency map.
  • Figure 1 illustrates a method for estimating a driver attention map, in accordance with present teachings;
  • Figure 2 illustrates multi-level visual content;
  • Figure 3 illustrates the architecture of a system for performing the method of Figure 1;
  • Figure 4 is an example of a graph representation construction of detected objects;
  • Figure 5 illustrates an example of calculating y_node^i of node i, being equal to the ratio of the sum of the values in the Region of Interest area to the sum of the entire labelled attention map;
  • Figure 6 is a system for implementing the method of Figure 1;
  • Figure 7 shows the comparison of different protocols with respect to different backbones on the BDD-A dataset;
  • Figure 8 shows the comparison of different protocols with respect to different backbones on the DADA-2000 dataset;
  • Figure 9 shows the comparison of different protocols with respect to different backbones on the DR(eye)VE dataset;
  • Figure 10 shows the percentage correlation of different values in different datasets, where the correlation indicates the relationship between semantic information and ground truth saliency.
  • a computer implemented method 100 for estimating a driver attention map.
  • the method 100 includes the steps of: receiving a stream of images (102), processing one or more images from the stream of images to determine an image feature map (104), processing the same image(s) to detect objects therein (106), and defining a graph representation of the detected objects (108). Thereafter, the image feature map and graph representation are integrated to produce a driver attention map (110).
  • the driver attention map estimates a region or regions in the processed image(s) corresponding to a region or regions of attention of a driver's visual perception.
  • the method of Figure 1 addresses existing issues and improves driver attention map estimation performance, by leveraging multi-level visual content (MVC), including low-level texture features, middle-level optical flows (in some embodiments), and high-level semantic information.
  • steps 104, 106 and 108 recognise that images in the image stream can provide low-level texture features (200) and high-level semantic information (202) as the model input.
  • middle-level optical flows (204) are also extracted. From these features, a driver attention map 208 is assembled.
  • Step 102 involves receiving a stream of images.
  • the images will typically be received from one or more image capture devices (e.g. cameras) directed towards a field of view of the driver of a vehicle.
  • the driver's eyes may be tracked and the field of view aligned with the driver's line of sight.
  • images may be captured in the general direction of a driver viewing a field of view through a windscreen.
  • Each image may be a raw RGB image, infrared image or other type of image from which features can be extracted.
  • the method estimates a driver attention map based on a driving scenario image captured in accordance with step 102.
  • MVC is used to formulate the inputs. While embodiments will be described with reference to optical flow, it will be appreciated that optical flow is not incorporated into some embodiments.
  • Step 104 involves processing at least one image of the stream of images using a first neural network to determine an image feature map.
  • in some embodiments, multiple images will be processed - e.g. for the optical flow determination discussed below. In other embodiments, a single image may be processed.
  • the image feature map (300 in Figure 3) is derived from low-level textures - these are detectable directly in the raw image. Each texture feature can be used to identify a distinct pixel from its neighbouring pixels in each image. Thus, the image feature map encodes pixel-level features.
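  • by way of a non-limiting illustration of step 104, a minimal sketch of a low-level feature encoder using a truncated, pre-trained ResNet-18 backbone follows (the evaluation below mentions ResNet18/ResNet34 backbones; the truncation point, input size and use of torchvision are illustrative assumptions, not details mandated by the present teachings):

```python
# Sketch: extract a grid-like image feature map (300) from a raw RGB frame
# with a pre-trained CNN backbone, keeping the spatial dimensions.
import torch
import torchvision

backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
# Drop the global pooling and classification head so the output stays spatial.
encoder = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 224, 224)   # one frame from the image stream (step 102)
feature_map = encoder(image)          # -> (1, 512, 7, 7) pixel-level feature map
```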
  • Step 106 operates on high-level semantics - namely, object detection.
  • Step 106 involves processing the image or images from which image feature maps were extracted, using an object detection model to detect objects in the at least one image.
  • object detection is performed on the raw image since the raw image has the most information; hence, it can be used to extract the multi-scale features.
  • an object detector is utilized rather than semantic segmentation. Semantic segmentation does not distinguish between foreground and background: objects of a particular classification or type (e.g. human, car, light) are segmented from the raw image and classified with equal weight or importance, regardless of the distance of those objects.
  • the present method uses object detection. Object detection can be interpreted as introducing prior knowledge and can concentrate on the foreground traffic- related objects and participants, including pedestrians, vehicles, bikes, traffic lights, and so on, while ignoring the background. This type of multi-level information processing agrees more closely with the human visual attention mechanism, which largely disregards distant obstacles that might, if considerably closer, be of primary importance. Object detection is also more robust for detection of the unknown objects and false objects and is faster than segmentation.
  • the detection results are embedded into the low-dimension feature vectors for reducing the learning and computing costs.
  • This enables a graph representation (304) of the detected objects to be defined per step 108, wherein each detected object is represented as a node of the graph and connections between the nodes represent relationships between the detected objects.
  • the image feature map 300 and the graph representation 304 of the detected objects are integrated to obtain a driver attention map 306 (also referred to as a driver saliency map) corresponding to the image or images processed under steps 104 and 106.
  • multi-level inputs namely the low-level features, high-level features and, where used, mid-level optical flow features, can increase the interpretability of the model. This can also ensure that the model pays attention to specific objects and areas, such as the pedestrian crossing the road in the raw image (300) in Figure 3.
  • the graph representation 304 is transformed prior to integration with the image feature map 300.
  • the graph representation 304 can be processed using a graph neural network (GNN) 308.
  • the GNN handles high-level semantic extraction by constructing the features of the region of interest (ROI).
  • the GNN 308 takes Region of Interest (ROI)-wise features (i.e. features corresponding to each ROI) and transforms the graph representation 304 into a two dimensional (2D) Gaussian map for each detected object.
  • the Gaussian map is produced from an ROI- wise probability for each detected object, the probability being the likelihood that the detected object is actually an object - i.e. something to be avoided during driving.
  • step 110 involves integrating the semantic attention map with the image feature map to obtain the driver attention feature map.
  • the result is more interpretable compared to existing approaches.
  • a ROI-wise Graph Attention Network (GAT) is generated.
  • the GAT deals with the detected objects by explicitly constructing the semantic information, rather than handling features extracted by convolutional modules for building latent space features.
  • some embodiments employ a semantic attention module to integrate the non-Euclidean output of the graph attention network with the Euclidean feature maps of the convolutional neural networks, the integrated features ultimately being decoded to generate a driver attention map.
  • the YOLOv3 object detection model is adopted as the object detector to detect various objects and produce the ROI areas.
  • R-CNN (Region-Based Convolutional Neural Networks), other YOLO versions, and other networks may be used in place of YOLOv3.
  • the detection results 400 from object detection cannot be represented in a grid-like structure, as the image can. Consequently, the GNN model is introduced. GNNs are able to dynamically handle data in an irregular domain - e.g. a domain that cannot be represented with a predetermined structure.
  • the different objects in one image lend themselves to representation in a graph structure 402 where each object represents a node, h_i being the node for the i-th detected object.
  • each node of the graph comprises features relating to the corresponding detected object, the features comprising at least location coordinates of the detected object, a class index of the detected object, and a prediction confidence probability.
  • each node is defined as

    $h_i = (b_x^i,\; b_y^i,\; b_w^i,\; b_h^i,\; cls_i,\; conf_i)$  (1)

    where $(b_x, b_y)$ represent the central position of the object, $b_w, b_h$ denote the width and height of the detected bounding box, $cls_i$ indicates the class index of the prediction, and $conf_i$ is the corresponding prediction confidence.
  • the features representing each node are normalized separately.
  • each object in the detection results 400 is bounded by a bounding box, the bounding box being converted to the respective node in the graph structure 402.
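  • as a non-limiting sketch, detections may be converted into the node feature matrix of equation (1) as follows (the per-feature normalisation scheme shown is an assumption; the present teachings only state that the features are normalized separately):

```python
import torch

def detections_to_nodes(detections, img_w, img_h, n_classes=80):
    # detections: iterable of (bx, by, bw, bh, cls_idx, conf), boxes in pixels.
    rows = []
    for bx, by, bw, bh, cls_idx, conf in detections:
        rows.append([bx / img_w, by / img_h,     # normalised box centre
                     bw / img_w, bh / img_h,     # normalised box size
                     cls_idx / (n_classes - 1),  # normalised class index
                     conf])                      # confidence already in [0, 1]
    return torch.tensor(rows, dtype=torch.float32)  # (N, 6): one row per node h_i

nodes = detections_to_nodes([(320, 180, 60, 120, 0, 0.92)], 640, 360)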
  • each node h_i is connected to all other nodes and to itself.
  • Such a structure allows the network to learn the relationship rather than memorizing a predefined one.
  • a graph attention network (GAT) 308, 402 is utilized to deal with an arbitrarily structured graph. This follows the self-attention mechanism for computing the representations of each node.
  • the graph representation of the detected objects is a fully connected graph structure (e.g. each node has bidirectional edges with the other nodes as well as a self-connecting edge), and the representations of each node are computed as follows:

    $e_{ij} = f(\mathbf{W} h_i, \mathbf{W} h_j), \quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in N_i} \exp(e_{ik})}, \quad h_i' = \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in N_i} \alpha_{ij}^{k} \mathbf{W}^{k} h_j\Big)$

    where $e_{ij}$ represents an attention coefficient regarding the neighbour nodes, $\mathbf{W}$ denotes a shared and learnable weight matrix, $h_i$ corresponds to node $i$, $h_j$ corresponds to node $j$, $f(\cdot)$ is a mapping function, $N_i$ denotes a neighbourhood of node $i$, $K$ is the number of multi-head attention heads, and $\Vert$ indicates the concatenation operation.
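  • for illustration, a minimal single-head graph attention layer over such a fully connected graph might be sketched as follows (a sketch of the standard GAT computation above, not the exact network of the present teachings; the feature dimensions are assumptions):

```python
import torch
import torch.nn.functional as F

class TinyGATLayer(torch.nn.Module):
    # Every node attends to every node, including itself, matching the
    # fully connected graph structure with self-connecting edges.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = torch.nn.Linear(in_dim, out_dim, bias=False)  # shared weight matrix W
        self.a = torch.nn.Linear(2 * out_dim, 1, bias=False)   # mapping function f

    def forward(self, h):                     # h: (N, in_dim), one row per object
        z = self.W(h)                         # (N, out_dim)
        n = z.size(0)
        zi = z.unsqueeze(1).expand(n, n, -1)  # node i replicated along dim 1
        zj = z.unsqueeze(0).expand(n, n, -1)  # node j replicated along dim 0
        e = F.leaky_relu(self.a(torch.cat([zi, zj], dim=-1))).squeeze(-1)  # e_ij
        alpha = torch.softmax(e, dim=1)       # attention over neighbour nodes j
        return F.elu(alpha @ z)               # updated node representations

out = TinyGATLayer(6, 16)(torch.randn(5, 6))  # 5 detected objects -> (5, 16)
```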
  • GAT 308, 402 adopts a node-wise construction. In this embodiment, there are two GATs, with the output of the first GAT 308 feeding into the second GAT. It can also parallelize the operation of node-neighbour pairs. Furthermore, it is appropriate for inductive learning tasks. This makes the GAT robust with respect to changes in a graph structure.
  • the ROI-wise response probability graph can be obtained.
  • the corresponding probability 312 of each node (object) is output from the GAT.
  • the next challenge is therefore determining how to integrate the irregular graph with the grid-like feature map of the raw image and, where used, the optical flow map.
  • the semantic attention module can transform the graph nodes to ROI-wise response maps which can be fused into an attention map - i.e. transform the probability graph to a semantic attention map.
  • combining the 2D Gaussian maps to obtain a semantic attention map comprises computing

    $g_i(x, y) = p_i \exp\!\left(-\left(\frac{(x - b_x^i)^2}{\epsilon\,(b_w^i)^2} + \frac{(y - b_y^i)^2}{\epsilon\,(b_h^i)^2}\right)\right)$

    where $p_i$ represents a final response probability of node $i$, $(b_x^i, b_y^i, b_w^i, b_h^i)$ are the features of the bounding box of the detected object, and $\epsilon$ is a hyper-parameter to regulate the synthetic Gaussian map $g_i$.
  • the Gaussian maps 310 of all nodes are combined to generate the final semantic attention map G (314).
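  • a sketch of this map synthesis, under the Gaussian form reconstructed above, might be (the pixel-wise maximum used to combine the per-node maps is an assumption; a sum is an equally plausible fusion):

```python
import torch

def semantic_attention_map(boxes, probs, size, eps=0.5):
    # boxes: (N, 4) of (b_x, b_y, b_w, b_h) in pixels; probs: (N,) response
    # probabilities p_i from the GAT; size: (H, W) of the map to be induced.
    H, W = size
    ys = torch.arange(H, dtype=torch.float32).view(H, 1)
    xs = torch.arange(W, dtype=torch.float32).view(1, W)
    G = torch.zeros(H, W)
    for (bx, by, bw, bh), p in zip(boxes, probs):
        g = p * torch.exp(-((xs - bx) ** 2 / (eps * bw ** 2)
                            + (ys - by) ** 2 / (eps * bh ** 2)))
        G = torch.maximum(G, g)  # keep the strongest per-pixel response
    return G

G = semantic_attention_map(torch.tensor([[3.0, 3.0, 2.0, 2.0]]),
                           torch.tensor([0.9]), size=(7, 7))
```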
  • the semantic attention map reflects the importance of each object while aligning with its position and area.
  • the semantic attention map 314 can introduce high-level semantic information to the model and induce the model to pay attention to specific objects.
  • the size of map G is the same as the feature map that it needs to induce.
  • the irregular graph 308 is transformed into a fixed attention map 314.
  • the fixed attention map 314 can be used to element-wise multiply the extracted feature maps similar to the convolutional spatial attention mechanism.
  • integrating the semantic attention map with the image feature map comprises performing an element-wise multiplication operation (318) to obtain a multiplication result, and decoding the multiplication result using a deconvolution model 320 to obtain the driver attention feature map 322.
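  • a minimal sketch of this integration step follows (the feature shapes, channel counts and the single deconvolutional block are illustrative assumptions):

```python
import torch

feat = torch.randn(1, 512, 7, 7)     # fused image/optical-flow feature map
G = torch.rand(7, 7)                 # semantic attention map 314, resized to the grid
attended = feat * G                  # element-wise multiplication (318), broadcast over channels

deconv_block = torch.nn.Sequential(  # one block of the deconvolution model 320
    torch.nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1),
    torch.nn.ReLU(),
)
decoded = deconv_block(attended)     # -> (1, 256, 14, 14), towards attention map 322
```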
  • the semantic attention module 316 is thus critical for constructing the heterogeneous model that integrates GNN and CNN.
  • optical flow 324 is derived from the raw image, though the two are encoded separately.
  • the encoded features are integrated during the decoding processing discussed below, where the semantic attention module bridges the non-grid output of the GAT module, namely the semantic attention map 314, and feature maps extracted by the CNN model comprising the convolutional encoders for the optical flow 324 and raw image 300.
  • the optical flow is a dense optical flow calculated to obtain the motion information of a given scene.
  • Various mechanisms can be used for optical flow determination, including the Farneback algorithm, gradient based estimation (assuming pixel intensities move from one frame to the next), and Iterative Optical Flow Estimation algorithms.
  • the Farneback algorithm will be used.
  • the optical flow is divided into a predetermined number of bins, for example 9 bins, according to the optical vector of each pixel to generate a coarse optical flow map and reduce data noise.
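  • a sketch of this computation using OpenCV's Farneback implementation follows (the Farneback parameters are common defaults, not values prescribed by the present teachings, and binning by flow direction is one plausible reading of "according to the optical vector of each pixel"):

```python
import cv2
import numpy as np

def coarse_flow_map(prev_gray, curr_gray, n_bins=9):
    # Dense Farneback optical flow between two consecutive grayscale frames,
    # quantised into n_bins direction bins to reduce data noise.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    angle = np.arctan2(flow[..., 1], flow[..., 0])         # per-pixel direction
    bins = np.floor((angle + np.pi) / (2 * np.pi) * n_bins)
    return np.clip(bins, 0, n_bins - 1).astype(np.uint8)   # coarse flow map
```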
  • the method 100 can involve processing, using a second neural network 326, a subset of the stream of images comprising the image or images processed by steps 104 and 106.
  • the subset of images enables pixels and/or features to be tracked between images.
  • the second neural network 326 determines an optical flow map (i.e. 324) that indicates a movement of image features over the subset of the stream of images.
  • where the image feature map 300, graph representation 308 and an optical flow map are used, the image feature map is integrated with the optical flow map and the graph representation of the detected objects to obtain the driver attention map. Integration may occur in any order - e.g. feature map with optical flow map followed by graph representation, or optical flow map with graph representation followed by feature map, or any other sequential or simultaneous integration.
  • the first neural network 328 is a pre-trained CNN-based backbone.
  • the decoder 320 is a deconvolution model that mirrors the structure of the first neural network, and an output of at least one layer of the first neural network 328 is provided as a residual input to a corresponding mirrored layer of the deconvolution model 320.
  • the decoder 320 has multiple deconvolutional blocks, where each block upscales the feature maps and ensures that they have the same shape as the corresponding feature maps of the encoder.
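  • for illustration, one mirrored decoder block with a residual input from the encoder might be sketched as follows (the shapes and channel counts are assumptions):

```python
import torch

enc_feat = torch.randn(1, 256, 14, 14)  # output of the mirrored encoder layer
dec_in = torch.randn(1, 512, 7, 7)      # input to this decoder block
up = torch.nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1)
dec_feat = up(dec_in) + enc_feat        # upscale, then add the residual input
```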
  • the semantic attention map is positioned after the first deconvolutional block by default. It may alternatively be placed after a different deconvolutional block. The earlier the position, the more involved the decoding process becomes and, thus, the stronger the non-linearity and the better the performance.
  • Another CNN model 326 is utilized to obtain the features of the optical flow map 324. These raw features are fused with the corresponding raw image features.
  • a GNN model 304 is introduced. Furthermore, the relationship between the objects conforms better to a graph format, as discussed above. Through the GNN model 304, the corresponding probability of each object can be generated.
  • the semantic attention module maps the corresponding probability 312 and position information of each object into a 2D Gaussian map 310. Different ROI-wise Gaussian maps 310 are combined to induce the model decoding. Finally, the deconvolutional model 320 is leveraged to decode the multi-level visual content to estimate the driver attention map.
  • integration of the image feature map and the optical flow map comprises an element-wise addition operation.
  • a heterogeneous model is proposed to handle the MVC that includes different types of formats, such as the semantic attention module to integrate the non-Euclidean output of GNN with the Euclidean feature maps extracted by the convolutional neural network (CNN).
  • the proposed heterogeneous architecture includes two CNN-based encoders, one GAT model with a semantic attention module and a deconvolutional-based decoder. There are two outputs: the output of the decoder, the final driver attention map, and the output of the GAT module, the response probability graph.
  • the corresponding loss function is defined as follows:

    $\mathcal{L} = \lambda_1 \mathcal{L}_{kl}(y, y_{sal}) + \lambda_2 \mathcal{L}_{cc}(y, y_{sal}) + \lambda_3 \mathcal{L}_{node}, \quad \mathcal{L}_{cc} = \frac{u(y, y_{sal})}{\sqrt{u(y, y)\, u(y_{sal}, y_{sal})}}$

    where $y$ represents the estimated driver attention map and $y_{sal}$ is the corresponding labelled attention map, $u(\cdot)$ is used to calculate the covariance, $\mathcal{L}_{kl}$ and $\mathcal{L}_{cc}$ denote the loss functions of Kullback-Leibler divergence (KL-Div) and correlation coefficient (CC), respectively, and $\lambda_{1,2,3}$ are used to balance the different loss functions.
  • $y_{node}^i$ is the ground truth of node $i$, which is

    $y_{node}^i = \frac{\sum_{(x, y) \in ROI_i} y_{sal}(x, y)}{\sum_{(x, y)} y_{sal}(x, y)}$

    where $ROI_i$ represents the region that belongs to the bounding box of node $i$, as shown in Figure 5. It leverages the ratio of the sum of values in the ROI area to the sum of the entire labelled attention map to obtain the response probability of the node.
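  • a sketch of the composite objective under these definitions follows (the mean-squared-error form of the node term, the "1 - CC" form of the correlation term, equal lambda weights and the corner-format box are assumptions beyond what is stated above):

```python
import torch

def u(a, b):
    # Covariance helper u(.) used in the correlation-coefficient term.
    return ((a - a.mean()) * (b - b.mean())).mean()

def node_ground_truth(y_sal, box):
    # y_node^i: ratio of labelled attention inside the node's ROI to the
    # whole map; box = (x0, y0, x1, y1) pixel corners (assumed format).
    x0, y0, x1, y1 = box
    return y_sal[y0:y1, x0:x1].sum() / (y_sal.sum() + 1e-8)

def attention_loss(y, y_sal, p_node, y_node, lam=(1.0, 1.0, 1.0)):
    eps = 1e-8
    p = y / (y.sum() + eps)          # treat the maps as spatial distributions
    q = y_sal / (y_sal.sum() + eps)
    l_kl = (q * torch.log((q + eps) / (p + eps))).sum()     # KL-Div term
    cc = u(y, y_sal) / (torch.sqrt(u(y, y) * u(y_sal, y_sal)) + eps)
    l_cc = 1.0 - cc                  # maximise the correlation coefficient
    l_node = torch.nn.functional.mse_loss(p_node, y_node)   # node response term
    return lam[0] * l_kl + lam[1] * l_cc + lam[2] * l_node
```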
  • Figure 6 shows an end-user computing device or system referred to in this disclosure, which may comprise a smartphone device, a tablet device, a laptop device, etc., used by an end user to estimate a driver attention map.
  • Computing device 600 comprises one or more processing units 610 with access to one or more pre-processors 602 (if used) for pre-processing input data - e.g. selecting raw images, removing motion artefacts, and so on - and having a communication channel to a camera/external device(s) 604 for collecting input data (e.g. images). It further comprises a neural network module 606 comprising a first CNN network 608, a second CNN network 610 and a GNN module 612, corresponding to CNN 328, CNN 326 and GNN 304, respectively.
  • the external device(s) 604 may be integral with, or unitary with, system 600, or may be separate.
  • System 600 may be in communication (e.g. over network 618) with one or more server systems 616 that serve as a back-end system for an application executing on the system 600.
  • server system 616 may be a backend application server of a relevant application for CNN implementation on the system 600.
  • the server system 616 may transmit code or information to the system 600 and may receive information from system 600 obtained after pre-processing input data captured by external device(s) 604.
  • memory 620 may constitute or comprise a computer readable storage medium with instructions stored thereon for implementing method 100.
  • the present method and system were evaluated using various datasets that can also be used (in full or by splitting between training and testing and/or validation data) for training the GNN, first and second CNNs.
  • the datasets are BDD-A, DR(eye)VE, and DADA-2000.
  • the first dataset includes images of driving events involving braking;
  • the second dataset includes images captured in various weather conditions and at various times of day;
  • the third dataset includes images leading to crash situations.
  • four input protocols are evaluated - I, I+F, I+S and I+F+S - which are tested on different datasets with two backbones: ResNet18 and ResNet34; the other network architecture parameters can be found in Table I.
  • I, F, and S represent the raw image, optical flow map, and semantic information, respectively.
  • the model of I is utilized as the baseline, and I+F+S denotes the proposed MVC-Net - i.e. the present method.
  • the decoders of these protocols have the same structure.
  • in DR(eye)VE, the driver focuses on the vanishing point during routine driving, and in DADA-2000 traffic accidents disrupt the normal usage of the optical flow.
  • the correlation between the semantic information and ground truth saliency map of each sample is calculated.
  • a statistical evaluation of the result is performed as depicted in Figure 10, which shows the percentage of different correlation values.
  • the correlation between semantic information and ground truth in the DR(eye)VE dataset is found to be the lowest, followed by the DADA-2000 dataset.
  • the other factor is that the introduced semantic information allows the model estimation to cover a larger area compared to the ground truth.
  • the NSS measures whether the prediction covers all ground truth points, but the excess area reduces the value of true positive points after dividing by the sum of the entire prediction map, which results in a lower SIM value. Therefore, the MVC input is more inclined to expand the attention area, which results in higher NSS but lower SIM values. According to the comparison results, the backbones do not exert a significant effect, especially when compared to the various input protocols. This demonstrates the importance of the proposed MVC inputs.
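  • for reference, the two metrics discussed can be sketched as follows (these are the standard saliency-metric definitions; the exact implementations used in the evaluation are not specified here):

```python
import torch

def nss(pred, fixations):
    # Normalised Scanpath Saliency: mean of the standardised prediction
    # at ground-truth fixation points (fixations: binary (H, W) mask).
    z = (pred - pred.mean()) / (pred.std() + 1e-8)
    return z[fixations > 0].mean()

def sim(pred, gt):
    # Similarity: histogram intersection of the two maps as distributions.
    p = pred / (pred.sum() + 1e-8)
    q = gt / (gt.sum() + 1e-8)
    return torch.minimum(p, q).sum()
```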
  • the MVC is proposed for driver attention estimation. Rather than implicitly utilizing a semantic segmentation map as in existing works, the semantic information is utilized in an explicit and interpretable manner.
  • a novel heterogeneous network integrating GNN module with the CNN model is proposed to handle different types of visual content.
  • an innovative semantic attention module is designed to transform the non-Euclidean graph nodes into a semantic attention map, which can be further combined with the Euclidean feature maps extracted by CNN.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed herein is a system for estimating a driver attention map. Processors receive a stream of images captured from a camera directed towards a field of view of a driver of a vehicle. An image is processed using a first neural network to determine an image feature map, and using an object detection model to detect objects in the image(s). A graph representation of the detected objects is defined, the graph comprising nodes (the detected objects) and connections (relationships between detected objects). The image feature map and the graph representation are integrated to obtain a driver attention map corresponding to the image. An optical flow map may also be included in the integration. The driver attention map estimates regions of attention of a driver's visual perception.
PCT/SG2023/050491 2022-07-12 2023-07-12 Driver attention modelling system WO2024015019A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10202250434A 2022-07-12

Publications (1)

Publication Number Publication Date
WO2024015019A1 (fr)

Family

ID=89537556

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2023/050491 WO2024015019A1 (fr) Driver attention modelling system

Country Status (1)

Country Link
WO (1) WO2024015019A1 (fr)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190362183A1 (en) * 2018-05-23 2019-11-28 Idemia Identity & Security France Method for processing a stream of video images
US20200384981A1 (en) * 2019-06-10 2020-12-10 Honda Motor Co., Ltd. Methods and apparatuses for operating a self-driving vehicle
CN112001226A (zh) Unmanned-driving 3D object detection method, apparatus and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FANG JIANWU, YAN DINGXIN, QIAO JIAHUAN, XUE JIANRU, YU HONGKAI: "DADA: Driver Attention Prediction in Driving Accident Scenarios", IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, vol. 23, no. 6, 1 June 2022 (2022-06-01), Piscataway, NJ, USA , pages 4959 - 4971, XP093132326, ISSN: 1524-9050, DOI: 10.1109/TITS.2020.3044678 *
PALAZZI ANDREA; ABATI DAVIDE; CALDERARA SIMONE; SOLERA FRANCESCO; CUCCHIARA RITA: "Predicting the Driver's Focus of Attention: The DR(eye)VE Project", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 41, no. 7, 1 July 2019 (2019-07-01), USA , pages 1720 - 1733, XP011728030, ISSN: 0162-8828, DOI: 10.1109/TPAMI.2018.2845370 *

Similar Documents

Publication Publication Date Title
Cortinhal et al. Salsanext: Fast, uncertainty-aware semantic segmentation of lidar point clouds for autonomous driving
El Madawi et al. Rgb and lidar fusion based 3d semantic segmentation for autonomous driving
KR102565279B1 (ko) Object detection method, learning method for object detection, and devices therefor
CN113348422B (zh) Method and system for generating a predicted occupancy grid map
CN109753913B (zh) Computationally efficient multi-mode video semantic segmentation method
Sivaraman et al. A general active-learning framework for on-road vehicle recognition and tracking
CN112930554A (zh) Electronic device, system and method for determining a semantic grid of a vehicle environment
Chen et al. Multi-cue event information fusion for pedestrian detection with neuromorphic vision sensors
KR20160123668A (ko) Apparatus and method for recognising obstacles and parking slots to support an unmanned automatic parking function
US11966234B2 (en) System and method for monocular depth estimation from semantic information
CN112990065B (zh) Vehicle classification detection method based on an optimized YOLOv5 model
EP2680226B1 (fr) Temporally consistent superpixels
Cho et al. Semantic segmentation with low light images by modified CycleGAN-based image enhancement
Nguyen et al. Real-time vehicle detection using an effective region proposal-based depth and 3-channel pattern
CN114972763A (zh) Lidar point cloud segmentation method, apparatus, device and storage medium
CN114245912A (zh) System and method for perception error evaluation and correction by solving optimization problems under probabilistic-signal-temporal-logic-based constraints
Hu et al. A video streaming vehicle detection algorithm based on YOLOv4
Alkhorshid et al. Road detection through supervised classification
Bisht et al. Integration of hough transform and inter-frame clustering for road lane detection and tracking
CN114445479A (zh) Two-stage depth-estimation machine learning algorithm and spherical warping layer for equirectangular-projection stereo matching
US20200005467A1 (en) Image processing device, image recognition device, image processing program, and image recognition program
Al Mamun et al. Efficient lane marking detection using deep learning technique with differential and cross-entropy loss.
US20230281961A1 (en) System and method for 3d object detection using multi-resolution features recovery using panoptic segmentation information
WO2024015019A1 (fr) Driver attention modelling system
WO2018143277A1 (fr) Image feature value output device, image recognition device, image feature value output program, and image recognition program

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23840074

Country of ref document: EP

Kind code of ref document: A1