WO2021056327A1 - Systems and methods for analyzing human driving behavior - Google Patents

Systems and methods for analyzing human driving behavior

Info

Publication number
WO2021056327A1
Authority
WO
WIPO (PCT)
Prior art keywords
data set
subject
attention
determining
intention
Prior art date
Application number
PCT/CN2019/108120
Other languages
French (fr)
Inventor
Bo Jiang
Guangyu LI
Zhengping Che
Xuefeng SHI
Mengyao LIU
Jieping Ye
Yan Liu
Jian Tang
Original Assignee
Beijing Didi Infinity Technology And Development Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology And Development Co., Ltd. filed Critical Beijing Didi Infinity Technology And Development Co., Ltd.
Priority to PCT/CN2019/108120 priority Critical patent/WO2021056327A1/en
Publication of WO2021056327A1 publication Critical patent/WO2021056327A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • The present disclosure generally relates to systems and methods for analyzing subject behavior, and in particular, to systems and methods for analyzing human driving behavior by recognizing basic driving actions and identifying intentions and attentions of the driver.
  • a system for analyzing a subject behavior may include at least one storage medium including a set of instructions, and at least one processor in communication with the at least one storage medium.
  • the at least one processor may perform a method including one or more of the following operations.
  • the at least one processor may obtain a data stream acquired by one or more sensors associated with a subject during a time period of an event, wherein the data stream comprises a first data set and a second data set.
  • the at least one processor may generate one or more perception results associated with the subject based on the first data set and the second data set.
  • the at least one processor may determine one or more actions associated with the subject based on the second data set.
  • the at least one processor may determine, based on one or more trained machine learning models, one or more inference results associated with the subject based on the one or more perception results and the second data set.
  • the first data set comprises data obtained from one or more camera video frames
  • the second data set comprises data obtained from Global Positioning System (GPS) signals and Inertial Measurement Unit (IMU) signals.
  • the one or more perception results comprise one or more semantic masks associated with one or more participants and one or more surrounding objects associated with the event, a location of the subject, and a distance between the subject and nearest participants.
  • the at least one processor may classify each time step over the time period into a wheeling class based on first feature information included in the second data set.
  • the at least one processor may classify each time step over the time period into an accelerate class based on second feature information included in the second data set.
  • the at least one processor may determine the one or more actions by crossing the wheeling class and the accelerate class.
  • the first feature information is obtained from a yaw angular velocity signal included in the second data set, and wherein the second feature information is obtained from a forward accelerate signal included in the second data set.
  • the one or more actions belong to a plurality of predefined action categories, and wherein the plurality of predefined action categories comprise a left accelerate, a left cruise, a left brake, a straight accelerate, a straight cruise, a straight brake, a right accelerate, a right cruise, and a right brake.
  • the at least one processor may determine an attention mask representing an attention associated with the subject based on a first machine learning model.
  • the at least one processor may determine an intention output representing an intention associated with the subject based on a second machine learning model.
  • the at least one processor may input the perception results and the second data set into an attention proposal network (APN) , wherein the APN is pre-trained by convolutional neural networks (CNNs) and recurrent neural networks (RNNs) .
  • the at least one processor may determine the attention mask representing an attention intensity associated with the subject over participants and surrounding objects associated with the event.
  • the at least one processor may input the perception results and the second data set into an intention inference network, wherein the semantic mask included in the perception result is pre-converted to an attention weighted semantic mask, and wherein the intention inference network is pre-trained by CNNs and RNNs.
  • the at least one processor may determine the intention output by generating probabilities over a plurality of predefined intention categories.
  • the semantic mask included in the perception result is pre-converted to the attention weighted semantic mask by multiplying with the attention mask produced by the APN.
  • the at least one processor may determine an attention object category by matching the attention mask with the semantic mask at each time step.
  • the at least one processor may output a behavior representation associated with the subject for the event, wherein the behavior representation comprises the action, the intention, the attention object category, and the attention mask associated with the subject.
  • the at least one processor may receive a query input for matching a scene associated with a subject to one or more scenarios stored in a database.
  • the at least one processor may determine a structured behavior representation associated with the scene.
  • the at least one processor may retrieve one or more scenarios that are most similar to the structured behavior representation from the database by a ball tree search.
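  • As an illustration of the retrieval step above (not part of the claimed method), the following is a minimal Python sketch of a ball tree search over fixed-length structured behavior representations using scikit-learn's BallTree. The feature layout, vector length, and database contents are assumptions made only for the example.

```python
import numpy as np
from sklearn.neighbors import BallTree

# Hypothetical database: each row is a fixed-length structured behavior
# representation (e.g., encoded action, intention, and attention features).
rng = np.random.default_rng(0)
database = rng.random((100_000, 32))          # 100k stored scenarios, 32-D vectors

tree = BallTree(database, leaf_size=40, metric="euclidean")

def retrieve_similar(query_vector, k=5):
    """Return indices and distances of the k scenarios most similar to the query."""
    dist, idx = tree.query(query_vector.reshape(1, -1), k=k)
    return idx[0], dist[0]

query = rng.random(32)                        # structured representation of the query scene
indices, distances = retrieve_similar(query, k=5)
print(indices, distances)
```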
  • a method may include one or more of the following operations.
  • At least one processor may perform a method including one or more of the following operations.
  • the at least one processor may obtain a data stream acquired by one or more sensors associated with a subject during a time period of an event, wherein the data stream comprises a first data set and a second data set.
  • the at least one processor may generate one or more perception results associated with the subject based on the first data set and the second data set.
  • the at least one processor may determine one or more actions associated with the subject based on the second data set.
  • the at least one processor may determine, based on one or more trained machine learning models, one or more inference results associated with the subject based on the one or more perception results and the second data set.
  • a non-transitory computer readable medium may comprise executable instructions that cause at least one processor to effectuate a method.
  • the method may include one or more of the following operations.
  • the at least one processor may perform a method including one or more of the following operations.
  • the at least one processor may obtain a data stream acquired by one or more sensors associated with a subject during a time period of an event, wherein the data stream comprises a first data set and a second data set.
  • the at least one processor may generate one or more perception results associated with the subject based on the first data set and the second data set.
  • the at least one processor may determine one or more actions associated with the subject based on the second data set.
  • the at least one processor may determine, based on one or more trained machine learning models, one or more inference results associated with the subject based on the one or more perception results and the second data set.
  • FIG. 1 is a schematic diagram illustrating an exemplary driving behavior understanding system according to some embodiments of the present disclosure
  • FIG. 2 is a schematic diagram illustrating exemplary hardware and/or software components of an exemplary computing device according to some embodiments of the present disclosure
  • FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of an exemplary mobile device on which a terminal device may be implemented according to some embodiments of the present disclosure
  • FIG. 4 is a block diagram illustrating an exemplary processing engine according to some embodiments of the present disclosure
  • FIG. 5 is a flowchart illustrating an exemplary process for analyzing subject behavior according to some embodiments of the present disclosure
  • FIG. 6 is a flowchart illustrating an exemplary process for determining inference results associated with the subject behavior according to some embodiments of the present disclosure.
  • FIG. 7 is a flowchart illustrating an exemplary process for a behavior-based retrieval of one or more scenarios from a database according to some embodiments of the present disclosure.
  • The terms "module," "unit," and "block," as used herein, refer to logic embodied in hardware or firmware, or to a collection of software instructions.
  • a module, a unit, or a block described herein may be implemented as software and/or hardware and may be stored in any type of non-transitory computer-readable medium or another storage device.
  • a software module/unit/block may be compiled and linked into an executable program. It will be appreciated that software modules can be callable from other modules/units/blocks or from themselves, and/or may be invoked in response to detected events or interrupts.
  • Software modules/units/blocks configured for execution on computing devices may be provided on a computer-readable medium, such as a compact disc, a digital video disc, a flash drive, a magnetic disc, or any other tangible medium, or as a digital download (and can be originally stored in a compressed or installable format that needs installation, decompression, or decryption prior to execution) .
  • Such software code may be stored, partially or fully, on a storage device of the executing computing device, for execution by the computing device.
  • Software instructions may be embedded in firmware, such as an erasable programmable read-only memory (EPROM) .
  • modules/units/blocks may be included in connected logic components, such as gates and flip-flops, and/or can be comprised of programmable units, such as programmable gate arrays or processors.
  • the modules/units/blocks or computing device functionality described herein may be implemented as software modules/units/blocks, but may be represented in hardware or firmware.
  • the modules/units/blocks described herein refer to logical modules/units/blocks that may be combined with other modules/units/blocks or divided into sub-modules/sub-units/sub-blocks despite their physical organization or storage. The description may be applicable to a system, an engine, or a portion thereof.
  • the flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments of the present disclosure. It is to be expressly understood that the operations of the flowcharts may be implemented out of order; the operations may be implemented in inverted order or simultaneously. Moreover, one or more other operations may be added to the flowcharts, and one or more operations may be removed from the flowcharts.
  • Although the systems and methods disclosed in the present disclosure are described primarily regarding analyzing human driving behavior of a subject (e.g., an autonomous vehicle) in an autonomous driving system, it should be understood that this is only one exemplary embodiment.
  • the systems and methods of the present disclosure may be applied to any other kind of transportation system.
  • the systems and methods of the present disclosure may be applied to transportation systems of different environments including land, ocean, aerospace, or the like, or any combination thereof.
  • the autonomous vehicle of the transportation systems may include a taxi, a private car, a hitch, a bus, a train, a bullet train, a high-speed rail, a subway, a vessel, an aircraft, a spaceship, a hot-air balloon, or the like, or any combination thereof.
  • the application of the systems and methods of the present disclosure may include a mobile device (e.g. smart phone or pad) application, a webpage, a plug-in of a browser, a client terminal, a custom system, an internal analysis system, an artificial intelligence robot, or the like, or any combination thereof.
  • An aspect of the present disclosure relates to systems and methods for understanding human driving behavior.
  • the present disclosure provides a solution for mining human driving behavior through an integrated driving behavior understanding system (DBUS) , which supports multi-type sensor input, structured behavior analysis, as well as efficient driving scenario retrieval applicable to large-scale driving data collected from millions of drivers.
  • the systems and methods may obtain large-scale human driving data from multiple types of sensors simultaneously, such as, from an on-vehicle video camera, GPS, IMU, etc.
  • the systems and methods may determine, based on feature information of human driving behavior captured by the collected data, the driving behavior at three levels: basic driving actions, driving attention, and driving intention.
  • the structured behavior representation produced by the systems and methods can be used to build a driver profiling system to summarize the driving style of each driver with a special focus on driving safety.
  • the systems and methods may support behavior-based driving scenario search and retrieval by finding the most similar driving scenarios from millions of cases within a second. That is, given a certain traffic scenario of interest, such as an unprotected left turn with crossing pedestrians, relevant real-world driving scenarios would be retrieved automatically and then fed into an autonomous driving simulator. Accordingly, drivers’ intention and attention, as well as the interpretable causal reason behind their driving behavior, can be more accurately inferred. Moreover, the behavior-based driving scenario can be more efficiently searched and retrieved, which is essential for practical application when working with a large-scale human driving scenario dataset.
  • FIG. 1 is a schematic diagram illustrating an exemplary driving behavior understanding system (DBUS) 100, according to some embodiments of the present disclosure.
  • the DBUS 100 may be a real-time driving behavior analysis system for transportation services.
  • the DBUS 100 may include a vehicle 110, a server 120, a terminal device 130, a storage device 140, a network 150, and a positioning and navigation system 160.
  • the vehicles 110 may be operated by a driver and travel to a destination.
  • the vehicles 110 may include a plurality of vehicles 110-1, 110-2...110-n.
  • the vehicles 110 may be any type of autonomous vehicles.
  • An autonomous vehicle may be capable of sensing its environment and navigating without human maneuvering.
  • the vehicle (s) 110 may be configured to be operated by an operator occupying the vehicle, remotely controlled, and/or autonomous.
  • the vehicle (s) 110 may belong to ride-sharing platforms. It is contemplated that vehicle (s) 110 may be an electric vehicle, a fuel cell vehicle, a hybrid vehicle, a conventional internal combustion engine vehicle, etc.
  • the vehicle (s) 110 may have a body and at least one wheel.
  • the body may be any body styles, such as a sports vehicle, a coupe, a sedan, a pick-up truck, a station wagon, a sports utility vehicle (SUV) , a minivan, or a conversion van.
  • the vehicle (s) 110 may include a pair of front wheels and a pair of rear wheels. However, it is contemplated that the vehicle (s) 110 may have more or fewer wheels or equivalent structures that enable the vehicle (s) 110 to move around.
  • the vehicle (s) 110 may be configured to be all-wheel drive (AWD) , front-wheel drive (FWD) , or rear-wheel drive (RWD) .
  • the vehicle (s) 110 may be equipped with one or more sensors 112 mounted to the body of the vehicle (s) 110 via a mounting structure.
  • the mounting structure may be an electro-mechanical device installed or otherwise attached to the body of the vehicle (s) 110. In some embodiments, the mounting structure may use screws, adhesives, or another mounting mechanism.
  • the vehicle (s) 110 may be additionally equipped with the one or more sensors 112 inside or outside the body using any suitable mounting mechanisms.
  • the sensors 112 may include a Global Positioning System (GPS) device, a camera, an inertial measurement unit (IMU) sensor, or the like, or any combination thereof.
  • the camera may be configured to obtain image data via performing surveillance of an area within the scope of the camera.
  • a camera may refer to an apparatus for visual recording.
  • the camera may include a color camera, a digital video camera, a camera, a camcorder, a PC camera, a webcam, an infrared (IR) video camera, a low-light video camera, a thermal video camera, a CCTV camera, a pan-tilt-zoom (PTZ) camera, a video sensing device, or the like, or a combination thereof.
  • the image data may include a video.
  • the video may include a television, a movie, an image sequence, a computer-generated image sequence, or the like, or a combination thereof.
  • the area may be reflected in the video as a scene.
  • the scene may include one or more objects of interest.
  • the one or more objects may include a person, a vehicle, an animal, a physical subject, or the like, or a combination thereof.
  • the GPS device may refer to a device that is capable of receiving geolocation and time information from GPS satellites and then calculating the device's geographical position.
  • the IMU sensor may refer to an electronic device that measures and provides a vehicle’s specific force, angular rate, and sometimes the magnetic field surrounding the vehicle, using various inertial sensors, such as accelerometers and gyroscopes, sometimes also magnetometers.
  • the sensor 112 can provide real-time pose information of the vehicle (s) 110 as it travels, including the speeds, positions and orientations (e.g., Euler angles) of the vehicle (s) 110 at each time point.
  • the server 120 may be a single server or a server group.
  • the server group may be centralized or distributed (e.g., the server 120 may be a distributed system) .
  • the server 120 may be local or remote.
  • the server 120 may access information and/or data stored in the terminal device 130, the sensors 112, the vehicle 110, the storage device 140, and/or the positioning and navigation system 160 via the network 150.
  • the server 120 may be directly connected to the terminal device 130, the sensors 112, the vehicle 110, and/or the storage device 140 to access stored information and/or data.
  • the server 120 may be implemented on a cloud platform or an onboard computer.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
  • the server 120 may be implemented on a computing device 200 having one or more components illustrated in FIG. 2 in the present disclosure.
  • the server 120 may include a processing engine 122.
  • the processing engine 122 may process information and/or data associated with the vehicle 110 to perform one or more functions described in the present disclosure.
  • the processing engine 122 may obtain driving behavior data, each of which may be acquired by one or more sensors associated with a vehicle 110 during a time period.
  • the processing engine 122 may process the obtained driving behavior data to generate perception results associated with each vehicle 110.
  • the processing engine 122 may determine, based on feature information included in the collected behavior data, one or more basic driving actions associated with each vehicle 110.
  • the processing engine 122 may produce, using a machine learning model (e.g., a CNN model, an RNN model, etc. ) , an attention mask representing driver’s attention intensity over detected traffic participants and traffic lights associated with each vehicle 110.
  • the processing engine 122 may produce, using a machine learning model (e.g., a CNN model, an RNN model, etc. ) , an intention output representing the intention of a driver associated with each vehicle 110.
  • the processing engine 122 may obtain a driving scenario request, including a certain traffic scenario of interest.
  • the processing engine 122 may retrieve the most similar scenarios in response to the request from the dataset.
  • the processing engine 122 may include one or more processing engines (e.g., single-core processing engine (s) or multi-core processor (s) ) .
  • the processing engine 122 may include a central processing unit (CPU) , an application-specific integrated circuit (ASIC) , an application-specific instruction-set processor (ASIP) , a graphics processing unit (GPU) , a physics processing unit (PPU) , a digital signal processor (DSP) , a field-programmable gate array (FPGA) , a programmable logic device (PLD) , a controller, a microcontroller unit, a reduced instruction-set computer (RISC) , a microprocessor, or the like, or any combination thereof.
  • the server 120 may be connected to the network 150 to communicate with one or more components (e.g., the terminal device 130, the sensors 112, the vehicle 110, the storage device 140, and/or the positioning and navigation system 160) of the DBUS 100. In some embodiments, the server 120 may be directly connected to or communicate with one or more components (e.g., the terminal device 130, the sensors 112, the vehicle 110, the storage device 140, and/or the positioning and navigation system 160) of the DBUS 100. In some embodiments, the server 120 may be integrated into the vehicle 110. For example, the server 120 may be a computing device (e.g., a computer) installed in the vehicle 110.
  • the terminal devices 130 may include a mobile device 130-1, a tablet computer 130-2, a laptop computer 130-3, a built-in device in a vehicle 130-4, a smartwatch 130-5, or the like, or any combination thereof.
  • the mobile device 130-1 may include a smart home device, a wearable device, a smart mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof.
  • the smart home device may include a smart lighting device, a control device of an intelligent electrical apparatus, a smart monitoring device, a smart television, a smart video camera, an interphone, or the like, or any combination thereof.
  • the wearable device may include a smart bracelet, a smart footgear, a smart glass, a smart helmet, a smartwatch, smart clothing, a smart backpack, a smart accessory, or the like, or any combination thereof.
  • the smart mobile device may include a smartphone, a personal digital assistant (PDA) , a gaming device, a navigation device, a point of sale (POS) device, or the like, or any combination thereof.
  • the virtual reality device and/or the augmented reality device may include a virtual reality helmet, a virtual reality glass, a virtual reality patch, an augmented reality helmet, an augmented reality glass, an augmented reality patch, or the like, or any combination thereof.
  • the virtual reality device and/or the augmented reality device may include a Google TM Glass, an Oculus Rift, a HoloLens, a Gear VR, etc.
  • the built-in device in the vehicle 130-4 may include an onboard computer, an onboard television, etc.
  • the server 120 may be integrated into the terminal device 130.
  • the storage device 140 may store data and/or instructions.
  • the storage device 140 may store data obtained from the terminal device 130, the sensors 112, the vehicle 110, the positioning and navigation system 160, the processing engine 122, and/or an external storage device.
  • the storage device 140 may store driving behavior data received from the sensors 112 (e.g., a GPS device, an IMU sensor) .
  • the storage device 140 may store driving behavior data received from the sensors 112 (e.g., a camera) .
  • the storage device 140 may store driving profiles (that is, a summary of a driving scenario, which may include driver action, intention, and attention associated with each vehicle 110) generated by the processing engine 122.
  • the storage device 140 may store data received from an external storage device or a server.
  • the storage device 140 may store data and/or instructions that the server 120 may execute or use to perform exemplary methods described in the present disclosure.
  • the storage device 140 may store instructions that the processing engine 122 may execute or use to generate perception results by processing data included in one or more video frames f received from the sensors 112.
  • the storage device 140 may store instructions that the processing engine 122 may execute or use to generate, based on feature information included in data received from the sensors 112, a driving scenario associated with a specific vehicle 110.
  • the storage device 140 may include a mass storage, a removable storage, a volatile read-and-write memory, a read-only memory (ROM) , or the like, or any combination thereof.
  • Exemplary mass storage may include a magnetic disk, an optical disk, a solid-state drive, etc.
  • Exemplary removable storage may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc.
  • Exemplary volatile read-and-write memory may include a random access memory (RAM) .
  • Exemplary RAM may include a dynamic RAM (DRAM) , a double data rate synchronous dynamic RAM (DDR SDRAM) , a static RAM (SRAM) , a thyristor RAM (T-RAM) , and a zero-capacitor RAM (Z-RAM) , etc.
  • Exemplary ROM may include a mask ROM (MROM) , a programmable ROM (PROM) , an erasable programmable ROM (EPROM) , an electrically-erasable programmable ROM (EEPROM) , a compact disk ROM (CD-ROM) , and a digital versatile disk ROM, etc.
  • the storage device 140 may be implemented on a cloud platform.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
  • the storage device 140 may be connected to the network 150 to communicate with one or more components (e.g., the server 120, the terminal device 130, the sensors 112, the vehicle 110, and/or the positioning and navigation system 160) of the DBUS 100.
  • One or more components of the DBUS 100 may access the data or instructions stored in the storage device 140 via the network 150.
  • the storage device 140 may be directly connected to or communicate with one or more components (e.g., the server 120, the terminal device 130, the sensors 112, the vehicle 110, and/or the positioning and navigation system 160) of the DBUS 100.
  • the storage device 140 may be part of the server 120.
  • the storage device 140 may be integrated into the vehicle 110.
  • the network 150 may facilitate the exchange of information and/or data.
  • one or more components (e.g., the server 120, the terminal device 130, the sensors 112, the vehicle 110, the storage device 140, or the positioning and navigation system 160) of the DBUS 100 may send information and/or data to other component (s) of the DBUS 100 via the network 150.
  • the server 120 may obtain/acquire driving behavior data from the sensors 112 and/or the positioning and navigation system 160 via the network 150.
  • the server 120 may obtain/acquire driving behavior data from the sensors 112 via the network 150.
  • the server 120 may obtain/acquire a driving scenario from the storage device 140 via the network 150.
  • the network 150 may be any type of wired or wireless network, or combination thereof.
  • the network 150 may include a cable network, a wireline network, an optical fiber network, a telecommunications network, an intranet, an Internet, a local area network (LAN) , a wide area network (WAN) , a wireless local area network (WLAN) , a metropolitan area network (MAN) , a public telephone switched network (PSTN) , a Bluetooth network, a ZigBee network, a near field communication (NFC) network, or the like, or any combination thereof.
  • the network 150 may include one or more network access points.
  • the network 150 may include wired or wireless network access points (e.g., 150-1, 150-2) , through which one or more components of the DBUS 100 may be connected to the network 150 to exchange data and/or information.
  • the positioning and navigation system 160 may determine information associated with an object, for example, one or more of the terminal devices 130, the vehicle 110, etc.
  • the positioning and navigation system 160 may be a global positioning system (GPS) , a global navigation satellite system (GLONASS) , a compass navigation system (COMPASS) , a BeiDou navigation satellite system, a Galileo positioning system, a quasi-zenith satellite system (QZSS) , etc.
  • the information may include a location, an elevation, a velocity, or an acceleration of the object, or a current time.
  • the positioning and navigation system 160 may include one or more satellites, for example, a satellite 160-1, a satellite 160-2, and a satellite 160-3.
  • the satellites 160-1 through 160-3 may determine the information mentioned above independently or jointly.
  • the satellite positioning and navigation system 160 may send the information mentioned above to the network 150, the terminal device 130, or the vehicle 110 via wireless connections.
  • the DBUS 100 is merely provided for the purposes of illustration, and is not intended to limit the scope of the present disclosure.
  • the DBUS 100 may further include a database, an information source, etc.
  • the DBUS 100 may be implemented on other devices to realize similar or different functions.
  • the GPS device may also be replaced by other positioning devices, such as BeiDou.
  • FIG. 2 is a schematic diagram illustrating exemplary hardware and/or software components of an exemplary computing device according to some embodiments of the present disclosure.
  • the server 120 may be implemented on the computing device 200.
  • the processing engine 122 may be implemented on the computing device 200 and configured to perform functions of the processing engine 122 disclosed in this disclosure.
  • the computing device 200 may be used to implement any component of the DBUS 100 of the present disclosure.
  • the processing engine 122 of the DBUS 100 may be implemented on the computing device 200, via its hardware, software program, firmware, or a combination thereof.
  • the computer functions related to the DBUS 100 as described herein may be implemented in a distributed manner on a number of similar platforms to distribute the processing load.
  • the computing device 200 may include communication (COMM) ports 250 connected to and from a network (e.g., the network 150) connected thereto to facilitate data communications.
  • the computing device 200 may also include a processor (e.g., a processor 220) , in the form of one or more processors (e.g., logic circuits) , for executing program instructions.
  • the processor may include interface circuits and processing circuits therein.
  • the interface circuits may be configured to receive electronic signals from a bus 210, wherein the electronic signals encode structured data and/or instructions for the processing circuits to process.
  • the processing circuits may conduct logic calculations, and then determine a conclusion, a result, and/or an instruction encoded as electronic signals. Then the interface circuits may send out the electronic signals from the processing circuits via the bus 210.
  • the exemplary computing device 200 may further include program storage and data storage of different forms, for example, a disk 270, and a read-only memory (ROM) 230, or a random-access memory (RAM) 240, for various data files to be processed and/or transmitted by the computing device 200.
  • the exemplary computing device 200 may also include program instructions stored in the ROM 230, the RAM 240, and/or other types of non-transitory storage medium to be executed by the processor 220.
  • the methods and/or processes of the present disclosure may be implemented as the program instructions.
  • the computing device 200 also includes an I/O component 260, supporting input/output between the computing device 200 and other components therein.
  • the computing device 200 may also receive programming and data via network communications.
  • the computing device 200 in the present disclosure may also include multiple processors, and thus operations that are performed by one processor as described in the present disclosure may also be jointly or separately performed by the multiple processors.
  • for example, if the processor of the computing device 200 executes both operation A and operation B, operation A and operation B may also be performed by two different processors jointly or separately in the computing device 200 (e.g., the first processor executes operation A and the second processor executes operation B, or the first and second processors jointly execute operations A and B) .
  • FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of an exemplary mobile device on which a terminal device may be implemented according to some embodiments of the present disclosure.
  • the mobile device 300 may include a communication platform 310, a display 320, a graphic processing unit (GPU) 330, a central processing unit (CPU) 340, an I/O 350, a memory 360, and storage 390.
  • any other suitable component including but not limited to a system bus or a controller (not shown) , may also be included in the mobile device 300.
  • a mobile operating system 370 (e.g., iOS TM , Android TM , Windows Phone TM ) and one or more applications 380 may be loaded into the memory 360 from the storage 390 in order to be executed by the CPU 340.
  • the applications 380 may include a browser or any other suitable mobile apps for receiving and rendering information relating to positioning or other information from the processing engine 122.
  • User interactions with the information stream may be achieved via the I/O 350 and provided to the processing engine 122 and/or other components of the DBUS 100 via the network 150.
  • computer hardware platforms may be used as the hardware platform (s) for one or more of the elements described herein.
  • a computer with user interface elements may be used to implement a personal computer (PC) or any other type of work station or terminal device.
  • a computer may also act as a server if appropriately programmed.
  • FIG. 4 is a block diagram illustrating an exemplary processing engine according to some embodiments of the present disclosure.
  • the processing engine 122 may include an obtaining module 410, a perception results determination module 420, a driving action recognition module 430, a driving attention inference module 440, a driving intention inference module 450, and a driving attention post-processing module 460.
  • the obtaining module 410 may be configured to obtain data and/or information associated with the DBUS 100.
  • the obtaining module 410 may obtain driving behavior data from different sources, such as the in-vehicle front-view cameras, the GPS sensors, the IMU sensors, etc. during a time period.
  • the obtaining module 410 may obtain a number of training samples comprising existing driving scenarios associated with a number of historical driving profiles.
  • the obtaining module 410 may obtain a query for retrieving one or more driving scenarios stored in the database (e.g., the storage device 140) .
  • the perception results determination module 420 may be configured to determine one or more perception results associated with each vehicle 110 based on obtained driving behavior data. For example, the perception results determination module 420 may determine semantic masks of detected traffic participants and traffic lights, the distance between ego-vehicle and nearest traffic participants in the front and the vehicle’s relative location on the road based on the lane perception results. More descriptions of the determination of the perception results may be found elsewhere in the present disclosure (e.g., FIG. 5, and descriptions thereof) .
  • the driving action recognition module 430 may be configured to recognize, based on feature information included in the obtained GPS and/or IMU signals associated with each vehicle, basic driving actions of the vehicle. In some embodiments, for each driver in each vehicle, the driving action recognition module 430 may classify each time step into one of three predetermined wheeling classes. In response to a classification of the wheeling classes, the driving action recognition module 430 may subsequently classify each time step into one of three predetermined accelerate classes to produce one or more driving actions of the driver.
  • the driving attention inference module 440 may be configured to infer a driver’s attention intensity over detected traffic participants and traffic lights.
  • the driving attention inference module 440 includes an attention proposal network (APN) that is pre-trained by one or more deep learning models. More descriptions of the inference of the driving attention may be found elsewhere in the present disclosure (e.g., FIG. 6, and descriptions thereof) .
  • the driving intention inference module 450 may be configured to infer a driver’s intention based on the same input feature set as that of the driving attention inference module 440. For each driver of a vehicle, an intention inference network produces probabilities over all intention categories as the intention output for the driver. In some embodiments, the intention inference network is also pre-trained by one or more deep learning models. More descriptions of the inference of the driving intention may be found elsewhere in the present disclosure (e.g., FIG. 6, and descriptions thereof) .
  • the driving attention post-processing module 460 may be configured to determine the category of the object with the greatest attention at each time step. For example, the driving attention post-processing module 460 may find the object with the highest average attention intensity and output its category. In another example, if no detected object has an obvious attention intensity, the driving attention post-processing module 460 may set the output to a special category to indicate that no obvious attention exists in the specific frame. More descriptions of the post-processing of the driving attention may be found elsewhere in the present disclosure (e.g., FIG. 6, and descriptions thereof) .
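  • The following is a minimal sketch of the post-processing step described above, assuming the attention mask is a float array in [0, 1] and the semantic mask holds integer category IDs per pixel; the names, category IDs, and intensity threshold are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

NO_ATTENTION = -1  # special category when no obvious attention exists in a frame

def attention_object_category(attention_mask, semantic_mask, min_intensity=0.1):
    """Pick the object category with the highest average attention intensity.

    attention_mask: H x W floats in [0, 1]
    semantic_mask:  H x W integer category IDs (0 = background)
    """
    best_category, best_score = NO_ATTENTION, min_intensity
    for category in np.unique(semantic_mask):
        if category == 0:            # skip background pixels
            continue
        score = attention_mask[semantic_mask == category].mean()
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Toy frame: attention concentrated on a region labeled as category 3 (e.g., "person").
att = np.zeros((90, 160)); att[40:60, 70:100] = 0.8
sem = np.zeros((90, 160), dtype=int); sem[40:60, 70:100] = 3
print(attention_object_category(att, sem))    # -> 3
```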
  • the modules in the processing engine 122 may be connected to or communicate with each other via a wired connection or a wireless connection.
  • the wired connection may include a metal cable, an optical cable, a hybrid cable, or the like, or any combination thereof.
  • the wireless connection may include a Local Area Network (LAN) , a Wide Area Network (WAN) , a Bluetooth, a ZigBee, a Near Field Communication (NFC) , or the like, or any combination thereof.
  • Two or more of the modules may be combined into a single module, and any one of the modules may be divided into two or more units.
  • one or more modules may be omitted.
  • the driving attention post-processing module 460 may be omitted.
  • one or more modules may be combined into a single module.
  • the driving attention inference module 440 and the driving intention inference module 450 may be combined into a single inference results determination module.
  • FIG. 5 is a flowchart illustrating an exemplary process for analyzing a subject behavior according to some embodiments of the present disclosure.
  • the process 500 may be executed by DBUS 100.
  • the process 500 may be implemented as a set of instructions stored in the storage ROM 230 or RAM 240.
  • the processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 500.
  • the operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 500 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed. Additionally, the order in which the operations of the process 500 are illustrated in FIG. 5 and described below is not intended to be limiting.
  • the processing engine 122 may obtain a data stream acquired by one or more sensors (e.g., the sensors 112) associated with a subject (e.g., the driver of vehicle (s) 110) during a time period.
  • the data stream is a body of data that is associated with a subject located in a defined geographical region (e.g., a segment of a road) .
  • the one or more sensors may include a camera, GPS, an IMU etc., as described elsewhere in the present disclosure (e.g., FIG. 1, and descriptions thereof) .
  • the time period may be a duration during which the one or more sensors (e.g., a camera) fulfill a plurality of scans, such as 20 scans, 30 scans, etc.
  • the time period may be 20 seconds, 25 seconds, 30 seconds, etc.
  • each of the first data set V and the second data set S represents a vector including data recorded by the one or more sensors during a predetermined time period.
  • the data stream D can be data describing human driving behavior.
  • the first data set may include data obtained from cameras at each time step of the time horizon T.
  • the specific camera may denote any device for visual recording as described elsewhere in the present disclosure (e.g., FIG. 1 and the descriptions thereof) .
  • the specific data may include a video as described elsewhere in the present disclosure (e.g., FIG. 1 and the descriptions thereof) .
  • the video may include a plurality of image frames.
  • the camera may acquire one of the multiple samples via monitoring an area surrounding the camera. In some embodiments, the area may be part of a space in a vehicle.
  • the specific image data may be a front view camera video recording a travel scene inside and/or outside the vehicle collected by the specific camera (e.g., a driving recorder) installed inside and/or outside of the vehicle.
  • a sample may record a scene inside and/or outside of the vehicle.
  • a sample may include a video recording a driver getting on or off the vehicle.
  • a sample may include a video recording a moment of turn-on or turn-off of a lighting device (e.g., a fill light of the camera) installed in the vehicle.
  • a sample may include a video recording surrounding environment of the vehicle, e.g., other vehicles, pedestrian, traffic lights, signs.
  • a sample may include a video recording a rare scene (e.g., a child and/or a disabled person crossing the road) , or an emergency (e.g., accidents, malfunctioned traffic lights, signs, etc. )
  • the length of the video recorded by the camera can be predetermined; for example, the length of raw video may be 25 to 30 seconds.
  • the processing engine 122 may obtain the specific image data in real-time or periodically. In some embodiments, the processing engine 122 (e.g., the obtaining module 410) may obtain the specific image data from the specific camera, the terminal device 130, the storage device 140, or any other storage device.
  • the second data set denotes the GPS/IMU data of the subject (e.g., the vehicle (s) 110) , which may include geographic location information and/or IMU information of the subject at each time step of the time horizon T.
  • the geographic location information may include a geographic location of the subject (e.g., the driver of the vehicle (s) 110) at each time step.
  • the IMU information may include a pose of the subject (e.g., the driver of the vehicle (s) 110) defined by an orientation (also referred to as a moving direction) , a pitch angle, a roll angle, etc., acquired when the subject is located at the geographic location.
  • the GPS sensor may sample latitude, longitude, bearing, speed, etc., once per second (1 Hz) .
  • the IMU sensor may retrieve the 3-axis forward accelerate signal and the 3-axis yaw angular velocity signal with a predetermined retrieval frequency (e.g., 30 Hz) for further analysis.
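  • Because the GPS and IMU streams arrive at different rates (about 1 Hz and 30 Hz in the example above), they typically have to be aligned onto a common time base before further analysis. The snippet below is a minimal sketch of one way to do this with linear interpolation; the timestamps and signal values are illustrative, not taken from the disclosure.

```python
import numpy as np

# Illustrative raw streams: GPS speed at ~1 Hz, IMU yaw rate at ~30 Hz.
gps_t = np.arange(0.0, 30.0, 1.0)                # seconds
gps_speed = 10.0 + np.sin(gps_t / 5.0)           # m/s
imu_t = np.arange(0.0, 30.0, 1.0 / 30.0)
imu_yaw_rate = 0.02 * np.cos(imu_t)              # rad/s

# Resample both signals onto a common 30 Hz time base.
common_t = imu_t
speed_30hz = np.interp(common_t, gps_t, gps_speed)
yaw_rate_30hz = imu_yaw_rate                     # already on the common base

aligned = np.stack([speed_30hz, yaw_rate_30hz], axis=1)   # shape: (T, 2)
print(aligned.shape)
```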
  • the processing engine 122 may obtain the data stream from the one or more sensors (e.g., the sensor 112) associated with the subject (e.g., the vehicle (s) 110) , a storage (e.g., the storage device 140) , etc., in real-time or periodically.
  • the one or more sensors may send the data stream generated by the one or more sensors to the processing engine 122 once the one or more sensors fulfill one single scan.
  • the one or more sensors may send the data stream in every scan to the storage (e.g., the storage device 140) .
  • the processing engine 122 (e.g., the obtaining module 410) may obtain the data stream from the storage periodically, for example, after the time period.
  • the data stream may be generated by the one or more sensors when the subject is immobile.
  • the data stream may be generated when the subject is moving.
  • the processing engine 122 may obtain the data stream from a commercially available database, for example, a database of a ride-sharing company that records and stores its drivers’ behavior data through on-vehicle sensors.
  • the processing engine 122 (e.g., the obtaining module 410) may obtain the data stream from a user, an entity, or from other sources, which are not limited within this disclosure.
  • the processing engine 122 may generate one or more perception results associated with the subject based on the first data set and the second data set.
  • the processing engine 122 (e.g., the perception results determination module 420) may use pre-trained models to generate a series of perception results through techniques such as object detection, visual perception, and 3D reconstruction.
  • the perception results can be generated by techniques such as detection, tracking, and segmentation of traffic participants, traffic lights and signs in front-view driving videos, as well as distance estimation of traffic participants in driving scenarios.
  • one perception result refers to semantic masks of detected participants and surrounding objects of the event.
  • the participants and the surrounding objects of the events may include other vehicles, pedestrians, traffic lights, stop signs, and any moveable or non-moveable subjects captured by the video frames.
  • the processing engine 122 may integrate a target machine learning model for object detection, for example, a pre-trained YOLOv3 model.
  • the target machine learning model may be constructed based on a linear regression model, a support vector machine (SVM) model, a decision tree model, a boosting model, a neural network model, or the like, or any combination thereof.
  • the target machine learning model may be configured to determine whether the specific image data includes one or more objects. For example, the target machine learning model may output “1” denoting that the specific image data includes one or more objects or “0” denoting that the specific image data does not present one or more objects.
  • the target machine learning model for object detection may be configured to determine a count of one or more objects presented in the specific image data. In some embodiments, the target machine learning model for object detection may be configured to determine behaviors of the one or more objects presented in the specific image data. For example, the target machine learning model may determine whether an anomaly behavior (e.g., a tailing, a crushing) is involved between detected objects presented in the specific image data.
  • the target machine learning model may be determined by training a preliminary machine learning model using a training set.
  • the training set used for object detection may include a plurality of videos (or images) .
  • Each of at least a portion of the plurality of videos (or images) may present one or more objects, such as persons (e.g., adults, children, old people) , cars, signs, traffic lights, physical subjects, etc.
  • the processing engine 122 (e.g., the perception results determination module 420) may further process the detected objects to generate a semantic mask of the event participants and event surrounding objects through semantic segmentation. During this process, each and every pixel in the image is classified into a class. As an example, for each frame of the video image, all of its original traffic participant detections can be converted to a single-channel semantic mask, and all of its original traffic light detection results can be converted to another single-channel semantic mask. Then, these two masks can be concatenated into a 2-channel semantic mask. As another example, to reduce the computational complexity of the DBUS 100, the 2-channel semantic mask can be further resized (e.g., from its original size 1920*1080*2 to 160*90*2) as the final object mask of the image frames.
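  • The following is a minimal sketch of the mask construction just described: detections are rasterized into two single-channel masks (traffic participants and traffic lights), concatenated into a 2-channel mask, and resized from 1920*1080 to 160*90. The box coordinates and category IDs are made up for the example, and OpenCV (cv2) is assumed to be available.

```python
import numpy as np
import cv2

H, W = 1080, 1920
participant_boxes = [((300, 400, 700, 900), 1),   # ((y0, x0, y1, x1), category_id), e.g. 1 = car
                     ((500, 1200, 800, 1500), 4)] # e.g. 4 = person
light_boxes = [((100, 950, 180, 1000), 1)]        # e.g. 1 = red light

def rasterize(boxes, height, width):
    """Convert detections to a single-channel semantic mask of category IDs."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for (y0, x0, y1, x1), cid in boxes:
        mask[y0:y1, x0:x1] = cid
    return mask

participants = rasterize(participant_boxes, H, W)
lights = rasterize(light_boxes, H, W)

# Concatenate into a 2-channel mask and resize 1920*1080*2 -> 160*90*2.
mask_2ch = np.stack([participants, lights], axis=-1)
small = cv2.resize(mask_2ch, (160, 90), interpolation=cv2.INTER_NEAREST)
print(small.shape)   # (90, 160, 2)
```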
  • another perception result refers to the distance between the subject and the nearest event participants, for example, the distance between an ego-vehicle and the nearest traffic participants in its front.
  • 3D reconstruction technology can be used to rebuild the 3D coordinates of all detected traffic participants in the video V of a driving scenario. From the reconstructed 3D coordinates of the nearest front traffic participants, an estimation of the following distance d of the ego-vehicle for each frame of the video image can be determined.
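  • One common way to estimate a following distance from a monocular front-view camera, consistent with the 3D-reconstruction idea above though not necessarily the disclosed method, is the pinhole relation d ≈ f * H / h, where f is the focal length in pixels, H a known real-world object height, and h the object's height in pixels. The numbers below are illustrative assumptions.

```python
def following_distance(focal_length_px, real_height_m, bbox_height_px):
    """Pinhole-camera distance estimate: d = f * H / h."""
    return focal_length_px * real_height_m / bbox_height_px

# Illustrative values: ~1400 px focal length, 1.5 m tall car, 120 px bounding box height.
d = following_distance(focal_length_px=1400.0, real_height_m=1.5, bbox_height_px=120.0)
print(f"estimated following distance: {d:.1f} m")   # ~17.5 m
```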
  • a distance threshold corresponding to the subject may be determined based on the geographic location (s) of the subject. For example, a distance threshold may be determined based on the width of a road related to the subject.
  • the processing engine 122 may determine the road based on the geographic location (s) of the subject and obtain the width of the road from, e.g., a map.
  • the distance threshold may be inversely proportional to the width of the road. Exemplary distance thresholds may include 3 meters, 5 meters, 7 meters, etc.
  • the processing engine 122 (e.g., the perception results determination module 420) may detect a set of event participants and determine a distance between the subject and each detected event participant.
  • the processing engine 122 (e.g., the perception results determination module 420) may determine the smallest of these distances as the distance between the subject and the nearest event participants.
  • another perception result denotes a location of the subject, for example, a relative location of the vehicle on the road based on the lane perception results.
  • the processing engine 122 (e.g., the perception results determination module 420) may use a deep learning architecture, for example, the DeepLabv3 model introduced by Google TM , to detect lanes for each frame of the video image.
  • the processing engine 122 may generate a vehicle location feature for each frame of video image according to the lane detection results instead of using them directly.
  • the vehicle location feature represents the vehicle’s relative location on the road, which is selected from one of four categories: in the middle of the road, on the left side of the road, on the right side of the road, and unknown location.
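  • A minimal sketch of how such a relative-location feature could be derived from per-frame lane detections, assuming each detected lane boundary is summarized by its horizontal position at the bottom of the frame; the thresholds, labels, and sign convention are illustrative assumptions.

```python
def vehicle_location_feature(lane_xs, image_width):
    """Map lane-boundary x-positions (at the image bottom) to a location category.

    Returns one of: 'middle', 'left', 'right', 'unknown'.
    """
    if len(lane_xs) < 2:
        return "unknown"                      # not enough lane evidence
    left_xs = [x for x in lane_xs if x < image_width / 2]
    right_xs = [x for x in lane_xs if x >= image_width / 2]
    if not left_xs or not right_xs:
        return "unknown"
    lane_center = (max(left_xs) + min(right_xs)) / 2
    offset = lane_center - image_width / 2    # lane center relative to image center
    if abs(offset) < 0.05 * image_width:
        return "middle"
    # Lane center to the right of the image center -> vehicle sits toward the left.
    return "left" if offset > 0 else "right"

print(vehicle_location_feature([600, 1300], 1920))   # -> 'middle'
```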
  • the processing engine 122 may determine one or more actions associated with the subject based on the second data set.
  • actions may have simple patterns that can be inferred directly from obtained data streams.
  • basic driving actions can be inferred from GPS/IMU signals S in a rule-based manner.
  • the processing engine 122 may preprocess the signals by removing noise contained in the signals by using one or more window functions.
  • Exemplary window functions include the Hann and Hamming windows, the Blackman window, the Nuttall window, or the like, or any combination thereof.
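  • A minimal sketch of this noise-removal step using a window function, here a Hann window applied as a normalized smoothing kernel via convolution; the window length and the synthetic signal are illustrative assumptions.

```python
import numpy as np
from scipy.signal.windows import hann

def smooth(signal, window_len=15):
    """Smooth a 1-D signal by convolving it with a normalized Hann window."""
    window = hann(window_len)
    window /= window.sum()
    return np.convolve(signal, window, mode="same")

# Illustrative noisy yaw-angular-velocity signal sampled at 30 Hz.
t = np.arange(0, 10, 1 / 30)
raw = 0.1 * np.sin(0.5 * t) + 0.02 * np.random.default_rng(0).standard_normal(t.size)
clean = smooth(raw)
print(raw.std(), clean.std())   # the smoothed signal has lower variance
```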
  • the forward accelerate signal and the yaw angular velocity signal are extracted from the GPS/IMU signals, which describe the vehicle’s forward acceleration and turning rate, respectively.
  • the processing engine 122 (e.g., the driving action recognition module 430) may classify each time step into a wheeling class based on the yaw angular velocity signal and into an accelerate class based on the forward accelerate signal, and may determine a driving action m_t by crossing the two classes.
  • m_t belongs to one of nine predefined possible driving actions: left accelerate, left cruise, left brake, straight accelerate, straight cruise, straight brake, right accelerate, right cruise, and right brake.
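  • A minimal sketch of the rule-based classification just described: the yaw angular velocity is thresholded into left / straight / right, the forward acceleration into accelerate / cruise / brake, and the two labels are crossed into one of the nine actions. The threshold values and the sign convention are illustrative assumptions, not taken from the disclosure.

```python
# Illustrative thresholds (rad/s and m/s^2); the disclosure does not specify values.
YAW_THRESHOLD = 0.05
ACCEL_THRESHOLD = 0.5

def wheeling_class(yaw_rate):
    if yaw_rate > YAW_THRESHOLD:
        return "left"
    if yaw_rate < -YAW_THRESHOLD:
        return "right"
    return "straight"

def accelerate_class(forward_accel):
    if forward_accel > ACCEL_THRESHOLD:
        return "accelerate"
    if forward_accel < -ACCEL_THRESHOLD:
        return "brake"
    return "cruise"

def driving_action(yaw_rate, forward_accel):
    """Cross the wheeling class with the accelerate class -> one of 9 actions."""
    return f"{wheeling_class(yaw_rate)} {accelerate_class(forward_accel)}"

print(driving_action(0.12, -0.8))   # -> "left brake"
print(driving_action(0.0, 0.1))     # -> "straight cruise"
```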
  • the processing engine 122 may determine, based on one or more trained machine learning models, one or more inference results associated with the subject based on the one or more perception results and the second data set.
  • the inference results may include the subject’s intentions, denoted as w_t, and attentions, denoted as A. Similar to the actions m_t generated in step 520, the intentions w_t belong to one of the predefined intention categories.
  • eight types of driver’s intentions are considered: following, left turn, right turn, left lane change, right lane change, left merge, right merge, and U-turn.
  • the determined attention A includes two elements: an attention mask over the video frame v_t with pixel-wise attention intensity between 0 and 1, and an attention object category indicating which category of detected object the subject is focusing on.
  • eight predefined driver attention object categories are considered in the present disclosure: car, bus, truck, person, bicycle, motorcycle, tricycle, and traffic light.
  • the processing engine 122 may adopt one or more trained machine learning models for image classification and result prediction.
  • the trained machine learning model may be constructed based on a decision tree model, a multiclass support vector machine (SVM) model, a K-nearest neighbors classifier, a Gradient Boosting Machine (GBM) model, a neural network model, or the like, or any combination thereof.
  • the processing engine 112 (e.g., the driving attention inference module 440 and the driving intention inference module 450) may apply the one or more trained machine learning models to determine the inference results.
  • the trained machine learning model may be determined by training a machine learning model using a training set.
  • the training set may include a plurality of samples.
  • Each of the plurality of samples in the training set may include a video.
  • each of the plurality of samples in the training set may correspond to a label.
  • the label corresponding to a sample in the training set may denote a category to which the sample belongs.
  • a deep neural network based model ( “INFER” ) is implemented by the DBUS 100 to predict the intention and attention mask simultaneously.
  • the INFER consists of two components. First, an attention proposal network (APN) produces an attention mask representing the subject’s attention intensity over detected event participants and surrounding objects. Second, an intention inference network produces probabilities over all intention categories as the intention output.
  • the processing engine 112 may first train the INFER, and the training procedure may have two stages. First, the APN is pre-trained with attention labels for a decent initialization. Second, the whole inference system (the intention network together with the APN) is trained jointly in an end-to-end manner with both attention and intention labels (a hedged training-loop sketch follows below). More descriptions of the determination of the inference results may be found elsewhere in the present disclosure (e.g., FIG. 6, and descriptions thereof) .
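  • The following sketch illustrates the two-stage procedure in PyTorch-style code under stated assumptions: `apn` and `intention_net` stand for the two networks, the data loader fields (`features`, `attention_mask`, `intention`), the optimizers, learning rates, and the loss weighting are all assumptions rather than details from the disclosure.

```python
import torch
import torch.nn as nn

def train_infer(apn, intention_net, loader, epochs_stage1=5, epochs_stage2=20, alpha=0.5):
    bce = nn.BCELoss()          # pixel-wise supervision of the attention mask
    ce = nn.CrossEntropyLoss()  # supervision of the intention category

    # Stage 1: pre-train the APN alone with attention labels for initialization.
    opt1 = torch.optim.Adam(apn.parameters(), lr=1e-4)
    for _ in range(epochs_stage1):
        for batch in loader:
            mask_pred = apn(batch["features"])
            loss = bce(mask_pred, batch["attention_mask"])
            opt1.zero_grad()
            loss.backward()
            opt1.step()

    # Stage 2: train the APN and the intention network jointly, end to end,
    # with both attention and intention labels.
    opt2 = torch.optim.Adam(list(apn.parameters()) + list(intention_net.parameters()), lr=1e-4)
    for _ in range(epochs_stage2):
        for batch in loader:
            mask_pred = apn(batch["features"])
            intent_logits = intention_net(batch["features"], mask_pred)
            loss = ce(intent_logits, batch["intention"]) + alpha * bce(mask_pred, batch["attention_mask"])
            opt2.zero_grad()
            loss.backward()
            opt2.step()
```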
  • one or more other optional operations may be added elsewhere in the exemplary process 500.
  • the processing engine 122 may store the data stream, the one or more perception results, and the inference results in a storage (e.g., the storage device 140) disclosed elsewhere in the present disclosure.
  • FIG. 6 is a flowchart illustrating an exemplary process for determining inference results associated with the subject according to some embodiments of the present disclosure.
  • the process 600 may be executed by the DBUS 100.
  • the process 600 may be implemented as a set of instructions stored in the storage ROM 230 or RAM 240.
  • the processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 600.
  • the operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 600 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed. Additionally, the order in which the operations of the process 600 are illustrated in FIG. 6 and described below is not intended to be limiting.
  • the processing engine 122 may input the second data set and the perception results into an attention proposal network (APN) .
  • the INFER takes a set of features as input: the semantic masks of detected traffic participants and traffic lights, the distance between the ego-vehicle and the nearest traffic participants in the front, the vehicle’s relative location on the road, the basic driving actions, and the yaw angular velocity, forward acceleration, and vehicle speed from the GPS/IMU signals.
  • the processing engine 122 may determine the attention mask representing an attention intensity associated with the subject over participants and surrounding objects associated with the event.
  • the processing engine 112 may use more than one machine learning model to process the input received at step 610. Because the input includes both spatial data and temporal data, in some embodiments, both convolutional neural networks (CNNs) and recurrent neural networks (RNNs) may be adopted.
  • CNNs are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity.
  • a simple CNN is a sequence of layers; in some embodiments, it may include three main types of layers: convolutional layers, pooling layers, and fully-connected layers.
  • CNN architectures make the explicit assumption that the inputs are images, which allows encoding certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the number of parameters in the network.
  • each cell of the RNN model may include an input layer, a hidden layer, and an output layer.
  • the hidden layer may have one or more feedback loops. These feedback loops may provide RNNs with a type of “memory, ” in which past outputs from the hidden layer of a cell may inform future outputs from the hidden layer of another cell.
  • each feedback loop may provide an output from the hidden layer in a previous cell back to the hidden layer of the current cell as input for the current cell to inform the output of the current cell. This can enable RNNs to recurrently process sequence data (e.g., data that exists in an ordered sequence, like a route having a sequence of links) over a sequence of steps.
  • the semantic mask included in the input is fed into CNNs and their pooling layers to embed the semantic mask.
  • Each CNN first uses blocks of convolution and max-pooling layers to downsample the image and capture the semantic information included in the image, then makes a class prediction at this level of granularity, and finally uses up-sampling and deconvolution layers to resize the output to its original dimensions so that the spatial information is recovered.
  • the resolution of the generated attention mask is lower than the resolution of the semantic mask.
  • the processing engine 112 may select a predetermined resolution ratio to reach a good compromise between attention granularity and sparsity.
  • the resolution ratio can be 100, meaning that each pixel of the attention mask covers 100 pixels of the semantic mask (see the illustration below).
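  • As an illustration of that ratio (a toy computation, not the network’s actual down-sampling path), block-averaging a 720 x 1280 semantic mask over 10 x 10 patches yields a 72 x 128 attention mask, i.e., 100 semantic-mask pixels per attention-mask pixel.

```python
import numpy as np

def downsample_mask(semantic_mask, patch=10):
    # Block-average a (H, W) mask over patch x patch tiles (toy stand-in for the
    # network's down-sampling); edges that are not a multiple of patch are cropped.
    h, w = semantic_mask.shape
    return (semantic_mask[: h - h % patch, : w - w % patch]
            .reshape(h // patch, patch, w // patch, patch)
            .mean(axis=(1, 3)))

print(downsample_mask(np.zeros((720, 1280))).shape)  # (72, 128): one attention pixel per 100 semantic pixels
```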
  • the temporal information included in the input (e.g., the distance, the relative location, the basic driving actions, and the GPS/IMU signals) is fed into RNNs.
  • the input first goes through a dense layer which takes sequences as input and applies the same dense layer on every vector.
  • the processing engine 112 (e.g., the driving attention inference module 440) may apply a regularization method (e.g., Dropout) to the output of the dense layer to reduce overfitting.
  • the processing engine 112 (e.g., the driving attention inference module 440) may then feed the result into a Long Short-Term Memory (LSTM) network; a bi-LSTM network may be used instead of an LSTM, depending on the on-line or off-line mode of the DBUS 100.
  • the outputs from the CNNs and the RNNs are concatenated before being input into another set of CNNs and pooling layers. Finally, an attention mask representing the driver’s attention intensity over detected traffic participants and traffic lights is produced as the final output of the APN (a compact architecture sketch follows below).
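  • A compact sketch of that data flow is given below; the layer counts, channel sizes, number of semantic-mask channels, and the way the LSTM state is broadcast over the spatial grid are illustrative assumptions rather than the disclosed architecture.

```python
import torch
import torch.nn as nn

class APNSketch(nn.Module):
    def __init__(self, mask_channels=8, temporal_dim=16, hidden=64):
        super().__init__()
        # CNN branch: embed the per-frame semantic mask (one assumed channel per object category).
        self.encoder = nn.Sequential(
            nn.Conv2d(mask_channels, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, hidden, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # RNN branch: embed the temporal GPS/IMU-derived features.
        self.temporal = nn.LSTM(temporal_dim, hidden, batch_first=True)
        # Fuse the two embeddings and decode a low-resolution attention mask in [0, 1].
        self.decoder = nn.Sequential(
            nn.Conv2d(hidden * 2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1), nn.Sigmoid(),
        )

    def forward(self, semantic_mask, temporal_feats):
        # semantic_mask: (B, C, H, W); temporal_feats: (B, T, temporal_dim)
        spatial = self.encoder(semantic_mask)
        _, (h_n, _) = self.temporal(temporal_feats)
        # Broadcast the last LSTM hidden state over the spatial grid before fusion.
        temporal = h_n[-1][:, :, None, None].expand(-1, -1, *spatial.shape[2:])
        return self.decoder(torch.cat([spatial, temporal], dim=1))

apn = APNSketch()
mask = apn(torch.zeros(2, 8, 80, 128), torch.zeros(2, 30, 16))  # -> torch.Size([2, 1, 20, 32])
```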
  • the processing engine 122 may input the second data set and the perception results into an intention inference network.
  • the intention inference network works with the same input feature set as the APN.
  • the processing engine 122 (e.g., the driving intention inference module 450) may determine the intention output by generating probabilities over a plurality of predefined intention categories.
  • the attention-weighted semantic mask, as well as the output from the LSTM model of the APN, is input into another set of CNNs and pooling layers.
  • the output from the CNNs and pooling layers also goes through a series of processes such as concatenation, dense, dropout, LSTM, bi-LSTM, or the like, or any suitable combination thereof. The details of these processes are similar to those described in step 620 and are not repeated herein.
  • the final output of the intention inference network is the intention, represented as probabilities over all intention categories.
  • the processing engine 122 may determine an attention object category by matching the attention mask with the semantic mask at each time step.
  • the processing engine 112 (e.g., the driving attention post-processing module 460) may further determine the object with the highest average attention intensity and output its category as the attention object category a_obj. In some embodiments, if no detection has an average attention value above a predefined threshold, the processing engine 122 (e.g., the driving attention post-processing module 460) may set a_obj to a special category to indicate that no obvious attention exists in the frame (a sketch of this post-processing follows below).
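  • A sketch of this post-processing step is given below, assuming the detections have already been rasterized to boolean regions at the attention-mask resolution; the data layout, the threshold value, and the "none" label are assumptions.

```python
import numpy as np

def attention_object_category(attention_mask, detections, threshold=0.3):
    # detections: list of (category, boolean_region) pairs at the attention-mask
    # resolution; both the structure and the threshold are assumptions.
    best_cat, best_score = "none", 0.0
    for category, region in detections:
        if not region.any():
            continue
        score = attention_mask[region].mean()   # average attention inside the object's region
        if score > best_score:
            best_cat, best_score = category, score
    # Fall back to a special "no obvious attention" category below the threshold.
    return best_cat if best_score >= threshold else "none"
```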
  • the processing engine 122 may output a behavior representation associated with the subject for the event, wherein the behavior representation comprises the action, the intention, the attention object category, and the attention mask associated with the subject.
  • each attention includes two components, the attention object category and the attention mask, that are associated with the subject over the whole time horizon T.
  • the behavior representation B is considered as the final output of DBUS 100.
  • the processing engine 122 may transmit a signal to a terminal (e.g., a mobile terminal associated with the requester terminal 130) , a server associated with an online to offline platform, etc.
  • the signal may include the final output.
  • the signal may be configured to direct the terminal to display the final output to a user associated with the terminal.
  • the processing engine 112 and/or the server associated with the online to offline platform may determine an event associated with the one or more detected objects presented in the specific image data based on the final output. For example, the processing engine 112 may determine whether a detected action, or a predicted intention or attention of a driver, requires intervention by a third party. Further, if the processing engine 112 determines that a detected action, or a predicted intention or attention of the driver, requires intervention by a third party, the processing engine 112 may generate an alert, call the police, etc.
  • the processing engine 112 may store the final outputs in a database, which can be local or remote. In some implementations, the processing engine 112 may update the target machine learning model by updating the training set using the data and final outputs stored in the database.
  • one or more other optional operations may be added elsewhere in the exemplary process 600.
  • the processing engine 122 may store the data stream, the one or more perception results, the attention mask, the intention output, and/or the behavior representation in a storage (e.g., the storage device 140) disclosed elsewhere in the present disclosure.
  • one or more operations in the process 600 may be changed or omitted in some embodiments.
  • FIG. 7 is a flowchart illustrating an exemplary process for a behavior-based retrieval of one or more driving scenarios from a database based on structured behavior representation, according to some embodiments of the present disclosure.
  • the process 700 may be executed by the DBUS 100.
  • the process 700 may be implemented as a set of instructions stored in the storage ROM 230 or RAM 240.
  • the processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 700.
  • the operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 700 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed. Additionally, the order in which the operations of the process 700 are illustrated in FIG. 7 and described below is not intended to be limiting.
  • the processing engine 122 may receive a query input for matching a scene associated with a subject to one or more driving scenarios stored at a database.
  • the processing engine 122 may obtain the scene from a storage medium (e.g., the storage device 140, or the ROM 230, or the RAM 240 of the processing engine 122) and/or the terminal device 130.
  • the processing engine 122 may obtain the scene from a user in the form of a query input.
  • the processing engine 122 may determine a structured behavior representation associated with the scene. Similar to the behavior representation B described in FIG. 6, the structured behavior representation B′ determined in this step also includes actions, intentions, and attentions, where each attention includes an object category and an attention mask associated with the subject over the whole time horizon T.
  • the scene included in the query input is based on real-world data, for example, a new driving scene received at the processing engine 112 for analysis.
  • the received scene is first processed by the DBUS 100 to generate the structured behavior representation B′. More descriptions of the determination of the structured behavior representation B′ may be found elsewhere in the present disclosure (e.g., FIGS. 5 and 6, and descriptions thereof) .
  • the scene included in the query input is conceptual, that is, there is no real data associated with the scene.
  • the query input is formulated in the form of a behavior representation B directly, so the user can check whether at least one scenario stored in the database is conceptually similar to what the user is seeking.
  • the processing engine 112 may retrieve one or more scenarios that are most similar to the structured behavior representation from the database by a ball tree search.
  • the processing engine 112 takes a flattened concatenation of the intention and attention object category sequences {w_t, a_obj} included in the structured behavior representation B′ as the feature vector of the scene, producing a T × (D_b + D_a)-dimensional vector, where T is the number of timestamps, D_b is the number of different behaviors, and D_a is the number of different object categories.
  • the processing engine 112 may perform a fast nearest neighbor search in the database to retrieve the most similar scenarios given the query, where the retrieved scenarios are also in the form of {w_t, a_obj}, which is either extracted from a query video or directly created by a user.
  • the original behavior and category vectors are converted to categorical values, and a Hamming distance is used as the metric, significantly reducing the time to build the ball tree and further improving the search speed (a retrieval sketch follows below).
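  • A sketch of the retrieval step using scikit-learn’s BallTree with the Hamming metric is given below; the per-timestep integer encoding of intentions and attention-object categories, the horizon T, the toy random data, and the value of k are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import BallTree

# Toy data: each scenario is a length-T sequence of (intention id, attention-object id)
# pairs, flattened into one categorical feature vector per scenario.
T, n_scenarios = 20, 1000
rng = np.random.default_rng(0)
intentions = rng.integers(0, 8, size=(n_scenarios, T))      # 8 intention categories
attn_objects = rng.integers(0, 9, size=(n_scenarios, T))    # 8 object categories + "none"
features = np.concatenate([intentions, attn_objects], axis=1).astype(float)

tree = BallTree(features, metric="hamming")                  # Hamming distance over categorical values
query = features[:1]                                         # stand-in for the query scene's encoding
dist, idx = tree.query(query, k=5)                           # indices of the 5 most similar scenarios
```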
  • one or more other optional operations may be added elsewhere in the exemplary process 700.
  • the processing engine 122 may store the query input, the structured behavior representation, and/or the one or more retrieved scenarios in a storage (e.g., the storage device 140) disclosed elsewhere in the present disclosure.
  • aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or contexts, including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combination of software and hardware that may all generally be referred to herein as a “unit,” “module,” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
  • a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including electro-magnetic, optical, or the like, or any suitable combination thereof.
  • a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including wireless, wireline, optical fiber cable, RF, or the like, or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present disclosure may be written in a combination of one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python or the like, conventional procedural programming languages, such as the “C” programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages.
  • the program code may execute entirely on the user’s computer, partly on the user’s computer, as a stand-alone software package, partly on the user’s computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user’s computer through any type of network, including a local area network (LAN) or a wide area network (WAN) , or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) or in a cloud computing environment or offered as a service such as a Software as a Service (SaaS) .
  • the numbers expressing quantities or properties used to describe and claim certain embodiments of the application are to be understood as being modified in some instances by the term “about, ” “approximate, ” or “substantially. ”
  • “about,” “approximate,” or “substantially” may indicate a ±20% variation of the value it describes, unless otherwise stated.
  • the numerical parameters set forth in the written description and attached claims are approximations that may vary depending upon the desired properties sought to be obtained by a particular embodiment.
  • the numerical parameters should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of some embodiments of the application are approximations, the numerical values set forth in the specific examples are reported as precisely as practicable.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Traffic Control Systems (AREA)

Abstract

A system and a method for analyzing a subject behavior are disclosed. The system may obtain data stream acquired by one or more sensors associated with a subject during a time period of an event, wherein the data stream comprises a first data set and a second data set. The system may generate one or more perception results associated with the subject based on the first data set and the second data set. The system may determine one or more actions associated with the subject based on the second data set. The system may determine, based on one or more trained machine learning models, one or more inference results associated with the subject based on the one or more perception results and the second data set.

Description

SYSTEMS AND METHODS FOR ANALYZING HUMAN DRIVING BEHAVIOR
TECHNICAL FIELD
This present disclosure generally relates to systems and methods for analyzing subject behavior, and in particular, to systems and methods for analyzing human driving behavior by recognizing basic driving actions and identifying intentions and attentions of the driver.
BACKGROUND
With the rapid development of information technology, intelligent transportation systems are more and more widely used, and understanding how humans drive on the road becomes more essential for the development of these systems. For example, in an autonomous driving system, understanding a human driver’s behavior is crucial for accurately predicting surrounding vehicles’ actions, proposing human-like planning and control strategies, as well as building a realistic simulator for extensive autonomous driving virtual tests. Usually, real-world traffic is a complicated multi-agent system in which multiple participants interact with each other and with infrastructure. Moreover, each driver has her own driving style. Therefore, there is great diversity in daily driving scenarios and driving behaviors, raising the need for building a comprehensive driving behavior understanding system based on extensive human driving data. However, mining large-scale human driving data for driving behavior understanding presents significant challenges, due to the sophistication of the human intelligence system itself and the multiple types and sources of the collected human driving behavior data. Thus, it is desirable to provide systems and methods for analyzing human driving behavior with improved accuracy and efficiency.
SUMMARY
According to an aspect of the present disclosure, a system for analyzing a subject behavior is provided. The system may include at least one storage medium including a set of instructions, and at least one processor in communication with the at least one storage device. When executing the set of instructions, the at least one processor may perform a method including one or more of the following operations. The at least one processor may obtain data stream acquired by one or more sensors associated with a subject during a time period of an event, wherein the data stream comprises a first data set and a second data set. The at least one processor may generate one or more perception results associated with the subject based on the first data set and the second data set. The at least one processor may determine one or more actions associated with the subject based on the second data set. The at least one processor may determine based on one or more trained machine learning models, one or more inference results associated with the subject based on the one or more perception results and the second data set.
In some embodiments, the first data set comprises data obtained from one or more camera video frames, and wherein the second data set comprises data obtained from Global Position System (GPS) signals and Inertial Measurement Unit (IMU) signals.
In some embodiments, the one or more perception results comprise one or more semantic masks associated with one or more participants and one or more surrounding objects associated with the event, a location of the subject, and a distance between the subject and nearest participants.
In some embodiments, the at least one processor may classify each time step over the time period into a wheeling class based on first feature information included in the second data set. The at least one processor may classify each time step over the time period into an accelerate class based on second feature information included in the second data set. The at least one processor may determine the one or more actions by crossing the wheeling class and the accelerate class.
In some embodiments, the first feature information is obtained from yaw angular velocity signal included in the second data set, and wherein the second feature information is obtained from forward accelerate signal included in the second data set.
In some embodiments, the one or more actions belong to a plurality of predefined action categories, and wherein the plurality of predefined action categories comprise a left accelerate, a left cruise, a left brake, a straight accelerate, a straight cruise, a straight brake, a right accelerate, a right cruise, and a right brake.
In some embodiments, the at least one processor may determine an attention mask representing an attention associated with the subject based on a first machine learning model. The at least one processor may determine an intention output representing an intention associated with the subject based on a second machine learning model.
In some embodiments, the at least one processor may input the perception results and the second data set into an attention proposal network (APN) , wherein the APN is pre-trained by convolutional neural networks (CNNs) and recurrent neural networks (RNNs) . The at least one processor may determine the attention mask representing an attention intensity associated with the subject over participants and surrounding objects associated with the event.
In some embodiments, the at least one processor may input the perception results and the second data set into an intention inference network, wherein the semantic mask included in the perception result is pre-converted to an attention weighted semantic mask, and wherein the intention inference network is pre-trained by CNNs and RNNs. The at least one processor may determine the intention output by generating probabilities over a plurality of predefined intention categories.
In some embodiments, the semantic mask included in the perception result is pre-converted to the attention weighted semantic mask by multiplying with the attention mask produced by the APN.
In some embodiments, the at least one processor may determine an attention object category by matching the attention mask with the semantic mask at each time step. The at least one processor may output a behavior representation associated  with the subject for the event, wherein the behavior representation comprises the action, the intention, the attention object category, and the attention mask associated with the subject.
In some embodiments, the at least one processor may receive a query input for matching a scene associated with a subject to one or more scenarios stored in a database. The at least one processor may determine a structured behavior representation associated with the scene. The at least one processor may retrieve one or more scenarios that are most similar to the structured behavior representation from the database by a ball tree search.
According to another aspect of the present disclosure, a method is provided. The method may include one or more of the following operations. At least one processor may perform a method including one or more of the following operations. The at least one processor may obtain data stream acquired by one or more sensors associated with a subject during a time period of an event, wherein the data stream comprises a first data set and a second data set. The at least one processor may generate one or more perception results associated with the subject based on the first data set and the second data set. The at least one processor may determine one or more actions associated with the subject based on the second data set. The at least one processor may determine based on one or more trained machine learning models, one or more inference results associated with the subject based on the one or more perception results and the second data set.
According to another aspect of the present disclosure, a non-transitory computer readable medium is provided. The non-transitory computer readable medium may comprise executable instructions that cause at least one processor to effectuate a method. The method may include one or more of the following operations. The at least one processor may perform a method including one or more of the following operations. The at least one processor may obtain data stream acquired by one or more sensors associated with a subject during a time period of an event, wherein the data stream comprises a first data set and a second data set. The at least one processor may generate one or more perception results associated with  the subject based on the first data set and the second data set. The at least one processor may determine one or more actions associated with the subject based on the second data set. The at least one processor may determine based on one or more trained machine learning models, one or more inference results associated with the subject based on the one or more perception results and the second data set.
Additional features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The features of the present disclosure may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities, and combinations set forth in the detailed examples discussed below.
BRIEF DESCRIPTION OF THE DRAWINGS
The present disclosure is further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:
FIG. 1 is a schematic diagram illustrating an exemplary driving behavior understanding system according to some embodiments of the present disclosure;
FIG. 2 is a schematic diagram illustrating exemplary hardware and/or software components of an exemplary computing device according to some embodiments of the present disclosure;
FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of an exemplary mobile device on which a terminal device may be implemented according to some embodiments of the present disclosure;
FIG. 4 is a block diagram illustrating an exemplary processing engine according to some embodiments of the present disclosure;
FIG. 5 is a flowchart illustrating an exemplary process for analyzing subject behavior according to some embodiments of the present disclosure;
FIG. 6 is a flowchart illustrating an exemplary process for determining inference results associated with the subject behavior according to some embodiments of the present disclosure; and
FIG. 7 is a flowchart illustrating an exemplary process for a behavior-based retrieval of one or more scenarios from a database according to some embodiments of the present disclosure.
DETAILED DESCRIPTION
The following description is presented to enable any person skilled in the art to make and use the present disclosure, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present disclosure is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the claims.
The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to be limiting. As used herein, the singular forms “a, ” “an, ” and “the” may be intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprise, ” “comprises, ” and/or “comprising, ” “include, ” “includes, ” and/or “including, ” when used in this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Generally, the word “module, ” “unit, ” or “block, ” as used herein, refers to logic embodied in hardware or firmware, or to a collection of software instructions. A module, a unit, or a block described herein may be implemented as software and/or hardware and may be stored in any type of non-transitory computer-readable medium or another storage device. In some embodiments, a software module/unit/block may  be compiled and linked into an executable program. It will be appreciated that software modules can be callable from other modules/units/blocks or from themselves, and/or may be invoked in response to detected events or interrupts. Software modules/units/blocks configured for execution on computing devices may be provided on a computer-readable medium, such as a compact disc, a digital video disc, a flash drive, a magnetic disc, or any other tangible medium, or as a digital download (and can be originally stored in a compressed or installable format that needs installation, decompression, or decryption prior to execution) . Such software code may be stored, partially or fully, on a storage device of the executing computing device, for execution by the computing device. Software instructions may be embedded in firmware, such as an erasable programmable read-only memory (EPROM) . It will be further appreciated that hardware modules/units/blocks may be included in connected logic components, such as gates and flip-flops, and/or can be included of programmable units, such as programmable gate arrays or processors. The modules/units/blocks or computing device functionality described herein may be implemented as software modules/units/blocks, but may be represented in hardware or firmware. In general, the modules/units/blocks described herein refer to logical modules/units/blocks that may be combined with other modules/units/blocks or divided into sub-modules/sub-units/sub-blocks despite their physical organization or storage. The description may be applicable to a system, an engine, or a portion thereof.
It will be understood that when a unit, engine, module or block is referred to as being “on, ” “connected to, ” or “coupled to, ” another unit, engine, module, or block, it may be directly on, connected or coupled to, or communicate with the other unit, engine, module, or block, or an intervening unit, engine, module, or block may be present, unless the context clearly indicates otherwise. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
These and other features, and characteristics of the present disclosure, as well as the methods of operations and functions of the related elements of structure and the combination of parts and economies of manufacture, may become more  apparent upon consideration of the following description with reference to the accompanying drawings, all of which form part of this disclosure. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended to limit the scope of the present disclosure. It is understood that the drawings are not to scale.
The flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments of the present disclosure. It is to be expressly understood that the operations of the flowcharts may be implemented out of order. Conversely, the operations may be implemented in inverted order, or simultaneously. Moreover, one or more other operations may be added to the flowcharts. One or more operations may be removed from the flowcharts.
Moreover, while the systems and methods disclosed in the present disclosure are described primarily regarding analyzing human driving behavior in a subject (e.g., an autonomous vehicle) in an autonomous driving system, it should be understood that this is only one exemplary embodiment. The systems and methods of the present disclosure may be applied to any other kind of transportation system. For example, the systems and methods of the present disclosure may be applied to transportation systems of different environments including land, ocean, aerospace, or the like, or any combination thereof. The autonomous vehicle of the transportation systems may include a taxi, a private car, a hitch, a bus, a train, a bullet train, a high-speed rail, a subway, a vessel, an aircraft, a spaceship, a hot-air balloon, or the like, or any combination thereof. The application of the systems and methods of the present disclosure may include a mobile device (e.g. smart phone or pad) application, a webpage, a plug-in of a browser, a client terminal, a custom system, an internal analysis system, an artificial intelligence robot, or the like, or any combination thereof.
An aspect of the present disclosure relates to systems and methods for understanding human driving behavior. According to some systems and methods of the present disclosure, the present disclosure provides a solution for mining human driving behavior through an integrated driving behavior understanding system (DBUS) , which supports multi-type sensor input, structured behavior analysis, as well as efficient driving scenario retrieval applicable to large-scale driving data collected from millions of drivers. For example, the systems and methods may obtain large-scale human driving data from multiple types of sensors simultaneously, such as an on-vehicle video camera, GPS, IMU, etc. To this end, the systems and methods may determine, based on feature information of human driving behavior captured by the collected data, the driving behavior at three levels: basic driving actions, driving attention, and driving intention. The structured behavior representation produced by the systems and methods can be used to build a driver profiling system that summarizes the driving style of each driver with a special focus on driving safety.
As another example, the systems and methods may support behavior-based driving scenario search and retrieval by finding the most similar driving scenarios from millions of cases within a second. That is, given a certain traffic scenario of interest, such as an unprotected left turn with crossing pedestrians, relevant real-world driving scenarios would be retrieved automatically and then fed into an autonomous driving simulator. Accordingly, drivers’ intention and attention, as well as the interpretable causal reason behind their driving behavior, can be more accurately inferred. Moreover, behavior-based driving scenarios can be more efficiently searched and retrieved, which is essential for practical applications when working with a large-scale human driving scenario dataset.
FIG. 1 is a schematic diagram illustrating an exemplary driving behavior understanding system (DBUS) 100, according to some embodiments of the present disclosure. For example, the DBUS 100 may be a real-time driving behavior analysis system for transportation services. In some embodiments, the DBUS100 may include a vehicle 110, a server 120, a terminal device 130, a storage device 140, a network 150, and a positioning and navigation system 160.
The vehicles 110 may be operated by a driver and travel to a destination. The vehicles 110 may include a plurality of vehicles 110-1, 110-2…110-n. In some embodiments, the vehicles 110 may be any type of autonomous vehicles. An autonomous vehicle may be capable of sensing its environment and navigating without human maneuvering. In some embodiments, the vehicle (s) 110 may be configured  to be operated by an operator occupying the vehicle, remotely controlled, and/or autonomous. In some embodiments, the vehicle (s) 110 may belong to ride-sharing platforms. It is contemplated that vehicle (s) 110 may be an electric vehicle, a fuel cell vehicle, a hybrid vehicle, a conventional internal combustion engine vehicle, etc. The vehicle (s) 110 may have a body and at least one wheel. The body may be any body styles, such as a sports vehicle, a coupe, a sedan, a pick-up truck, a station wagon, a sports utility vehicle (SUV) , a minivan, or a conversion van. In some embodiments, the vehicle (s) 110 may include a pair of front wheels and a pair of rear wheels. However, it is contemplated that the vehicle (s) 110 may have more or fewer wheels or equivalent structures that enable the vehicle (s) 110 to move around. The vehicle (s) 110 may be configured to be all-wheel drive (AWD) , front-wheel drive (FWR) , or rear-wheel drive (RWD) .
As illustrated in FIG. 1, the vehicle (s) 110 may be equipped with one or more sensors 112 mounted to the body of the vehicle (s) 110 via a mounting structure. The mounting structure may be an electro-mechanical device installed or otherwise attached to the body of the vehicle (s) 110. In some embodiments, the mounting structure may use screws, adhesives, or another mounting mechanism. The vehicle (s) 110 may be additionally equipped with the one or more sensors 112 inside or outside the body using any suitable mounting mechanisms.
The sensors 112 may include a Global Position System (GPS) device, a camera, an inertial measurement unit (IMU) sensor, or the like, or any combination thereof. The camera may be configured to obtain image data via performing surveillance of an area within the scope of the camera. As used herein, a camera may refer to an apparatus for visual recording. For example, the camera may include a color camera, a digital video camera, a camera, a camcorder, a PC camera, a webcam, an infrared (IR) video camera, a low-light video camera, a thermal video camera, a CCTV camera, a pan, a tilt, a zoom (PTZ) camera, a video sensing device, or the like, or a combination thereof. The image data may include a video. The video may include a television, a movie, an image sequence, a computer-generated image sequence, or the like, or a combination thereof. The area may be reflected in  the video as a scene. In some embodiments, the scene may include one or more objects of interest. The one or more objects may include a person, a vehicle, an animal, a physical subject, or the like, or a combination thereof. The GPS device may refer to a device that is capable of receiving geolocation and time information from GPS satellites and then to calculate the device's geographical position. The IMU sensor may refer to an electronic device that measures and provides a vehicle’s specific force, angular rate, and sometimes the magnetic field surrounding the vehicle, using various inertial sensors, such as accelerometers and gyroscopes, sometimes also magnetometers. By combining the GPS device and the IMU sensor, the sensor 112 can provide real-time pose information of the vehicle (s) 110 as it travels, including the speeds, positions and orientations (e.g., Euler angles) of the vehicle (s) 110 at each time point.
In some embodiments, the server 120 may be a single server or a server group. The server group may be centralized or distributed (e.g., the server 120 may be a distributed system) . In some embodiments, the server 120 may be local or remote. For example, the server 120 may access information and/or data stored in the terminal device 130, the sensors 112, the vehicle 110, the storage device 140, and/or the positioning and navigation system 160 via the network 150. As another example, the server 120 may be directly connected to the terminal device 130, the sensors 112, the vehicle 110, and/or the storage device 140 to access stored information and/or data. In some embodiments, the server 120 may be implemented on a cloud platform or an onboard computer. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof. In some embodiments, the server 120 may be implemented on a computing device 200 having one or more components illustrated in FIG. 2 in the present disclosure.
In some embodiments, the server 120 may include a processing engine 122. The processing engine 122 may process information and/or data associated with the vehicle 110 to perform one or more functions described in the present disclosure. For example, the processing engine 122 may obtain driving behavior data, each of which  may be acquired by one or more sensors associated with a vehicle 110 during a time period. As another example, the processing engine 122 may process the obtained driving behavior data to generate perception results associated with each vehicle 110.
As still another example, the processing engine 122 may determine, based on feature information included in the collected behavior data, one or more basic driving actions associated with each vehicle 110. As still another example, the processing engine 122 may produce, using a machine learning model (e.g., a CNN model, an RNN model, etc.) , an attention mask representing a driver’s attention intensity over detected traffic participants and traffic lights associated with each vehicle 110. As still another example, the processing engine 122 may produce, using a machine learning model (e.g., a CNN model, an RNN model, etc.) , an intention output representing the intention of a driver associated with each vehicle 110. As still another example, the processing engine 122 may obtain a driving scenario request, including a certain traffic scenario of interest. As still another example, the processing engine 122 may retrieve the most similar scenarios in response to the request from the dataset.
In some embodiments, the processing engine 122 may include one or more processing engines (e.g., single-core processing engine (s) or multi-core processor (s) ) . Merely by way of example, the processing engine 122 may include a central processing unit (CPU) , an application-specific integrated circuit (ASIC) , an application-specific instruction-set processor (ASIP) , a graphics processing unit (GPU) , a physics processing unit (PPU) , a digital signal processor (DSP) , a field-programmable gate array (FPGA) , a programmable logic device (PLD) , a controller, a microcontroller unit, a reduced instruction-set computer (RISC) , a microprocessor, or the like, or any combination thereof.
In some embodiments, the server 120 may be connected to the network 150 to communicate with one or more components (e.g., the terminal device 130, the sensors 112, the vehicle 110, the storage device 140, and/or the positioning and navigation system 160) of the DBUS 100. In some embodiments, the server 120 may be directly connected to or communicate with one or more components (e.g., the terminal device 130, the sensors 112, the vehicle 110, the storage device 140, and/or  the positioning and navigation system 160) of the DBUS 100. In some embodiments, the server 120 may be integrated into the vehicle 110. For example, server 120 may be a computing device (e.g., a computer) installed in the vehicle 110.
In some embodiments, the terminal devices 130 may include a mobile device 130-1, a tablet computer 130-2, a laptop computer 130-3, a built-in device in a vehicle 130-4, a smartwatch 130-5, or the like, or any combination thereof. In some embodiments, the mobile device 130-1 may include a smart home device, a wearable device, a smart mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof. In some embodiments, the smart home device may include a smart lighting device, a control device of an intelligent electrical apparatus, a smart monitoring device, a smart television, a smart video camera, an interphone, or the like, or any combination thereof. In some embodiments, the wearable device may include a smart bracelet, a smart footgear, a smart glass, a smart helmet, a smartwatch, smart clothing, a smart backpack, a smart accessory, or the like, or any combination thereof. In some embodiments, the smart mobile device may include a smartphone, a personal digital assistant (PDA) , a gaming device, a navigation device, a point of sale (POS) device, or the like, or any combination thereof. In some embodiments, the virtual reality device and/or the augmented reality device may include a virtual reality helmet, a virtual reality glass, a virtual reality patch, an augmented reality helmet, an augmented reality glass, an augmented reality patch, or the like, or any combination thereof. For example, the virtual reality device and/or the augmented reality device may include a Google TM Glass, an Oculus Rift, a HoloLens, a Gear VR, etc. In some embodiments, the built-in device in the vehicle 130-4 may include an onboard computer, an onboard television, etc. In some embodiments, the server 120 may be integrated into the terminal device 130.
The storage device 140 may store data and/or instructions. In some embodiments, the storage device 140 may store data obtained from the terminal device 130, the sensors 112, the vehicle 110, the positioning and navigation system 160, the processing engine 122, and/or an external storage device. For example, the storage device 140 may store driving behavior data received from the sensors 112  (e.g., a GPS device, an IMU sensor) . As another example, the storage device 140 may store driving behavior data received from the sensors 112 (e.g., a camera) . As still another example, the storage device 140 may store driving profiles (that is, a summary of a driving scenario, which may include driver action, intention, and attention associated with each vehicle 110) generated by the processing engine 122. As still another example, the storage device 140 may store data received from an external storage device or a server.
In some embodiments, the storage device 140 may store data and/or instructions that the server 120 may execute or use to perform exemplary methods described in the present disclosure. For example, the storage device 140 may store instructions that the processing engine 122 may execute or use to generate perception results by processing data included in one or more video frames f received from the sensors 112. As another example, the storage device 140 may store instructions that the processing engine 122 may execute or use to generate, based on feature information included in data received from the sensors 112, a driving scenario associated with a specific vehicle 110.
In some embodiments, the storage device 140 may include a mass storage, a removable storage, a volatile read-and-write memory, a read-only memory (ROM) , or the like, or any combination thereof. Exemplary mass storage may include a magnetic disk, an optical disk, a solid-state drive, etc. Exemplary removable storage may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc. Exemplary volatile read-and-write memory may include a random access memory (RAM) . Exemplary RAM may include a dynamic RAM (DRAM) , a double date rate synchronous dynamic RAM (DDR SDRAM) , a static RAM (SRAM) , a thyristor RAM (T-RAM) , and a zero-capacitor RAM (Z-RAM) , etc. Exemplary ROM may include a mask ROM (MROM) , a programmable ROM (PROM) , an erasable programmable ROM (EPROM) , an electrically-erasable programmable ROM (EEPROM) , a compact disk ROM (CD-ROM) , and a digital versatile disk ROM, etc. In some embodiments, the storage device 140 may be implemented on a cloud platform. Merely by way of example, the cloud platform may include a private cloud,  a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
In some embodiments, the storage device 140 may be connected to the network 150 to communicate with one or more components (e.g., the server 120, the terminal device 130, the sensors 112, the vehicle 110, and/or the positioning and navigation system 160) of the DBUS 100. One or more components of the DBUS 100 may access the data or instructions stored in the storage device 140 via the network 150. In some embodiments, the storage device 140 may be directly connected to or communicate with one or more components (e.g., the server 120, the terminal device 130, the sensors 112, the vehicle 110, and/or the positioning and navigation system 160) of the DBUS 100. In some embodiments, the storage device 140 may be part of the server 120. In some embodiments, the storage device 140 may be integrated into the vehicle 110.
The network 150 may facilitate the exchange of information and/or data. In some embodiments, one or more components (e.g., the server 120, the terminal device 130, the sensors 112, the vehicle 110, the storage device 140, or the positioning and navigation system 160) of the DBUS 100 may send information and/or data to other component (s) of the DBUS 100 via the network 150. For example, the server 120 may obtain/acquire driving behavior data from the sensors 112 and/or the positioning and navigation system 160 via the network 150. As another example, the server 120 may obtain/acquire driving behavior data from the sensors 112 via the network 150. As still another example, the server 120 may obtain/acquire, a driving scenario from the storage device 140 via the network 150. In some embodiments, the network 150 may be any type of wired or wireless network, or combination thereof. Merely by way of example, the network 150 may include a cable network, a wireline network, an optical fiber network, a telecommunications network, an intranet, an Internet, a local area network (LAN) , a wide area network (WAN) , a wireless local area network (WLAN) , a metropolitan area network (MAN) , a wide area network (WAN) , a public telephone switched network (PSTN) , a Bluetooth network, a ZigBee network, a near field communication (NFC) network, or the like, or any combination thereof. In  some embodiments, the network 150 may include one or more network access points. For example, the network 150 may include wired or wireless network access points (e.g., 150-1, 150-2) , through which one or more components of the DBUS 100 may be connected to the network 150 to exchange data and/or information.
The positioning and navigation system 160 may determine information associated with an object, for example, one or more of the terminal devices 130, the vehicle 110, etc. In some embodiments, the positioning and navigation system 160 may be a global positioning system (GPS) , a global navigation satellite system (GLONASS) , a compass navigation system (COMPASS) , a BeiDou navigation satellite system, a Galileo positioning system, a quasi-zenith satellite system (QZSS) , etc. The information may include a location, an elevation, a velocity, or an acceleration of the object, or a current time. The positioning and navigation system 160 may include one or more satellites, for example, a satellite 160-1, a satellite 160-2, and a satellite 160-3. The satellites 160-1 through 160-3 may determine the information mentioned above independently or jointly. The satellite positioning and navigation system 160 may send the information mentioned above to the network 150, the terminal device 130, or the vehicle 110 via wireless connections.
It should be noted that the DBUS 100 is merely provided for the purposes of illustration, and is not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations or modifications may be made under the teachings of the present disclosure. For example, the DBUS 100 may further include a database, an information source, etc. As another example, the DBUS100 may be implemented on other devices to realize similar or different functions. In some embodiments, the GPS device may also be replaced by other positioning devices, such as BeiDou. However, those variations and modifications do not depart from the scope of the present disclosure.
FIG. 2 is a schematic diagram illustrating exemplary hardware and/or software components of an exemplary computing device according to some embodiments of the present disclosure. In some embodiments, the server 120 may be implemented on the computing device 200. For example, the processing engine 122 may be  implemented on the computing device 200 and configured to perform functions of the processing engine 122 disclosed in this disclosure.
The computing device 200 may be used to implement any component of the DBUS 100 of the present disclosure. For example, the processing engine 122 of the DBUS100 may be implemented on the computing device 200, via its hardware, software program, firmware, or a combination thereof. Although only one such computer is shown for convenience, the computer functions related to the DBUS 100 as described herein may be implemented in a distributed manner on a number of similar platforms to distribute the processing load.
The computing device 200, for example, may include communication (COMM) ports 250 connected to and from a network (e.g., the network 150) connected thereto to facilitate data communications. The computing device 200 may also include a processor (e.g., a processor 220) , in the form of one or more processors (e.g., logic circuits) , for executing program instructions. For example, the processor may include interface circuits and processing circuits therein. The interface circuits may be configured to receive electronic signals from a bus 210, wherein the electronic signals encode structured data and/or instructions for the processing circuits to process. The processing circuits may conduct logic calculations, and then determine a conclusion, a result, and/or an instruction encoded as electronic signals. Then the interface circuits may send out the electronic signals from the processing circuits via the bus 210.
The exemplary computing device 200 may further include program storage and data storage of different forms, for example, a disk 270, and a read-only memory (ROM) 230, or a random-access memory (RAM) 240, for various data files to be processed and/or transmitted by the computing device 200. The exemplary computing device 200 may also include program instructions stored in the ROM 230, the RAM 240, and/or other types of non-transitory storage medium to be executed by the processor 220. The methods and/or processes of the present disclosure may be implemented as the program instructions. The computing device 200 also includes an I/O component 260, supporting input/output between the computing device 200  and other components therein. The computing device 200 may also receive programming and data via network communications.
Merely for illustration, only one processor is described in the computing device 200. However, it should be noted that the computing device 200 in the present disclosure may also include multiple processors, and thus operations that are performed by one processor as described in the present disclosure may also be jointly or separately performed by the multiple processors. For example, the processor of the computing device 200 may execute both operation A and operation B. As another example, operation A and operation B may also be performed by two different processors jointly or separately in the computing device 200 (e.g., the first processor executes operation A and the second processor executes operation B, or the first and second processors jointly execute operations A and B).
FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of an exemplary mobile device on which a terminal device may be implemented according to some embodiments of the present disclosure. As illustrated in FIG. 3, the mobile device 300 may include a communication platform 310, a display 320, a graphic processing unit (GPU) 330, a central processing unit (CPU) 340, an I/O 350, a memory 360, and storage 390. In some embodiments, any other suitable component, including but not limited to a system bus or a controller (not shown) , may also be included in the mobile device 300. In some embodiments, a mobile operating system 370 (e.g., iOS TM, Android TM, Windows Phone TM) and one or more applications 380 may be loaded into the memory 360 from the storage 390 in order to be executed by the CPU 340. The applications 380 may include a browser or any other suitable mobile apps for receiving and rendering information relating to positioning or other information from the processing engine 122. User interactions with the information stream may be achieved via the I/O 350 and provided to the processing engine 122 and/or other components of the DBUS 100 via the network 150.
To implement various modules, units, and their functionalities described in the present disclosure, computer hardware platforms may be used as the hardware platform (s) for one or more of the elements described herein. A computer with user  interface elements may be used to implement a personal computer (PC) or any other type of work station or terminal device. A computer may also act as a server if appropriately programmed.
FIG. 4 is a block diagram illustrating an exemplary processing engine according to some embodiments of the present disclosure. The processing engine 122 may include an obtaining module 410, a perception results determination module 420, a driving action recognition module 430, a driving attention inference module 440, a driving intention inference module 450, and a driving attention post-processing module 460.
The obtaining module 410 may be configured to obtain data and/or information associated with the DBUS 100. For example, the obtaining module 410 may obtain driving behavior data from different sources, such as the in-vehicle front-view cameras, the GPS sensors, the IMU sensors, etc., during a time period. As another example, the obtaining module 410 may obtain a number of training samples comprising existing driving scenarios associated with a number of historical driving profiles. As still another example, the obtaining module 410 may obtain a query for retrieving one or more driving scenarios stored in the database (e.g., the storage device 140).
The perception results determination module 420 may be configured to determine one or more perception results associated with each vehicle 110 based on obtained driving behavior data. For example, the perception results determination module 420 may determine semantic masks of detected traffic participants and traffic lights, the distance between ego-vehicle and nearest traffic participants in the front and the vehicle’s relative location on the road based on the lane perception results. More descriptions of the determination of the perception results may be found elsewhere in the present disclosure (e.g., FIG. 5, and descriptions thereof) .
The driving action recognition module 430 may be configured to recognize, based on feature information included in the obtained GPS and/or IMU signals associated with each vehicle, basic driving actions of the vehicle. In some embodiments, for each driver in each vehicle, the driving action recognition module 430 may classify each time step into one of three predetermined wheeling classes. In response to a classification of the wheeling classes, the driving action recognition module 430 may subsequently classify each time step into one of three predetermined accelerate classes to produce one or more driving actions of the driver.
More descriptions of the recognition of the basic driving actions may be found elsewhere in the present disclosure (e.g., FIG. 5, and descriptions thereof) .
The driving attention inference module 440 may be configured to infer a driver’s attention intensity over detected traffic participants and traffic lights. In some embodiments, the driving attention inference module 440 includes an attention proposal network (APN) that is pre-trained by one or more deep learning models. More descriptions of the inference of the driving attention may be found elsewhere in the present disclosure (e.g., FIG. 6, and descriptions thereof) .
The driving intention inference module 450 may be configured to infer a driver's intention based on the same input feature set as that of the driving attention inference module 440. For each driver of a vehicle, an intention inference network produces probabilities over all intention categories as the intention output for the driver. In some embodiments, the intention inference network is also pre-trained by one or more deep learning models. More descriptions of the inference of the driving intention may be found elsewhere in the present disclosure (e.g., FIG. 6, and descriptions thereof).
The driving attention post-processing module 460 may be configured to determine the category of the object with the greatest attention at each time step. For example, the driving attention post-processing module 460 may find the object with the highest average attention intensity and output its category. As another example, if no object has an average attention intensity above a predefined threshold, the driving attention post-processing module 460 may set the category to a special category to indicate that no obvious attention exists in a specific frame. More descriptions of the post-processing of the driving attention may be found elsewhere in the present disclosure (e.g., FIG. 6, and descriptions thereof).
The modules in the processing engine 122 may be connected to or communicate with each other via a wired connection or a wireless connection. The wired connection may include a metal cable, an optical cable, a hybrid cable, or the like, or any combination thereof. The wireless connection may include a Local Area  Network (LAN) , a Wide Area Network (WAN) , a Bluetooth, a ZigBee, a Near Field Communication (NFC) , or the like, or any combination thereof. Two or more of the modules may be combined into a single module, and any one of the modules may be divided into two or more units. In some embodiments, one or more modules may be omitted. For example, the driving attention post-processing module 460 may be omitted. In some embodiments, one or more modules may be combined into a single module. For example, the driving attention inference module 440 and the driving intention inference module 450 may be combined into a single inference results determination module.
FIG. 5 is a flowchart illustrating an exemplary process for analyzing a subject behavior according to some embodiments of the present disclosure. The process 500 may be executed by the DBUS 100. For example, the process 500 may be implemented as a set of instructions stored in the storage ROM 230 or RAM 240. The processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 500. The operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 500 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed. Additionally, the order of the operations of the process 500 as illustrated in FIG. 5 and described below is not intended to be limiting.
In 510, the processing engine 122 (e.g., the obtaining module 410) may obtain a data stream acquired by one or more sensors (e.g., the sensors 112) associated with a subject (e.g., the driver of the vehicle(s) 110) during a time period. In some embodiments, the data stream is a body of data that is associated with a subject located in a defined geographical region (e.g., a segment of a road). The one or more sensors may include a camera, a GPS, an IMU, etc., as described elsewhere in the present disclosure (e.g., FIG. 1, and descriptions thereof). In some embodiments, the time period may be a duration during which the one or more sensors (e.g., a camera) complete a plurality of scans, such as 20 scans, 30 scans, etc. For example, the time period may be 20 seconds, 25 seconds, 30 seconds, etc. The data stream can be denoted as D= (V, S) , which may include a first data set, denoted as V, and a second data set, denoted as S. As described herein, each of the first data set V and the second data set S represents a vector including data recorded by the one or more sensors during a predetermined time period. As an example, the data stream D can be data describing human driving behavior.
In some embodiments, the first data set V= {v_1, v_2, …, v_T} may include data obtained from one or more cameras at each time step of the time horizon T. The camera may denote any device for visual recording as described elsewhere in the present disclosure (e.g., FIG. 1 and the descriptions thereof). In some embodiments, the specific data may include a video as described elsewhere in the present disclosure (e.g., FIG. 1 and the descriptions thereof). The video may include a plurality of image frames. In some embodiments, the camera may acquire one of multiple samples by monitoring an area surrounding the camera. In some embodiments, the area may be part of a space in a vehicle. For example, in a vehicle, the specific image data may be a front-view camera video recording a travel scene inside and/or outside the vehicle, collected by the specific camera (e.g., a driving recorder) installed inside and/or outside of the vehicle. A sample may record a scene inside and/or outside of the vehicle. For example, a sample may include a video recording a driver getting on or off the vehicle. As another example, a sample may include a video recording a moment of turn-on or turn-off of a lighting device (e.g., a fill light of the camera) installed in the vehicle. As still another example, a sample may include a video recording the surrounding environment of the vehicle, e.g., other vehicles, pedestrians, traffic lights, and signs. As still another example, a sample may include a video recording a rare scene (e.g., a child and/or a disabled person crossing the road) or an emergency (e.g., accidents, malfunctioning traffic lights or signs, etc.). The length of the video recorded by the camera can be predetermined; for example, the length of a raw video may be 25 to 30 seconds.
In some embodiments, the processing engine 122 (e.g., the obtaining module 410) may obtain the specific image data in real time or periodically. In some embodiments, the processing engine 122 (e.g., the obtaining module 410) may obtain the specific image data from the specific camera, the terminal device 130, the storage device 140, or any other storage device.
In some embodiments, the second data set S= {s_1, s_2, …, s_T} may include geographic location information (e.g., GPS signals) and/or IMU information of the subject (e.g., the vehicle(s) 110) at each time step of the time horizon T. The geographic location information may include a geographic location of the subject (e.g., the driver of the vehicle(s) 110) at each time step. In some embodiments, the geographic location of the subject (e.g., the driver of the vehicle(s) 110) may be represented by 2D or 3D coordinates in a coordinate system (e.g., a geographic coordinate system). The IMU information may include a pose of the subject (e.g., the driver of the vehicle(s) 110) defined by an orientation (also referred to as a moving direction), a pitch angle, a roll angle, etc., acquired when the subject is located at the geographic location. As an example, for each driving scenario, the GPS sensor may sample latitude, longitude, bearing, speed, etc., once per second (1 Hz). As another example, the IMU sensor may retrieve the 3-axis forward acceleration signal and the 3-axis yaw angular velocity signal with a predetermined retrieval frequency (e.g., 30 Hz) for further analysis.
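Merely by way of illustration, the following is a minimal Python sketch of how the data stream D= (V, S) may be organized for one scenario; the container name, field names, and array shapes are assumptions chosen for illustration and are not prescribed by the present disclosure (e.g., alignment of the 30 Hz IMU samples with the T time steps is assumed to be handled upstream).

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DataStream:
    """D = (V, S): one driving scenario over T time steps (illustrative only)."""
    frames: np.ndarray  # V: (T, H, W, 3) front-view camera frames
    gps: np.ndarray     # part of S: (T, 4) latitude, longitude, bearing, speed (1 Hz)
    imu: np.ndarray     # part of S: (T, 6) 3-axis forward acceleration and 3-axis yaw angular velocity

    def __post_init__(self):
        # all three fields are expected to share the same time horizon T
        assert len(self.frames) == len(self.gps) == len(self.imu)
```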
In some embodiments, the processing engine 122 (e.g., the obtaining module 410) may obtain the data stream from the one or more sensors (e.g., the sensors 112) associated with the subject (e.g., the vehicle(s) 110), a storage (e.g., the storage device 140), etc., in real time or periodically. For example, the one or more sensors may send the data stream generated by the one or more sensors to the processing engine 122 once the one or more sensors complete a single scan. As another example, the one or more sensors may send the data stream of every scan to the storage (e.g., the storage device 140). The processing engine 122 (e.g., the obtaining module 410) may obtain the data stream from the storage periodically, for example, after the time period. In some embodiments, the data stream may be generated by the one or more sensors when the subject is immobile. In some embodiments, the data stream may be generated when the subject is moving.
In some embodiments, the processing engine 122 (e.g., the obtaining module 410) may obtain the data stream from a commercially available database, for example, a database of a ride-sharing company which records and stores their hired drivers’ behavior data through on-vehicle sensors. In some embodiments, the processing engine 122 (e.g., the obtaining module 410) may obtain the data stream from a user, an entity, or from other sources, which are not limited within this disclosure.
In 520, the processing engine 122 (e.g., the perception results determination module 420) may generate one or more perception results associated with the subject based on the first data set and the second data set. In some embodiments, the processing engine 122 (e.g., the perception results determination module 420) processes the data stream D= (V, S) to generate one or more perception results, denoted as P= (O, D, L) , including a semantic mask of the event participants (denoted as "O"), the distance between the subject and the nearest event participants (denoted as "D"), and the relative location of the subject (denoted as "L"). The processing engine 122 (e.g., the perception results determination module 420) may be pre-trained and generate a series of perception results through techniques such as object detection, visual perception, and 3D reconstruction. As an example in which the event is a driving scenario and the subject is a driver, the perception results can be generated by techniques such as detection, tracking, and segmentation of traffic participants, traffic lights, and signs in front-view driving videos, as well as distance estimation of traffic participants in driving scenarios.
In some embodiments, O= {o_1, o_2, …, o_T} included in the perception results refers to semantic masks of detected participants and surrounding objects of the event. For example, when assessing a driver's driving behavior, the participants and the surrounding objects of the event may include other vehicles, pedestrians, traffic lights, stop signs, and any movable or non-movable subjects captured by the video frames.
In some embodiments, the processing engine 122 (e.g., the perception results determination module 420) may integrate a target machine learning model for object detection, for example, a pre-trained YOLOv3 model. In some embodiments, the target machine learning model may be constructed based on a linear regression model, a support vector machine (SVM) model, a decision tree model, a boosting model, a neural network model, or the like, or any combination thereof. In some embodiments, the target machine learning model may be configured to determine whether the specific image data includes one or more objects. For example, the target machine learning model may output "1" denoting that the specific image data includes one or more objects or "0" denoting that the specific image data does not include one or more objects. In some embodiments, the target machine learning model for object detection may be configured to determine a count of one or more objects presented in the specific image data. In some embodiments, the target machine learning model for object detection may be configured to determine behaviors of the one or more objects presented in the specific image data. For example, the target machine learning model may determine whether an anomalous behavior (e.g., tailgating, a collision) is involved between detected objects presented in the specific image data.
In some embodiments, the target machine learning model may be determined by training a preliminary machine learning model using a training set. The training set used for object detection may include a plurality of videos (or images) . Each of at least a portion of the plurality of videos (or images) may present one or more objects, such as persons (e.g., adults, children, old people) , cars, signs, traffic lights, physical subjects, etc.
The processing engine 122 (e.g., the perception results determination module 420) further processes the detected objects to generate a semantic mask of the event participants and event surrounding objects through semantic segmentation. During this process, each and every pixel in the image is classified into a class. As an example, for each frame of the video, all of its original traffic participant detections can be converted to a single-channel semantic mask, and all of its original traffic light detection results can be converted to another single-channel semantic mask. Then, these two masks can be concatenated into a 2-channel semantic mask. As another example, to reduce the computational complexity of the DBUS 100, the 2-channel semantic mask can be further resized (e.g., from its original size 1920*1080*2 to 160*90*2) as the final object representation of the image frames.
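Merely by way of illustration, the following is a minimal Python sketch of the 2-channel semantic mask construction and resizing described above; the function names, the use of axis-aligned bounding boxes, and the nearest-neighbor resizing are assumptions for illustration rather than the disclosed implementation.

```python
import numpy as np

def detections_to_mask(boxes, class_ids, h=1080, w=1920):
    """Rasterize detections (x1, y1, x2, y2) into a single-channel semantic mask."""
    mask = np.zeros((h, w), dtype=np.uint8)
    for (x1, y1, x2, y2), cls in zip(boxes, class_ids):
        mask[int(y1):int(y2), int(x1):int(x2)] = cls  # pixel value = class id
    return mask

def resize_nearest(mask, out_h=90, out_w=160):
    """Nearest-neighbor downsampling, e.g., 1080x1920 -> 90x160."""
    h, w = mask.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return mask[rows][:, cols]

def frame_semantic_mask(participant_dets, light_dets):
    """Concatenate the participant and traffic-light masks into a 2-channel mask."""
    participants = resize_nearest(detections_to_mask(*participant_dets))
    lights = resize_nearest(detections_to_mask(*light_dets))
    return np.stack([participants, lights], axis=-1)  # shape (90, 160, 2)
```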
In some embodiments, D= {d_1, d_2, …, d_T} included in the perception results refers to the distance between the subject and the nearest event participants, for example, the distance between an ego-vehicle and the nearest traffic participants in its front. As an example, 3D reconstruction technology can be used to rebuild the 3D coordinates of all detected traffic participants in the video V of a driving scenario. From the reconstructed 3D coordinates of the nearest front traffic participants, an estimation of the following distance d_t of the ego-vehicle can be determined for each frame of the video.
A distance threshold corresponding to the subject may be determined based on the geographic location (s) of the subject. For example, a distance threshold may be determined based on the width of a road related to the subject. The processing engine 122 may determine the road based on the geographic location (s) of the subject and obtain the width of the road from, e.g., a map. The distance threshold may be inversely proportional to the width of the road. Exemplary distance thresholds may include 3 meters, 5 meters, 7 meters, etc.
In some embodiments, the processing engine 122 (e.g., the perception results determination module 420) may detect a set of event participants. The processing engine 122 (e.g., the perception results determination module 420) may determine a distance threshold corresponding to the subject based on the width of a road related to the subject. The processing engine 122 (e.g., the perception results determination module 420) may determine whether the distance between the subject and each of the detected event participants is smaller than the distance threshold corresponding to the subject, and whether the distance associated with at least one event participant is smaller than the distances associated with all other detected event participants. In response to a determination that the distance between the subject and the at least one event participant is smaller than the distance threshold corresponding to the subject, and that the distance associated with the at least one event participant is smaller than the distances associated with all other detected event participants, the processing engine 122 may determine that distance as the distance between the subject and the nearest event participant.
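Merely by way of illustration, a minimal Python sketch of the selection described above is given below; the proportionality constant used to derive the threshold from the road width is a hypothetical value, since the present disclosure only states that the threshold is inversely proportional to the road width (with example thresholds of 3, 5, or 7 meters).

```python
def nearest_front_distance(front_distances, road_width_m, k=20.0):
    """front_distances: estimated distances (in meters) to event participants ahead.
    The threshold is assumed to be k / road_width_m (inversely proportional to the
    road width); k = 20.0 is a hypothetical constant."""
    threshold = k / road_width_m
    in_range = [d for d in front_distances if d < threshold]
    return min(in_range) if in_range else None  # None: no participant within the threshold
```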
In some embodiments, L= {l_1, l_2, …, l_T} included in the perception results denotes a location of the subject, for example, a relative location of the vehicle on the road based on the lane perception results. As an example, the processing engine 122 (e.g., the perception results determination module 420) may first detect lanes of the road from the received image frames through semantic segmentation. In this step, the processing engine 122 may use a deep learning architecture, for example, the DeepLabv3 model introduced by Google TM, to detect lanes for each frame of the video.
In the present disclosure, because the lane detection results are much sparser than the traffic participant detections and traffic light detections, the processing engine 122 (e.g., the perception results determination module 420) may generate a vehicle location feature for each frame of the video according to the lane detection results instead of using them directly. The vehicle location feature represents the vehicle's relative location on the road, which is selected from one of four categories: in the middle of the road, on the left side of the road, on the right side of the road, and unknown location.
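Merely by way of illustration, a minimal Python sketch of one way to derive the four-category vehicle location feature from a per-frame lane mask is given below; the column-centroid heuristic and its margins are assumptions, since the present disclosure only specifies the four output categories.

```python
import numpy as np

def vehicle_location_feature(lane_mask):
    """lane_mask: boolean (H, W) lane-pixel mask for one frame.
    Returns one of: "middle", "left", "right", "unknown"."""
    cols = np.where(lane_mask.any(axis=0))[0]
    if cols.size < 2:
        return "unknown"                      # too few lane pixels to decide
    lane_center = (cols.min() + cols.max()) / 2.0
    img_center = lane_mask.shape[1] / 2.0
    if abs(lane_center - img_center) < 0.1 * lane_mask.shape[1]:
        return "middle"
    # lanes appearing to the right of the image center imply the vehicle sits left of them
    return "left" if lane_center > img_center else "right"
```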
In 530, the processing engine 122 (e.g., the driving action recognition module 430) may determine one or more actions associated with the subject based on the second data set. In some embodiments, actions may have simple patterns that can be inferred directly from obtained data streams. For example, basic driving actions can be inferred from GPS/IMU signals S in a rule-based manner.
Specifically, the processing engine 122 (e.g., the driving action recognition module 430) may preprocess the signals by removing noise contained in the signals using one or more window functions. Exemplary window functions include the Hann and Hamming windows, the Blackman window, the Nuttall window, or the like, or any combination thereof.
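Merely by way of illustration, a minimal Python sketch of window-based smoothing of a 1-D GPS/IMU signal is given below; the choice of the Hann window and the window length are assumptions for illustration.

```python
import numpy as np

def smooth(signal, window_len=15):
    """Smooth a 1-D signal with a normalized Hann window (window_len is assumed)."""
    window = np.hanning(window_len)
    window /= window.sum()                      # normalize so the signal scale is preserved
    return np.convolve(signal, window, mode="same")
```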
Next, the forward acceleration signal and the yaw angular velocity signal are extracted from the GPS/IMU signals, which represent the vehicle's acceleration and angular velocity, respectively. The processing engine 122 (e.g., the driving action recognition module 430) may then threshold the yaw angular velocity signal with predefined thresholds to classify each time step into one of three wheeling classes: left wheeling, right wheeling, and going straight. Similarly, the processing engine 122 (e.g., the driving action recognition module 430) may threshold the forward acceleration signal to classify each time step into one of three accelerate classes: accelerate, brake, and cruise at a certain speed. Subsequently, the selected wheeling class and the accelerate class are crossed so that each time step is eventually classified into a driving action m_t, where m_t belongs to one of nine predefined possible driving actions: left accelerate, left cruise, left brake, straight accelerate, straight cruise, straight brake, right accelerate, right cruise, and right brake.
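Merely by way of illustration, a minimal Python sketch of the rule-based crossing of the wheeling class and the accelerate class is given below; the threshold values and the sign convention for left/right turning are assumptions, since the present disclosure only states that the thresholds are predefined.

```python
import numpy as np

YAW_THRESHOLD = 0.05   # rad/s, assumed value
ACC_THRESHOLD = 0.3    # m/s^2, assumed value

def classify_actions(yaw_rate, forward_acc):
    """yaw_rate, forward_acc: 1-D arrays (one value per time step, after smoothing).
    Returns one of the nine predefined driving actions per time step."""
    wheeling = np.where(yaw_rate > YAW_THRESHOLD, "left",
               np.where(yaw_rate < -YAW_THRESHOLD, "right", "straight"))
    accel = np.where(forward_acc > ACC_THRESHOLD, "accelerate",
            np.where(forward_acc < -ACC_THRESHOLD, "brake", "cruise"))
    # crossing the two classes yields, e.g., "left brake" or "straight cruise"
    return [f"{w} {a}" for w, a in zip(wheeling, accel)]
```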
In 540, the processing engine 122 (e.g., the driving attention inference module 440 and the driving intention inference module 450) may determine, based on one or more trained machine learning models, one or more inference results associated with the subject based on the one or more perception results and the second data set. In some embodiments, the inference results may include the subject's intentions, denoted as W= {w_1, w_2, …, w_T} , and attentions, denoted as A= { (a_1^mask, a_1^obj) , …, (a_T^mask, a_T^obj) } . Similar to the actions m_t generated in operation 530, the intentions w_t belong to one of the predefined intention categories. In the present disclosure, as an example, eight types of driver intentions are considered: following, left turn, right turn, left lane change, right lane change, left merge, right merge, and U-turn. Each determined attention includes two elements, where a_t^mask is the attention mask over the video frame v_t with pixel-wise attention intensity between 0 and 1, and a_t^obj is the attention object category indicating which category of detected object the subject is focusing on. As an example, eight predefined driver attention object categories are considered in the present disclosure: car, bus, truck, person, bicycle, motorcycle, tricycle, and traffic light.
In some embodiments, the processing engine 122 (e.g., the driving attention inference module 440 and the driving intention inference module 450) may adopt one or more trained machine learning models for image classification and result prediction. In some embodiments, the trained machine learning model may be constructed based on a decision tree model, a multiclass support vector machine (SVM) model, a K-nearest neighbors classifier, a Gradient Boosting Machine (GBM) model, a neural network model, or the like, or any combination thereof. The processing engine 122 (e.g., the driving attention inference module 440 and the driving intention inference module 450) may obtain the trained machine learning model from the storage device 140, the terminal device 130, or any other storage device. In some embodiments, the trained machine learning model may be determined by training a machine learning model using a training set. The training set may include a plurality of samples. Each of the plurality of samples in the training set may include a video. In some embodiments, each of the plurality of samples in the training set may correspond to a label. The label corresponding to a sample in the training set may denote a category to which the sample belongs.
In the present disclosure, a deep neural network based model ( "INFER" ) is implemented by the DBUS 100 to predict the intention and the attention mask simultaneously. The INFER consists of two components. First, an attention proposal network (APN) produces an attention mask a_t^mask representing the subject's attention intensity over detected event participants and surrounding objects. Second, an intention inference network produces probabilities over all intention categories as the intention output w_t.
In some embodiments, the processing engine 122 (e.g., the driving attention inference module 440 and the driving intention inference module 450) may first train the INFER, and the training procedure may have two stages. First, the APN is pre-trained with attention labels for a decent initialization. Second, the whole inference system (the intention network together with the APN) is trained jointly in an end-to-end manner with both attention and intention labels. More descriptions of the determination of the inference results may be found elsewhere in the present disclosure (e.g., FIG. 6, and descriptions thereof).
It should be noted that the above description is merely provided for the purpose of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, one or more other optional operations (e.g., a storing operation) may be added elsewhere in the exemplary process 500. In the storing operation, the processing engine 122 may store the data stream, the one or more perception results, and/or the inference results in a storage (e.g., the storage device 140) disclosed elsewhere in the present disclosure.
FIG. 6 is a flowchart illustrating an exemplary process for determining inference results associated with the subject according to some embodiments of the present disclosure. The process 600 may be executed by the DBUS 100. For example, the process 600 may be implemented as a set of instructions stored in the storage ROM 230 or RAM 240. The processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 600. The operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 600 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed. Additionally, the order of the operations of the process 600 as illustrated in FIG. 6 and described below is not intended to be limiting.
In 610, the processing engine 122 (e.g., the driving attention inference module 440) may input the second data set and the perception results into an attention proposal network (APN). As described elsewhere in the present disclosure (e.g., FIG. 5, and description thereof), the INFER takes a set of features {o_t, d_t, l_t, m_t, s_t} as input, in which o_t denotes the semantic masks of detected traffic participants and traffic lights; d_t denotes the distance between the ego-vehicle and the nearest traffic participants in the front; l_t denotes the vehicle's relative location on the road; m_t refers to the basic driving actions; and s_t represents the yaw angular velocity, forward acceleration, and vehicle speed in the GPS/IMU signals.
In 620, the processing engine 122 (e.g., the driving attention inference module 440) may determine the attention mask representing an attention intensity associated with the subject over participants and surrounding objects associated with the event. In some embodiments, the processing engine 122 may use more than one machine learning model to process the input received in operation 610. Because the input includes both spatial data and temporal data, in some embodiments, both convolutional neural networks (CNNs) and recurrent neural networks (RNNs) may be adopted.
CNNs are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product, and optionally follows it with a non-linearity. A simple CNN is a sequence of layers; in some embodiments, it may include three main types of layers: convolutional layers, pooling layers, and fully-connected layers. CNN architectures make the explicit assumption that the inputs are images, which allows encoding certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the number of parameters in the network.
RNNs are called recurrent because they perform the same task for every element of a sequence, with the output being dependent on the previous computations. In some embodiments, each cell of the RNN model may include an input layer, a hidden layer, and an output layer. The hidden layer may have one or more feedback loops. These feedback loops may provide RNNs with a type of "memory," in which past outputs from the hidden layer of a cell may inform future outputs from the hidden layer of another cell. Specifically, each feedback loop may provide an output from the hidden layer in a previous cell back to the hidden layer of the current cell as input for the current cell to inform the output of the current cell. This can enable RNNs to recurrently process sequence data (e.g., data that exists in an ordered sequence, like a route having a sequence of links) over a sequence of steps.
Specifically, in the present disclosure, the semantic mask o_t included in the input is fed into CNNs and their pooling layers to embed the semantic mask. Each CNN first uses various blocks of convolution and max-pooling layers to downsample the image and capture the semantic information included in the image. It then makes a class prediction at this level of granularity. Finally, it uses up-sampling and deconvolution layers to resize the image to its original dimensions, so the spatial information is recovered. Notably, the resolution of the generated attention mask is lower than the resolution of the semantic mask. To avoid generating an extremely sparse attention mask, the processing engine 122 may select a predetermined resolution ratio to reach a good compromise between attention granularity and sparsity. For example, the resolution ratio can be 100, meaning each pixel of the attention mask covers 100 pixels of the semantic mask.
Further, the temporal information included in the input, including {d_t, l_t, m_t, s_t} , is fed into RNNs. In some embodiments, the input first goes through a dense layer, which takes sequences as input and applies the same dense layer on every vector. Next, the processing engine 122 (e.g., the driving attention inference module 440) may apply a regularization method (e.g., Dropout) in which input and recurrent connections to Long Short-Term Memory (LSTM) units are probabilistically excluded from activation and weight updates while training the network. Finally, the processing engine 122 (e.g., the driving attention inference module 440) feeds the processed input into the LSTM network to extract the temporal information. In some embodiments, a bi-LSTM network may be used instead of an LSTM, depending on the on-line or off-line mode of the DBUS 100.
The outputs from the CNNs and the RNNs are concatenated before being input into another set of CNNs and pooling layers. Finally, an attention mask a_t^mask representing the driver's attention intensity over detected traffic participants and traffic lights is produced as the final output of the APN.
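Merely by way of illustration, a minimal PyTorch (Python) sketch of an APN of this shape is given below; the layer sizes, the 17-dimensional temporal feature (distance, one-hot location, one-hot action, and three GPS/IMU signals), and the 9x16 output grid (matching a resolution ratio of about 100 with respect to a 90x160 semantic mask) are assumptions, not the disclosed architecture.

```python
import torch
import torch.nn as nn

class AttentionProposalNet(nn.Module):
    """Sketch of an APN: a CNN branch embeds the 2-channel semantic mask, an LSTM
    branch embeds the per-step temporal features, and the fused result is decoded
    into a coarse pixel-wise attention mask in [0, 1]."""

    def __init__(self, temporal_dim=17, hidden=64):
        super().__init__()
        self.mask_encoder = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 90x160 -> 45x80
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(5),  # 45x80 -> 9x16
        )
        self.dense = nn.Linear(temporal_dim, hidden)
        self.drop = nn.Dropout(0.5)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Conv2d(32 + hidden, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1), nn.Sigmoid(),  # attention intensity in [0, 1]
        )

    def forward(self, masks, feats):
        # masks: (B, T, 2, 90, 160); feats: (B, T, temporal_dim)
        b, t = masks.shape[:2]
        spatial = self.mask_encoder(masks.flatten(0, 1))                   # (B*T, 32, 9, 16)
        seq, _ = self.lstm(self.drop(torch.relu(self.dense(feats))))       # (B, T, hidden)
        tiled = seq.flatten(0, 1)[:, :, None, None].expand(-1, -1, 9, 16)  # broadcast over grid
        att = self.head(torch.cat([spatial, tiled], dim=1))                # (B*T, 1, 9, 16)
        return att.view(b, t, 9, 16), seq                                  # coarse attention masks
```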
In 630, the processing engine 122 (e.g., the driving intention inference module 450) may input the second data set and the perception results into an intention inference network. In some embodiments, the intention inference network works with the same input feature set as that used in the APN. In some embodiments, the processing engine 122 (e.g., the driving intention inference module 450) first converts the semantic mask included in the perception results into an attention-weighted semantic mask by multiplying it with the attention mask generated by the APN in operation 620.
In 640, the processing engine 122 (e.g., the driving intention inference module 450) may determine the intention output by generating probabilities over a plurality of predefined intention categories. The attention-weighted semantic mask, as well as the output from the LSTM model of the APN, is input into another set of CNNs and pooling layers. The outputs from the CNNs and pooling layers also go through a series of processes such as concatenation, dense, dropout, LSTM, bi-LSTM, or the like, or any suitable combination thereof. The details of these processes are similar to those described in operation 620 and are not repeated herein. The final intention output of the intention inference network is w_t, representing probabilities over all intention categories.
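Merely by way of illustration, a minimal PyTorch (Python) sketch of an intention inference network of this shape is given below; it reuses the outputs of the APN sketch above, and the layer sizes and the single-LSTM head (rather than, e.g., a bi-LSTM) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntentionInferenceNet(nn.Module):
    """Sketch: weight the semantic mask by the APN attention, embed it with a small
    CNN, fuse it with the APN's LSTM features, and output per-step probabilities
    over the predefined intention categories."""

    def __init__(self, hidden=64, n_intentions=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.lstm = nn.LSTM(16 + hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_intentions)

    def forward(self, masks, att, apn_seq):
        # masks: (B, T, 2, 90, 160) float; att: (B, T, 9, 16); apn_seq: (B, T, hidden)
        b, t = masks.shape[:2]
        att_full = F.interpolate(att.flatten(0, 1)[:, None], size=(90, 160), mode="nearest")
        weighted = masks.flatten(0, 1) * att_full               # attention-weighted semantic mask
        emb = self.encoder(weighted).flatten(1).view(b, t, -1)  # (B, T, 16)
        seq, _ = self.lstm(torch.cat([emb, apn_seq], dim=-1))   # (B, T, hidden)
        return F.softmax(self.out(seq), dim=-1)                 # probabilities per intention category
```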
In 650, the processing engine 122 (e.g., the driving attention post-processing module 460) may determine an attention object category by matching the attention mask with the semantic mask at each time step. After the intention w_t and the attention mask a_t^mask are inferred by the INFER model, the processing engine 122 (e.g., the driving attention post-processing module 460) may post-process the generated attention mask a_t^mask by matching it with the semantic mask o_t to find the category of the object with the highest attention at each time step, denoted as a_t^obj. Specifically, the processing engine 122 (e.g., the driving attention post-processing module 460) may first obtain the list of all detected event participants (e.g., traffic participants) and surrounding objects (e.g., traffic lights) from the semantic mask, and then calculate the average attention of each detected object over all of its pixels. The processing engine 122 (e.g., the driving attention post-processing module 460) may further determine the object with the highest average attention intensity and output its category as a_t^obj. In some embodiments, if no detected object has an average attention value above a predefined threshold, the processing engine 122 (e.g., the driving attention post-processing module 460) may set a_t^obj to a special category to indicate that no obvious attention exists in the frame.
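Merely by way of illustration, a minimal Python sketch of this post-processing step is given below; the special "no obvious attention" value and the threshold are assumptions, since the present disclosure only states that a predefined threshold and a special category are used.

```python
import numpy as np

NO_ATTENTION = -1   # hypothetical id for the special "no obvious attention" category

def attention_object_category(att_mask, detected_objects, threshold=0.3):
    """att_mask: (H, W) attention intensities in [0, 1];
    detected_objects: list of (category_id, boolean_pixel_mask) taken from the
    semantic mask; threshold is an assumed value."""
    best_cat, best_score = NO_ATTENTION, threshold
    for category_id, pixels in detected_objects:
        if pixels.any():
            score = att_mask[pixels].mean()     # average attention over the object's pixels
            if score > best_score:
                best_cat, best_score = category_id, score
    return best_cat
```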
In 660, the processing engine 122 (e.g., the driving attention post-processing module 460) may output a behavior representation associated with the subject for the event. In some implementations, the behavior representation, denoted as B= (M, W, A) , includes the actions M= {m_1, …, m_T} , the intentions W= {w_1, …, w_T} , and the attentions A= { (a_1^mask, a_1^obj) , …, (a_T^mask, a_T^obj) } . Each attention includes two components, the object category a_t^obj and the attention mask a_t^mask, that are associated with the subject over the whole time horizon T. In some embodiments, the behavior representation B is considered the final output of the DBUS 100.
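Merely by way of illustration, the final output B= (M, W, A) may be organized as in the following Python sketch; the field names and array shapes are assumptions for illustration.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BehaviorRepresentation:
    """B = (M, W, A): the DBUS output for one scenario of T time steps (illustrative)."""
    actions: np.ndarray            # M: (T,) driving action per time step
    intentions: np.ndarray         # W: (T, 8) probabilities over the intention categories
    attention_masks: np.ndarray    # (T, 9, 16) pixel-wise attention intensities
    attention_objects: np.ndarray  # (T,) attended object category per time step
```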
In some embodiments, the processing engine 122 may transmit a signal to a terminal (e.g., a mobile terminal associated with the terminal device 130), a server associated with an online to offline platform, etc. The signal may include the final output. In some embodiments, the signal may be configured to direct the terminal to display the final output to a user associated with the terminal. In some embodiments, the processing engine 122 and/or the server associated with the online to offline platform may determine an event associated with the one or more detected objects presented in the specific image data based on the final output. For example, the processing engine 122 may determine whether a detected action, a predicted intention, or a predicted attention of a driver requires intervention by a third party. Further, if the processing engine 122 determines that a detected action, a predicted intention, or a predicted attention of the driver requires intervention by a third party, the processing engine 122 may generate an alert, call the police, etc.
In some embodiments, the processing engine 122 may store the final outputs in a database, which can be local or remote. In some implementations, the processing engine 122 may update the target machine learning model by updating the training set using the data and final outputs stored in the database.
It should be noted that the above description is merely provided for the purpose of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, one or more other optional operations (e.g., a storing operation) may be added elsewhere in the exemplary process 600. In the storing operation, the processing engine 122 may store the attention mask, the intention output, the attention object category, and/or the behavior representation in a storage (e.g., the storage device 140) disclosed elsewhere in the present disclosure. In some embodiments, one or more operations in the process 600 may be changed or omitted. For example, operation 650 may be omitted in some embodiments.
FIG. 7 is a flowchart illustrating an exemplary process for a behavior-based retrieval of one or more driving scenarios from a database based on a structured behavior representation, according to some embodiments of the present disclosure. The process 700 may be executed by the DBUS 100. For example, the process 700 may be implemented as a set of instructions stored in the storage ROM 230 or RAM 240. The processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 700. The operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 700 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed. Additionally, the order of the operations of the process 700 as illustrated in FIG. 7 and described below is not intended to be limiting.
In 710, the processing engine 122 (e.g., the obtaining module 410) may receive a query input for matching a scene associated with a subject to one or more driving scenarios stored in a database. As an example, the processing engine 122 may obtain the scene from a storage medium (e.g., the storage device 140, or the ROM 230 or the RAM 240 of the processing engine 122) and/or the terminal device 130. As another example, the processing engine 122 may obtain the scene from a user in the form of a query input.
In 720, the processing engine 122 (e.g., the one or more modules depicted in FIG. 4) may determine a structured behavior representation associated with the scene. Similar to the behavior representation B described in FIG. 6, the structured behavior representation B′ determined in this operation also includes actions {m_t} , intentions {w_t} , and attentions, each of which includes an object category a_t^obj and an attention mask a_t^mask associated with the subject over the whole time horizon T. In some embodiments, the scene included in the query input is based on real-world data, for example, a new driving scene received at the processing engine 122 for analysis. In such embodiments, the received scene is first processed by the DBUS 100 to generate the structured behavior representation B′. More descriptions of the determination of the structured behavior representation B′ may be found elsewhere in the present disclosure (e.g., FIGS. 5 and 6, and descriptions thereof).
In some embodiments, the scene included in the query input is conceptual, that is, there is no real data associated with the scene. In such embodiments, the query input is formulated directly in the form of a behavior representation B, so the user can check whether at least one scenario stored in the database is conceptually similar to what the user is seeking.
In 730, the processing engine 122 (e.g., the obtaining module 410) may retrieve one or more scenarios that are most similar to the structured behavior representation from the database by a ball tree search. In some embodiments, the processing engine 122 takes a flattened concatenation of {w_t, a_t^obj} included in the structured behavior representation B′ as the feature vector of the scene to produce a T × (D_b + D_a) -dimensional vector for the scene, where T is the number of timestamps, D_b is the number of different behaviors, and D_a is the number of different object categories.
Next, the processing engine 122 (e.g., the obtaining module 410) may perform a fast nearest neighbor search in the database to retrieve the most similar scenarios given the query, where the retrieved scenarios are also in the form of {w_t, a_t^obj} , which is either extracted from a query video or directly created by a user. In some embodiments, the original behavior and category vectors are converted to categorical values, and a Hamming distance is used as the metric, significantly reducing the time to build the ball tree and further improving the search speed.
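Merely by way of illustration, a minimal Python sketch of the categorical ball-tree index with a Hamming metric is given below; the use of scikit-learn's BallTree and the value of k are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import BallTree

def build_index(scenario_vectors):
    """scenario_vectors: (N, 2*T) array of categorical values, the per-step
    intention category w_t and attention object category a_t^obj flattened
    over the time horizon T for each stored scenario."""
    return BallTree(scenario_vectors, metric="hamming")

def retrieve(tree, query_vector, k=5):
    """query_vector: (2*T,) categorical vector built from the structured
    behavior representation B' of the query scene."""
    dist, idx = tree.query(query_vector.reshape(1, -1), k=k)
    return idx[0], dist[0]   # indices and Hamming distances of the most similar scenarios
```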
It should be noted that the above description is merely provided for the purpose of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, one or more other optional operations (e.g., a storing operation) may be added elsewhere in the exemplary process 700. In the storing operation, the processing engine 122 may store the query input, the structured behavior representation, and/or the one or more retrieved scenarios in a storage (e.g., the storage device 140) disclosed elsewhere in the present disclosure.
Having thus described the basic concepts, it may be rather apparent to those skilled in the art after reading this detailed disclosure that the foregoing detailed disclosure is intended to be presented by way of example only and is not limiting. Various alterations, improvements, and modifications may occur and are intended to those skilled in the art, though not expressly stated herein. These alterations,  improvements, and modifications are intended to be suggested by this disclosure and are within the spirit and scope of the exemplary embodiments of this disclosure.
Moreover, certain terminology has been used to describe embodiments of the present disclosure. For example, the terms “one embodiment, ” “an embodiment, ” and/or “some embodiments” mean that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Therefore, it is emphasized and should be appreciated that two or more references to “an embodiment” or “one embodiment” or “an alternative embodiment” in various portions of this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined as suitable in one or more embodiments of the present disclosure.
Further, it will be appreciated by one skilled in the art that aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or contexts, including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.) or in a combination of software and hardware implementations that may all generally be referred to herein as a "unit," "module," or "system." Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including electro-magnetic, optical, or the like, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer  readable signal medium may be transmitted using any appropriate medium, including wireless, wireline, optical fiber cable, RF, or the like, or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present disclosure may be written in a combination of one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python or the like, conventional procedural programming languages, such as the "C" programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) or in a cloud computing environment or offered as a service such as a Software as a Service (SaaS).
Furthermore, the recited order of processing elements or sequences, or the use of numbers, letters, or other designations, therefore, is not intended to limit the claimed processes and methods to any order except as may be specified in the claims. Although the above disclosure discusses through various examples what is currently considered to be a variety of useful embodiments of the disclosure, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover modifications and equivalent arrangements that are within the spirit and scope of the disclosed embodiments. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be implemented as a software only solution, for example, an installation on an existing server or mobile device.
Similarly, it should be appreciated that in the foregoing description of embodiments of the present disclosure, various features are sometimes grouped in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed subject matter requires more features than are expressly recited in each claim. Rather, inventive embodiments lie in less than all features of a single foregoing disclosed embodiment.
In some embodiments, the numbers expressing quantities or properties used to describe and claim certain embodiments of the application are to be understood as being modified in some instances by the term “about, ” “approximate, ” or “substantially. ” For example, “about, ” “approximate, ” or “substantially” may indicate ±20%variation of the value it describes, unless otherwise stated. Accordingly, in some embodiments, the numerical parameters set forth in the written description and attached claims are approximations that may vary depending upon the desired properties sought to be obtained by a particular embodiment. In some embodiments, the numerical parameters should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of some embodiments of the application are approximations, the numerical values set forth in the specific examples are reported as precisely as practicable.
Each of the patents, patent applications, publications of patent applications, and other material, such as articles, books, specifications, publications, documents, things, and/or the like, referenced herein is hereby incorporated herein by this reference in its entirety for all purposes, excepting any prosecution file history associated with same, any of same that is inconsistent with or in conflict with the present document, or any of same that may have a limiting effect as to the broadest scope of the claims now or later associated with the present document. By way of example, should there be any inconsistency or conflict between the description, definition, and/or the use of a term associated with any of the incorporated material and that associated with the present document, the description, definition, and/or the use of the term in the present document shall prevail.
In closing, it is to be understood that the embodiments of the application disclosed herein are illustrative of the principles of the embodiments of the application. Other modifications that may be employed may be within the scope of the application. Thus, by way of example, but not of limitation, alternative configurations of the embodiments of the application may be utilized in accordance with the teachings herein. Accordingly, embodiments of the present application are not limited to that precisely as shown and described.

Claims (25)

  1. A system for analyzing a subject behavior, the system comprising:
    at least one storage medium including a set of instructions; and
    at least one processor in communication with the at least one storage medium, wherein when executing the instructions, the at least one processor is configured to direct the system to perform operations including:
    obtaining data stream acquired by one or more sensors associated with a subject during a time period of an event, wherein the data stream comprises a first data set and a second data set;
    generating one or more perception results associated with the subject based on the first data set and the second data set;
    determining one or more actions associated with the subject based on the second data set; and
    determining, based on one or more trained machine learning models, one or more inference results associated with the subject based on the one or more perception results and the second data set.
  2. The system of claim 1, wherein the first data set comprises data obtained from one or more camera video frames, and wherein the second data set comprises data obtained from Global Positioning System (GPS) signals and Inertial Measurement Unit (IMU) signals.
  3. The system of claim 2, wherein the one or more perception results comprise one or more semantic masks associated with one or more participants and one or more surrounding objects associated with the event, a location of the subject, and a distance between the subject and nearest participants.
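By way of a non-limiting illustration of the perception results recited in claim 3, the short Python sketch below computes the distance between the subject and its nearest participant from ground-plane positions. The coordinate convention (metres in a ground-plane frame) and the helper name are assumptions introduced here for illustration only and are not elements of the disclosure.

```python
import numpy as np

# Hypothetical helper for the "distance between the subject and nearest
# participants" element of claim 3. Participant positions are assumed to be
# ground-plane coordinates (metres) recovered from the camera frames; the
# subject location is assumed to come from the GPS signal.

def nearest_participant_distance(subject_xy, participant_xy):
    """Return the Euclidean distance to the closest participant."""
    participant_xy = np.asarray(participant_xy, dtype=float)
    if participant_xy.size == 0:
        return float("inf")  # no participants in the scene
    deltas = participant_xy - np.asarray(subject_xy, dtype=float)
    return float(np.min(np.linalg.norm(deltas, axis=1)))

# Example: subject at the origin, participants 5 m ahead and 12 m to the left.
print(nearest_participant_distance((0.0, 0.0), [(0.0, 5.0), (-12.0, 0.0)]))  # 5.0
```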
  4. The system of any one of claims 1 to 3, wherein to determine one or more actions associated with the subject, the at least one processor is further configured to direct the system to perform operations comprising:
    classifying each time step over the time period into a wheeling class based on first feature information included in the second data set;
    classifying each time step over the time period into an accelerate class based on second feature information included in the second data set; and
    determining the one or more actions by crossing the wheeling class and the accelerate class.
  5. The system of claim 4, wherein the first feature information is obtained from a yaw angular velocity signal included in the second data set, and wherein the second feature information is obtained from a forward acceleration signal included in the second data set.
  6. The system of claim 4, wherein the one or more actions belong to a plurality of predefined action categories, and wherein the plurality of predefined action categories comprise a left accelerate, a left cruise, a left brake, a straight accelerate, a straight cruise, a straight brake, a right accelerate, a right cruise, and a right brake.
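Claims 4-6 recite classifying each time step into a wheeling class and an accelerate class and crossing the two into nine action categories. The Python sketch below illustrates one possible realization; the numeric thresholds and the sign convention (positive yaw rate taken as a left turn) are assumptions for illustration, as the claims do not specify them.

```python
import numpy as np

# Illustrative sketch of claims 4-6: per-step wheeling class from yaw angular
# velocity, per-step accelerate class from forward acceleration, crossed into
# one of nine action categories. Thresholds are assumed values, not claimed.

YAW_THRESH = 0.05   # rad/s, assumed boundary between straight and turning
ACC_THRESH = 0.5    # m/s^2, assumed boundary between cruise and accel/brake

def wheeling_class(yaw_rate):
    if yaw_rate > YAW_THRESH:      # assumed sign convention: positive = left
        return "left"
    if yaw_rate < -YAW_THRESH:
        return "right"
    return "straight"

def accelerate_class(forward_acc):
    if forward_acc > ACC_THRESH:
        return "accelerate"
    if forward_acc < -ACC_THRESH:
        return "brake"
    return "cruise"

def action_per_step(yaw_rates, forward_accs):
    """Cross the two per-step labels into the nine action categories."""
    return [
        f"{wheeling_class(w)} {accelerate_class(a)}"
        for w, a in zip(yaw_rates, forward_accs)
    ]

# Example: a left turn while braking, then straight cruising.
print(action_per_step([0.2, 0.0], [-1.0, 0.1]))  # ['left brake', 'straight cruise']
```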
  7. The system of claim 3, wherein to determine one or more inference results associated with the subject, the at least one processor is further configured to direct the system to perform operations comprising:
    determining an attention mask representing an attention associated with the subject based on a first machine learning model; and
    determining an intention output representing an intention associated with the subject based on a second machine learning model.
  8. The system of claim 7, wherein to determine the attention mask representing the attention associated with the subject based on the first machine learning model, the at least one processor is further configured to direct the system to perform operations comprising:
    inputting the perception results and the second data set into an attention proposal network (APN), wherein the APN is pre-trained by convolutional neural networks (CNNs) and recurrent neural networks (RNNs); and
    determining the attention mask representing an attention intensity associated with the subject over participants and surrounding objects associated with the event.
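Claims 7-8 recite an attention proposal network (APN) pre-trained with CNNs and RNNs that outputs an attention mask over participants and surrounding objects. The PyTorch sketch below shows one minimal architecture consistent with that description; the layer sizes, the fusion by concatenation, and the input dimensions are illustrative assumptions rather than features of the disclosure.

```python
import torch
import torch.nn as nn

# Minimal sketch of an attention proposal network in the spirit of claims 7-8:
# a CNN encodes the semantic masks, a GRU encodes the GPS/IMU sequence, and
# the fused features are decoded into a spatial attention-intensity mask.

class AttentionProposalNet(nn.Module):
    def __init__(self, mask_channels=8, imu_dim=6, hidden=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(mask_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, hidden, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.rnn = nn.GRU(imu_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Conv2d(2 * hidden, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1), nn.Sigmoid(),  # attention intensity in [0, 1]
        )

    def forward(self, semantic_masks, imu_seq):
        # semantic_masks: (B, C, H, W); imu_seq: (B, T, imu_dim)
        spatial = self.cnn(semantic_masks)                    # (B, hidden, h, w)
        _, state = self.rnn(imu_seq)                          # (1, B, hidden)
        state = state[-1][:, :, None, None].expand(-1, -1, *spatial.shape[2:])
        return self.head(torch.cat([spatial, state], dim=1))  # (B, 1, h, w)

# Example shapes: one sample, 8-channel semantic mask at 64x64, 20 IMU steps.
apn = AttentionProposalNet()
att = apn(torch.zeros(1, 8, 64, 64), torch.zeros(1, 20, 6))
print(att.shape)  # torch.Size([1, 1, 16, 16])
```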
  9. The system of claim 8, wherein to determine the intention output representing the intention associated with the subject based on the second machine learning model, the at least one processor is further configured to direct the system to perform operations comprising:
    inputting the perception results and the second data set into an intention inference network, wherein the semantic mask included in the perception result is pre-converted to an attention weighted semantic mask, and wherein the intention inference network is pre-trained by CNNs and RNNs; and
    determining the intention output by generating probabilities over a plurality of predefined intention categories.
  10. The system of claim 9, wherein the semantic mask included in the perception result is pre-converted to the attention weighted semantic mask by multiplying with the attention mask produced by the APN.
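Claims 9-10 recite an intention inference network that consumes the attention-weighted semantic mask (the semantic mask multiplied element-wise by the APN output) together with the second data set and emits probabilities over predefined intention categories. The sketch below mirrors that flow; the number of intention classes (five here) and the network sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of claims 9-10: weight the semantic mask by the attention mask,
# then combine CNN features of the weighted mask with an RNN summary of the
# GPS/IMU sequence to produce intention probabilities.

class IntentionNet(nn.Module):
    def __init__(self, mask_channels=8, imu_dim=6, hidden=32, n_intentions=5):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(mask_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.rnn = nn.GRU(imu_dim, hidden, batch_first=True)
        self.fc = nn.Linear(16 + hidden, n_intentions)

    def forward(self, semantic_masks, attention_mask, imu_seq):
        # Claim 10: attention-weighted semantic mask via element-wise product.
        att = F.interpolate(attention_mask, size=semantic_masks.shape[2:])
        weighted = semantic_masks * att
        visual = self.cnn(weighted).flatten(1)        # (B, 16)
        _, state = self.rnn(imu_seq)                  # (1, B, hidden)
        logits = self.fc(torch.cat([visual, state[-1]], dim=1))
        return logits.softmax(dim=1)                  # probability per intention

# Example: shapes compatible with the APN sketch above.
net = IntentionNet()
probs = net(torch.zeros(1, 8, 64, 64), torch.zeros(1, 1, 16, 16), torch.zeros(1, 20, 6))
print(probs.shape, float(probs.sum()))  # torch.Size([1, 5]), ~1.0
```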
  11. The system of claim 9, wherein the at least one processor is further configured to direct the system to perform additional operations comprising:
    determining an attention object category by matching the attention mask with the semantic mask at each time step; and
    outputting a behavior representation associated with the subject for the event, wherein the behavior representation comprises the action, the intention, the attention object category, and the attention mask associated with the subject.
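Claim 11 recites matching the attention mask against the semantic mask at each time step to obtain an attention object category. A minimal per-step matching rule, assuming one-hot per-class semantic channels and an illustrative class list, could look as follows.

```python
import numpy as np

# Sketch of the matching step in claim 11: compare the attention mask against
# the per-class channels of the semantic mask and report the class whose
# region receives the most attention. The class list is an assumption.

CLASSES = ["vehicle", "pedestrian", "cyclist", "traffic_light"]

def attention_object_category(attention_mask, semantic_mask):
    """attention_mask: (H, W) in [0, 1]; semantic_mask: (C, H, W) one-hot."""
    scores = (semantic_mask * attention_mask[None]).sum(axis=(1, 2))
    return CLASSES[int(np.argmax(scores))]

# Example: all attention on the pixel occupied by a pedestrian.
sem = np.zeros((4, 2, 2)); sem[1, 0, 0] = 1.0   # pedestrian at pixel (0, 0)
att = np.zeros((2, 2)); att[0, 0] = 1.0
print(attention_object_category(att, sem))      # pedestrian
```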
  12. The system of any of claims 1-10, wherein the at least one processor is further configured to direct the system to perform additional operations comprising:
    receiving a query input for matching a scene associated with a subject to one or more scenarios stored in a database;
    determining a structured behavior representation associated with the scene; and
    retrieving one or more scenarios that are most similar to the structured behavior representation from the database by a ball tree search.
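Claim 12 recites retrieving the stored scenarios most similar to a structured behavior representation by a ball tree search. The sketch below uses scikit-learn's BallTree over an assumed fixed-length vector encoding of the behavior representation; the six-dimensional encoding and the random database are placeholders for illustration.

```python
import numpy as np
from sklearn.neighbors import BallTree

# Sketch of the retrieval step in claim 12: structured behavior
# representations are embedded as fixed-length vectors, indexed with a ball
# tree built offline, and the k nearest stored scenarios are returned.

rng = np.random.default_rng(0)
scenario_vectors = rng.random((1000, 6))   # stand-in database of scenarios
tree = BallTree(scenario_vectors)          # built once over the database

def retrieve_similar_scenarios(query_vector, k=3):
    """Return indices and distances of the k most similar stored scenarios."""
    dist, idx = tree.query(np.asarray(query_vector).reshape(1, -1), k=k)
    return idx[0], dist[0]

idx, dist = retrieve_similar_scenarios(rng.random(6))
print(idx, dist)  # indices of the three nearest stored scenarios
```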
  13. A method for analyzing a subject behavior, the method comprising:
    obtaining a data stream acquired by one or more sensors associated with a subject during a time period of an event, wherein the data stream comprises a first data set and a second data set;
    generating one or more perception results associated with the subject based on the first data set and the second data set;
    determining one or more actions associated with the subject based on the second data set; and
    determining, based on one or more trained machine learning models, one or more inference results associated with the subject based on the one or more perception results and the second data set.
  14. The method of claim 13, wherein the first data set comprises data obtained from one or more camera video frames, and wherein the second data set comprises data obtained from Global Positioning System (GPS) signals and Inertial Measurement Unit (IMU) signals.
  15. The method of claim 14, wherein the one or more perception results comprise one or more semantic masks associated with one or more participants and one or  more surrounding objects associated with the event, a location of the subject, and a distance between the subject and nearest participants.
  16. The method of any one of claims 13 to 15, wherein determining one or more actions associated with the subject comprises:
    classifying each time step over the time period into a wheeling class based on first feature information included in the second data set;
    classifying each time step over the time period into an accelerate class based on second feature information included in the second data set; and
    determining the one or more actions by crossing the wheeling class and the accelerate class.
  17. The method of claim 16, wherein the first feature information is obtained from a yaw angular velocity signal included in the second data set, and wherein the second feature information is obtained from a forward acceleration signal included in the second data set.
  18. The method of claim 16, wherein the one or more actions belong to a plurality of predefined action categories, and wherein the plurality of predefined action categories comprise a left accelerate, a left cruise, a left brake, a straight accelerate, a straight cruise, a straight brake, a right accelerate, a right cruise, and a right brake.
  19. The method of claim 15, wherein determining one or more inference results associated with the subject comprises:
    determining an attention mask representing an attention associated with the subject based on a first machine learning model; and
    determining an intention output representing an intention associated with the subject based on a second machine learning model.
  20. The method of claim 19, wherein determining the attention mask representing the attention associated with the subject based on the first machine learning model comprises:
    inputting the perception results and the second data set into an attention proposal network (APN), wherein the APN is pre-trained by convolutional neural networks (CNNs) and recurrent neural networks (RNNs); and
    determining the attention mask representing an attention intensity associated with the subject over participants and surrounding objects associated with the event.
  21. The method of claim 20, wherein determining the intention output representing the intention associated with the subject based on the second machine learning model comprises:
    inputting the perception results and the second data set into an intention inference network, wherein the semantic mask included in the perception result is pre-converted to an attention weighted semantic mask, and wherein the intention inference network is pre-trained by CNNs and RNNs; and
    determining the intention output by generating probabilities over a plurality of predefined intention categories.
  22. The method of claim 21, wherein the semantic mask included in the perception result is pre-converted to the attention weighted semantic mask by multiplying with the attention mask produced by the APN.
  23. The method of claim 21, further comprising:
    determining an attention object category by matching the attention mask with the semantic mask at each time step; and
    outputting a behavior representation associated with the subject for the event, wherein the behavior representation comprises the action, the intention, the attention object category, and the attention mask associated with the subject.
  24. The method of any of claims 13-23, further comprising:
    receiving a query input for matching a scene associated with a subject to one or more scenarios stored in a database;
    determining a structured behavior representation associated with the scene; and
    retrieving one or more scenarios that are most similar to the structured behavior representation from the database by a ball tree search.
  25. A non-transitory computer-readable storage medium embodying a computer program product, the computer program product comprising instructions for analyzing a subject behavior, wherein the instructions are configured to cause a computing device to perform operations comprising:
    obtaining a data stream acquired by one or more sensors associated with a subject during a time period of an event, wherein the data stream comprises a first data set and a second data set;
    generating one or more perception results associated with the subject based on the first data set and the second data set;
    determining one or more actions associated with the subject based on the second data set; and
    determining, based on one or more trained machine learning models, one or more inference results associated with the subject based on the one or more perception results and the second data set.

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/108120 WO2021056327A1 (en) 2019-09-26 2019-09-26 Systems and methods for analyzing human driving behavior

Publications (1)

Publication Number Publication Date
WO2021056327A1 (en) 2021-04-01

Family

ID=75166268

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/108120 WO2021056327A1 (en) 2019-09-26 2019-09-26 Systems and methods for analyzing human driving behavior

Country Status (1)

Country Link
WO (1) WO2021056327A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018007171A1 (en) * 2016-07-06 2018-01-11 Audi Ag Method for the improved detection of objects by a driver assistance system
CN107862862A (en) * 2016-09-22 2018-03-30 杭州海康威视数字技术股份有限公司 A kind of vehicle behavior analysis method and device
CN108241866A (en) * 2016-12-26 2018-07-03 中国移动通信有限公司研究院 A kind of method, apparatus guided to driving behavior and vehicle
CN107697070A (en) * 2017-09-05 2018-02-16 百度在线网络技术(北京)有限公司 Driving behavior Forecasting Methodology and device, unmanned vehicle

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297530A (en) * 2021-04-15 2021-08-24 南京大学 Automatic driving black box test system based on scene search
CN113297530B (en) * 2021-04-15 2024-04-09 南京大学 Automatic driving black box test system based on scene search

Similar Documents

Publication Publication Date Title
US10832418B1 (en) Object velocity from images
US11221413B2 (en) Three-dimensional object detection
US11137762B2 (en) Real time decision making for autonomous driving vehicles
WO2019223582A1 (en) Target detection method and system
US20200117916A1 (en) Deep learning continuous lane lines detection system for autonomous vehicles
US11427210B2 (en) Systems and methods for predicting the trajectory of an object with the aid of a location-specific latent map
US20200175691A1 (en) Real time object behavior prediction
JP2020522798A (en) Device and method for recognizing driving behavior based on motion data
WO2020001261A1 (en) Systems and methods for estimating an arrival time of a vehicle
US10922817B2 (en) Perception device for obstacle detection and tracking and a perception method for obstacle detection and tracking
US20220156939A1 (en) Systems and Methods for Video Object Segmentation
WO2016156236A1 (en) Method and electronic device
US11840261B2 (en) Ground truth based metrics for evaluation of machine learning based models for predicting attributes of traffic entities for navigating autonomous vehicles
CN114514535A (en) Instance segmentation system and method based on semantic segmentation
US11467579B2 (en) Probabilistic neural network for predicting hidden context of traffic entities for autonomous vehicles
CN111860227A (en) Method, apparatus, and computer storage medium for training trajectory planning model
US20230252796A1 (en) Self-supervised compositional feature representation for video understanding
US20230222671A1 (en) System for predicting near future location of object
US11537819B1 (en) Learned state covariances
Ahmed et al. A smart IoT enabled end-to-end 3D object detection system for autonomous vehicles
CN114048536A (en) Road structure prediction and target detection method based on multitask neural network
WO2021056327A1 (en) Systems and methods for analyzing human driving behavior
CN111542295A (en) Automatic driving method and system for intelligent wheelchair and computer readable medium
Kang et al. ETLi: Efficiently annotated traffic LiDAR dataset using incremental and suggestive annotation
CN115668309A (en) Object classification based on frame and event camera processing and related applications

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19946452; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19946452; Country of ref document: EP; Kind code of ref document: A1)