WO2023221848A1 - Method, device, storage medium and program product for predicting vehicle starting behavior

Method, device, storage medium and program product for predicting vehicle starting behavior

Info

Publication number
WO2023221848A1
WO2023221848A1 (PCT/CN2023/093436)
Authority
WO
WIPO (PCT)
Prior art keywords
frame
image
vehicle
data
road
Prior art date
Application number
PCT/CN2023/093436
Other languages
English (en)
French (fr)
Inventor
葛彦悟
李向旭
张亦涵
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2023221848A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion of extracted features
    • G06V10/82 Arrangements for image or video recognition or understanding using neural networks
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/588 Recognition of the road, e.g. of lane markings; Recognition of the vehicle driving pattern in relation to the road
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Definitions

  • This application relates to the field of intelligent driving technology, and in particular to a method, device, storage medium and program product for predicting vehicle starting behavior.
  • In autonomous driving, the autonomous vehicle is the host vehicle, and the other vehicles on the road are obstacle vehicles.
  • During driving, the host vehicle can predict the behavior of the obstacle vehicles, so that it can automatically plan and control the driving trajectory of the host vehicle based on the prediction results, thereby reducing the probability of collision with an obstacle vehicle.
  • The behavior prediction of an obstacle vehicle includes predicting whether the obstacle vehicle has starting behavior.
  • In the related art, the host vehicle obtains environmental data collected by multiple sensors on the host vehicle, including lidar, cameras, millimeter-wave radar, ultrasonic radar, and the like.
  • The host vehicle fuses these environmental data collected by the multi-source sensors to determine the perception results,
  • that is, perception information such as the obstacle vehicle's position, speed, and heading, as well as lane lines, traffic lights, and the like.
  • The host vehicle then predicts whether the obstacle vehicle has starting behavior based on the perception results, the host vehicle positioning results, and a high definition map (HDMAP).
  • However, the detection results of radar often have jitter and deviation, which affects the accuracy of predicting the starting behavior of obstacle vehicles.
  • The positioning results of the host vehicle may also be inaccurate, which likewise affects the accuracy of the prediction of the starting behavior of the obstacle vehicle, thereby affecting the normal driving of the host vehicle.
  • In addition, the frame rates of the multiple sensors are different, and the minimum frame rate must be used when fusing the environmental data collected by the multi-source sensors. As a result, the frame rate of the obtained perception results is low, the real-time performance is poor, and the prediction delay is large.
  • This application provides a method, device, storage medium, and program product for predicting vehicle starting behavior, which can improve the precision, accuracy, and real-time performance of vehicle starting behavior prediction; the generalization of this solution is also better.
  • The technical solutions are as follows:
  • In a first aspect, a method for predicting vehicle starting behavior is provided. The method includes:
  • acquiring multiple frames of images and multiple sets of host vehicle motion data, where the multiple sets of host vehicle motion data correspond one-to-one to the multiple frames of images, and
  • the multi-frame images are obtained by photographing the environmental information around the host vehicle;
  • detecting the target obstacle vehicle in the multi-frame images to determine multi-frame target image areas, which are the areas where the target obstacle vehicle is located in the multi-frame images; identifying the road structure in the multi-frame images and, in combination with the multi-frame target image areas, determining multiple sets of road structure data, where the multiple sets of road structure data respectively represent the structure of the road where the target obstacle vehicle is located in the multi-frame images; and
  • determining, based on the multi-frame target image areas, the multiple sets of host vehicle motion data, and the multiple sets of road structure data, the prediction result corresponding to each frame of the multi-frame images, where
  • the prediction result is used to indicate whether the target obstacle vehicle in the corresponding image has starting behavior.
  • In this way, the starting behavior of the obstacle vehicle is predicted based on image data and host vehicle motion data, without requiring data collected by multiple sensors such as lidar, millimeter-wave radar, and ultrasonic radar, thereby avoiding
  • the jitter, error, and large delay in perception results caused by fusing multi-source sensor data, and also reducing incorrect prediction results caused by inaccurate host vehicle positioning.
  • In addition, this solution does not require high-precision maps and can also be applied in scenarios where high-precision maps are unavailable and/or positioning is poor. It can be seen that the precision, accuracy, and real-time performance of this solution are higher, and its generalization is also better.
  • Optionally, the host vehicle motion data in this solution includes at least one of the vehicle speed and the yaw angular velocity of the host vehicle.
  • Optionally, the road structure data in this solution includes the position of at least one of the lane line and the road edge closest to the target obstacle vehicle.
  • In one possible implementation, determining the prediction result corresponding to each frame of the multi-frame images includes: based on the multi-frame target image areas, determining the image perception features corresponding to each frame of the multi-frame images, where the image perception features are used to characterize the motion characteristics of the target obstacle vehicle and the environmental characteristics of the environment surrounding the target obstacle vehicle; based on the multiple sets of host vehicle motion data, determining the host vehicle perception features corresponding to each frame of the multi-frame images, where the host vehicle perception features are used to characterize the motion characteristics of the host vehicle; based on the multiple sets of road structure data, determining the road perception features corresponding to each frame of the multi-frame images, where the road perception features are used to characterize the structural characteristics of the road where the target obstacle vehicle is located; and based on the image perception features, host vehicle perception features, and road perception features corresponding to each frame of the multi-frame images, determining the prediction result corresponding to each frame of the multi-frame images. It should be understood that this solution predicts whether the target obstacle vehicle has starting behavior by extracting and combining these three kinds of features.
  • In one possible implementation, determining the image perception features corresponding to each frame of the multi-frame images includes: inputting the image data of the multi-frame target image areas into a common feature extraction network to obtain the common features of the images; determining multiple sets of combined data corresponding one-to-one to the multi-frame target image areas, where each set of combined data includes the image data of one frame of target image area and the common features of the images; and inputting the multiple sets of combined data into a backbone network to obtain the image perception features corresponding to each frame of the multi-frame images.
  • That is, the common features of the images are extracted first, and these common features represent, to a certain extent, the static features of the environment surrounding the target obstacle vehicle. The common features of the images are then combined with the image data of each frame's target image area, and the dynamic features of the target obstacle vehicle and the static features of the environment are extracted through the backbone network.
  • the common feature extraction network is a multi-scale convolutional neural network.
  • Multi-scale convolution can effectively fuse features of different scales, which is more conducive to extracting more robust common features of images.
  • In one possible implementation, determining the host vehicle perception features corresponding to each frame of the multi-frame images includes: inputting the multiple sets of host vehicle motion data into a first multi-layer perceptron to obtain the host vehicle perception features corresponding to each frame of the multi-frame images.
  • In one possible implementation, determining the road perception features corresponding to each frame of the multi-frame images includes: inputting the multiple sets of road structure data into a second multi-layer perceptron to obtain the road perception features corresponding to each frame of the multi-frame images.
  • In one possible implementation, determining the prediction result corresponding to each frame of the multi-frame images includes: inputting the image perception features, host vehicle perception features, and road perception features corresponding to each frame of the multi-frame images into an inter-frame feature fusion model to obtain the fusion features corresponding to each frame of the multi-frame images; and inputting the fusion features corresponding to each frame into a third multi-layer perceptron to obtain the prediction result corresponding to each frame of the multi-frame images.
  • Through the inter-frame feature fusion model, the electronic device not only fuses the three different features corresponding to the same frame of image, but also fuses the features between frames.
  • Optionally, the inter-frame feature fusion model includes a recurrent neural network and an attention mechanism network in series. It is worth noting that the combination of the recurrent neural network and the attention mechanism network can effectively improve the calculation accuracy of long-distance dependencies among inter-frame features, that is, effectively capture the distance-dependent features between frames and effectively fuse inter-frame features, and can reduce the jitter of the prediction results, effectively improving the precision and recall of the prediction results.
  • In a second aspect, a device for predicting vehicle starting behavior is provided, which has the function of implementing the method for predicting vehicle starting behavior provided in the first aspect.
  • The device for predicting vehicle starting behavior includes one or more modules, and the one or more modules are used to implement the method for predicting vehicle starting behavior provided in the first aspect.
  • Specifically, the device for predicting vehicle starting behavior includes:
  • The acquisition module is used to acquire multiple frames of images and multiple sets of host vehicle motion data.
  • The multiple sets of host vehicle motion data correspond one-to-one to the multiple frames of images.
  • The multi-frame images are obtained by photographing the environmental information around the host vehicle.
  • the first determination module is used to detect the target obstacle vehicle in the multi-frame image to determine the multi-frame target image area, and the multi-frame target image area is the area where the target obstacle vehicle is located in the multi-frame image;
  • The second determination module is used to identify the road structure in the multi-frame images and determine multiple sets of road structure data in combination with the multi-frame target image areas.
  • The multiple sets of road structure data respectively represent the structure of the road where the target obstacle vehicle is located in the multi-frame images.
  • The third determination module is used to determine the prediction result corresponding to each frame of the multi-frame images based on the multi-frame target image areas, the multiple sets of host vehicle motion data, and the multiple sets of road structure data.
  • The prediction result is used to indicate whether the target obstacle vehicle in the corresponding image has starting behavior.
  • the third determination module includes:
  • The first determination sub-module is used to determine the image perception features corresponding to each frame of the multi-frame images based on the multi-frame target image areas.
  • The image perception features are used to characterize the motion characteristics of the target obstacle vehicle and the environmental characteristics of the environment surrounding the target obstacle vehicle.
  • the second determination sub-module is used to determine the main vehicle sensing characteristics corresponding to each frame of the multi-frame image based on the multiple sets of main vehicle motion data, and the main vehicle sensing characteristics are used to characterize the motion characteristics of the main vehicle;
  • the third determination sub-module is used to determine the road perception features corresponding to each frame of the multi-frame image based on the multiple sets of road structure data.
  • the road perception features are used to characterize the structural features of the road where the target obstacle vehicle is located;
  • the fourth determination sub-module is used to determine the prediction result corresponding to each frame of the multi-frame image based on the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame image.
  • The first determination sub-module is used to: input the image data of the multi-frame target image areas into the common feature extraction network to obtain the common features of the images; determine multiple sets of combined data corresponding one-to-one to the multi-frame target image areas, where each set of combined data includes the image data of one frame of target image area and the common features of the images; and
  • input the multiple sets of combined data into the backbone network to obtain the image perception features corresponding to each frame of the multi-frame images.
  • the second determination sub-module is used to:
  • the multiple sets of host vehicle motion data are input into the first multi-layer perceptron to obtain the host vehicle sensing features corresponding to each frame of the multiple frame images.
  • the third determination sub-module is used for:
  • the multiple sets of road structure data are input into the second multi-layer perceptron to obtain the road perception features corresponding to each frame of the multiple frame images.
  • The fourth determination sub-module is used to: input the image perception features, host vehicle perception features, and road perception features corresponding to each frame of the multi-frame images into the inter-frame feature fusion model to obtain the fusion features corresponding to each frame; and
  • input the fusion features corresponding to each frame of the multi-frame images into the third multi-layer perceptron to obtain the prediction result corresponding to each frame of the multi-frame images.
  • the inter-frame feature fusion model includes a serial recurrent neural network and an attention mechanism network.
  • the host vehicle motion data includes at least one of vehicle speed and yaw angular velocity of the host vehicle.
  • the road structure data includes at least one of the lane line and the position of the road edge closest to the target obstacle vehicle.
  • In a third aspect, an electronic device is provided. The electronic device includes a processor and a memory.
  • The memory is used to store a program for executing the method for predicting vehicle starting behavior provided in the first aspect, as well as
  • the data involved in implementing the method for predicting vehicle starting behavior provided in the first aspect.
  • The processor is configured to execute the program stored in the memory.
  • Optionally, the electronic device may further include a communication bus used to establish a connection between the processor and the memory.
  • In a fourth aspect, a computer-readable storage medium is provided.
  • A computer program is stored in the computer-readable storage medium.
  • When its instructions are executed by a processor, the method for predicting vehicle starting behavior described in the first aspect is implemented.
  • In a fifth aspect, a computer program product containing instructions is provided; when the instructions are executed by a processor, the method for predicting vehicle starting behavior described in the first aspect is implemented.
  • Predicting the starting behavior of the obstacle vehicle based on image data and host vehicle motion data does not require data collected by multiple sensors such as lidar, millimeter-wave radar, and ultrasonic radar, thereby avoiding
  • the jitter, error, and large delay in perception results caused by fusing multi-source sensor data, and also reducing incorrect prediction results caused by inaccurate host vehicle positioning.
  • In addition, this solution does not require high-precision maps and can also be applied in scenarios where high-precision maps are unavailable and/or positioning is poor. It can be seen that the precision, accuracy, and real-time performance of this solution are higher, and its generalization is also better.
  • Figure 1 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • Figure 2 is a flow chart of a method for predicting vehicle starting behavior provided by an embodiment of the present application
  • Figure 3 is a flow chart of another method for vehicle starting behavior prediction provided by an embodiment of the present application.
  • Figure 4 is a flow chart of another method for vehicle starting behavior prediction provided by the embodiment of the present application.
  • Figure 5 is a flow chart of a vehicle control method provided by an embodiment of the present application.
  • Figure 6 is a schematic structural diagram of a vehicle starting behavior prediction device provided by an embodiment of the present application.
  • Vehicle behavior prediction, supported by massive amounts of data, plays a crucial role in autonomous driving, assisted driving, and intelligent transportation systems.
  • Vehicle starting behavior prediction is an important part of vehicle behavior prediction. For example, if it is predicted that an obstacle vehicle on the side of the road is about to start and cut into the lane of the host vehicle, then based on the prediction result, the host vehicle can be controlled, or the driver reminded, to slow down, avoid, or change course, thereby reducing traffic accidents.
  • FIG. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • The electronic device may be part or all of a vehicle machine (in-vehicle unit) or a server.
  • the electronic device includes one or more processors 101, a communication bus 102, a memory 103, and one or more communication interfaces 104.
  • The processor 101 is a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, or one or more integrated circuits used to implement the solution of the present application, for example, an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
  • The above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • The communication bus 102 is used to transfer information between the above-mentioned components.
  • Optionally, the communication bus 102 is divided into an address bus, a data bus, a control bus, and the like.
  • For ease of presentation, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
  • The memory 103 is a read-only memory (ROM), a random access memory (RAM), an electrically erasable programmable read-only memory (EEPROM), an optical disc (including a compact disc read-only memory (CD-ROM), a compact disc, a laser disc, a digital versatile disc, a Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without limitation.
  • the memory 103 exists independently and is connected to the processor 101 through the communication bus 102, or the memory 103 and the processor 101 are integrated together.
  • The communication interface 104 uses any transceiver-type device to communicate with other devices or communication networks.
  • The communication interface 104 includes a wired communication interface and may optionally include a wireless communication interface.
  • The wired communication interface is, for example, an Ethernet interface.
  • The Ethernet interface is an optical interface, an electrical interface, or a combination thereof.
  • The wireless communication interface is a wireless local area network (WLAN) interface, a cellular network communication interface, or a combination thereof.
  • In some embodiments, the electronic device includes multiple processors, such as processor 101 and processor 105 shown in FIG. 1.
  • Each of these processors is a single-core processor or a multi-core processor.
  • A processor here refers to one or more devices, circuits, and/or processing cores for processing data (such as computer program instructions).
  • the electronic device also includes an output device 106 and an input device 107.
  • the output device 106 communicates with the processor 101 and can display information in a variety of ways.
  • Optionally, the output device 106 is a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, a projector, or the like.
  • the input device 107 communicates with the processor 101 and can receive user input in a variety of ways.
  • the input device 107 is a mouse, a keyboard, a touch screen device or a sensing device, or the like.
  • the memory 103 is used to store the program code 110 for executing the solution of the present application, and the processor 101 can execute the program code 110 stored in the memory 103 .
  • the program code includes one or more software modules, and the electronic device can implement the prediction method of vehicle starting behavior provided in the embodiment of FIG. 2 below through the processor 101 and the program code 110 in the memory 103 .
  • Figure 2 is a flow chart of a method for predicting vehicle starting behavior provided by an embodiment of the present application. The method is applied to an electronic device.
  • The electronic device is a device on the host vehicle, such as a vehicle machine, or the electronic device can also be a server, such as a server in a traffic management center. Please refer to Figure 2;
  • the method includes the following steps.
  • Step 201: Acquire multiple frames of images and multiple sets of host vehicle motion data.
  • the multiple sets of host vehicle motion data respectively correspond to the multiple frame images.
  • the multiple frame images are obtained by photographing environmental information around the host vehicle.
  • the electronic device acquires multiple frames of images and multiple sets of host vehicle motion data.
  • the multiple sets of host vehicle motion data are in one-to-one correspondence with the multiple frame images, that is, the multiple sets of host vehicle motion data are the host vehicle motion data corresponding to each frame of the multiple frame images.
  • the multi-frame images are obtained by capturing environmental information around the host vehicle, such as images captured by a camera installed in front of the host vehicle.
  • the motion data of the main vehicle is the data collected by the motion sensor on the main vehicle.
  • the motion sensor includes one or more of the vehicle speed sensor, angular velocity sensor, etc.
  • the main vehicle movement data can accurately represent the real-time movement of the main vehicle.
  • the host vehicle motion data in the embodiment of the present application includes at least one of the host vehicle's vehicle speed and yaw angular velocity.
  • Since the host vehicle motion data used for vehicle behavior prediction in this solution must correspond one-to-one to the images, if the frame rate of the motion sensor on the host vehicle (such as 100 Hz) is higher than the frame rate of the camera (such as 20 Hz), the original motion data collected by the motion sensor can be down-sampled through linear interpolation (that is, down-conversion processing) to obtain multiple sets of host vehicle motion data that correspond one-to-one to the multi-frame images and are consistent with them in time.
  • Conversely, if the frame rate of the motion sensor is lower than that of the camera, the original motion data collected by the motion sensor can be up-sampled through linear interpolation to obtain multiple sets of host vehicle motion data that correspond one-to-one to the multi-frame images and are consistent with them in time.
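  • As a minimal sketch of this resampling step (assuming NumPy; the function and array names are illustrative, not from the original), per-frame host vehicle motion data could be obtained as follows:

    import numpy as np

    # Resample ego-motion samples (speed, yaw rate) onto camera frame
    # timestamps by linear interpolation, so that each image frame has
    # exactly one set of host vehicle motion data. Works for both the
    # down-sampling (100 Hz -> 20 Hz) and up-sampling cases above.
    def resample_motion(motion_t, motion_vals, image_t):
        """motion_t: (M,) timestamps; motion_vals: (M, 2) [speed, yaw_rate];
        image_t: (N,) camera timestamps. Returns an (N, 2) array."""
        speed = np.interp(image_t, motion_t, motion_vals[:, 0])
        yaw_rate = np.interp(image_t, motion_t, motion_vals[:, 1])
        return np.stack([speed, yaw_rate], axis=1)  # dimensions [N, 2]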
  • the frame rate of motion sensors and cameras on current vehicles is generally higher than the fusion frame rate of multi-source sensors.
  • the fusion frame rate is usually consistent with the minimum frame rate of multi-source sensors. Therefore, this solution The delay is smaller and the real-time performance is higher.
  • Assume the number of the multi-frame images is N.
  • At the moment the electronic device acquires the i-th frame image, it processes the (i-N+1)-th to i-th frame images; at the moment it acquires the (i+1)-th frame image, it processes the (i-N+2)-th to (i+1)-th frame images.
  • Here, i is not less than N.
  • Optionally, N can be 16, 10, or another value.
  • In the following, N is 16 as an example.
  • That is, after acquiring the 16th frame image, the electronic device processes the 1st to 16th frame images; after acquiring the 17th frame image, it processes the 2nd to 17th frame images; and after acquiring the 18th frame image, it processes the 3rd to 18th frame images.
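  • A minimal sketch of this sliding-window scheme (the buffer and callback names are illustrative assumptions):

    from collections import deque

    N = 16  # window length, as in the example above
    window = deque(maxlen=N)  # the oldest frame is dropped automatically

    def on_new_frame(image, process_window):
        """Append each newly acquired frame; once N frames are buffered,
        process the newest window (frames i-N+1 ... i)."""
        window.append(image)
        if len(window) == N:
            process_window(list(window))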
  • In time sequence, the original image sequence formed by the N frames of images is (img_orig_{t_N-1}, ..., img_orig_{t_0}), with dimensions [N, 3, w_orig, h_orig], where t represents the time at which the multi-frame images are acquired, N represents the N frames of images in the time dimension, 3 represents the three RGB channels in the channel dimension, and w_orig and h_orig represent the width and height of each frame of image respectively.
  • The dimensions of the host vehicle data sequence composed of the N sets of host vehicle motion data are [N, 2], where N represents the N sets of host vehicle motion data in the time dimension and 2 represents that each set of host vehicle motion data includes two values.
  • Step 202: Detect the target obstacle vehicle in the multi-frame images to determine the multi-frame target image areas.
  • the multi-frame target image area is the area where the target obstacle vehicle is located in the multi-frame image.
  • the electronic device detects the target obstacle vehicle in the multi-frame image to determine the multi-frame target image area.
  • In one implementation, the electronic device inputs the multi-frame images into a target detection network to determine the area where the target obstacle vehicle is located in each image.
  • Optionally, a rectangular frame is used to select the area where the target obstacle vehicle is located in the image, and the image is cropped based on the rectangular frame to obtain the target image area.
  • In one implementation, the electronic device crops the target image area out of the image according to an external expansion ratio.
  • The expansion ratio is greater than or equal to 1.
  • Optionally, the expansion ratio is 1.5.
  • That is, the rectangular frame is enlarged according to the expansion ratio, and the image area inside the enlarged rectangular frame is cropped as the target image area.
  • The center point of the rectangular frame after expansion is the same as the center point of the rectangular frame before expansion.
  • In another implementation, the electronic device directly crops the target image area out of the image according to the original rectangular frame.
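  • The expansion-and-crop step could look like the following sketch (assuming NumPy image arrays; the bounding-box layout and names are illustrative):

    import numpy as np

    def crop_expanded(img_orig, bbox, ratio=1.5):
        """img_orig: (H, W, 3) array; bbox: (x1, y1, x2, y2) rectangular
        frame from the target detection network; ratio >= 1."""
        x1, y1, x2, y2 = bbox
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0    # center point is kept
        w, h = (x2 - x1) * ratio, (y2 - y1) * ratio  # enlarged frame size
        H, W = img_orig.shape[:2]
        nx1, ny1 = max(int(cx - w / 2), 0), max(int(cy - h / 2), 0)
        nx2, ny2 = min(int(cx + w / 2), W), min(int(cy + h / 2), H)
        return img_orig[ny1:ny2, nx1:nx2]  # the target image area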
  • In some cases, the target obstacle vehicle appears continuously in the i-th to j-th frames of the multi-frame images.
  • If j-i+1 is not less than a specified threshold, the subsequent steps are performed.
  • If j-i+1 is less than the specified threshold, the subsequent steps are not performed, that is, behavior prediction is not performed for the target obstacle vehicle.
  • For example, assuming the number of multi-frame images is 16 and the specified threshold is 8: if the target obstacle vehicle is present in 8 of these 16 frames of images, the subsequent steps continue; if the target obstacle vehicle is present in only 5 of these 16 frames of images, behavior prediction is not performed for the target obstacle vehicle.
  • In addition, when the target obstacle vehicle is present only in the i-th to j-th frames,
  • the target image area corresponding to the i-th frame image is used as the target image area corresponding to the images before the i-th frame, and/or
  • the target image area corresponding to the j-th frame image is used as the target image area corresponding to the images after the j-th frame.
  • Here, i is not less than 1 and j is less than N, or i is greater than 1 and j is not greater than N, where N is the total number of the multi-frame images.
  • For example, if the target obstacle vehicle is present only in the 1st to 14th frames, the target image area corresponding to the 14th frame image is used as the target image area corresponding to the 15th and 16th frame images.
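  • A minimal sketch of the presence check and padding described above (the dict-based bookkeeping is an illustrative assumption):

    def pad_target_areas(areas, i, j, N, threshold=8):
        """areas: dict mapping frame index -> target image area for frames
        i..j (0-based) in which the vehicle was detected. Returns a
        length-N list, or None when presence is below the threshold."""
        if j - i + 1 < threshold:
            return None  # skip behavior prediction for this vehicle
        padded = []
        for k in range(N):
            if k < i:
                padded.append(areas[i])  # frames before i reuse frame i
            elif k > j:
                padded.append(areas[j])  # frames after j reuse frame j
            else:
                padded.append(areas[k])
        return padded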
  • When there are multiple obstacle vehicles, the electronic device detects each of the multiple obstacle vehicles, regards each of them as a target obstacle vehicle, and performs starting behavior detection for each target obstacle vehicle separately.
  • It should be noted that steps 201 to 204 describe the starting behavior detection for one target obstacle vehicle. For example, assuming that the number of images in which obstacle vehicle A is present in the 1st to 16th frame images is greater than the specified threshold, steps 201 to 204 are performed for obstacle vehicle A on the 1st to 16th frame images to predict the starting behavior of obstacle vehicle A.
  • Likewise, steps 201 to 204 are performed for obstacle vehicle B on the 3rd to 18th frame images to predict the starting behavior of obstacle vehicle B.
  • Similarly, steps 201 to 204 are also performed for obstacle vehicle C on the 1st to 16th frame images to predict the starting behavior of obstacle vehicle C.
  • Assuming the number of the multi-frame images is N, the target image sequence formed by the N frames of target image areas is (img_{t_N-1}, ..., img_{t_0}), with dimensions [N, 3, w, h], where t represents the moment the multi-frame images are acquired, N represents the N frames of target image areas in the time dimension, 3 represents the three RGB channels in the channel dimension, and w and h represent the width and height of each frame of target image area respectively.
  • Step 203: Identify the road structure in the multi-frame images and, in combination with the multi-frame target image areas, determine multiple sets of road structure data.
  • The multiple sets of road structure data respectively represent the structure of the road where the target obstacle vehicle is located in the multi-frame images.
  • The electronic device identifies the road structure in the multi-frame images and, in combination with the multi-frame target image areas, determines the road structure data of the road where the target obstacle vehicle is located in the multi-frame images.
  • Specifically, when identifying the road structure in the multi-frame images, the electronic device can identify every road element in the multi-frame images, such as each lane line and each road edge. Based on the area where the detected target obstacle vehicle is located, the electronic device determines, from all the identified roads in the multi-frame images, the road where the target obstacle vehicle is located in each frame image, and obtains the road structure data of that road in the multi-frame images.
  • the road structure data includes at least one of the lane line and the position of the road edge closest to the target obstacle vehicle.
  • In one implementation, taking the lane line as an example, each set of road structure data includes the coordinates of multiple two-dimensional discrete points of the corresponding lane line.
  • Optionally, the number of two-dimensional discrete point coordinates is 30; that is, each set of road structure data includes 30 two-dimensional coordinates.
  • If the length of the lane line closest to the target obstacle vehicle identified in one frame of image is not less than a preset length (such as 100 meters), the lane line can be represented by 30 two-dimensional coordinates.
  • If the identified lane line is shorter than the preset length, the number of two-dimensional discrete points used to represent it may be less than 30. In this case, the coordinates of the identified two-dimensional discrete point on the lane line farthest from the host vehicle can be repeated to pad out the 30 two-dimensional coordinates. Alternatively, the lane line can be extended by curve fitting to complete the 30 two-dimensional coordinates.
  • Assuming each set of road structure data includes 30 two-dimensional coordinates representing one lane line, the dimensions of the road data sequence formed by the N sets of road structure data are [N, 30, 2], where N indicates N sets of road structure data in the time dimension, 30 indicates that each set of road structure data includes 30 coordinates, and 2 indicates that each coordinate includes two values, that is, each coordinate is two-dimensional.
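  • A minimal sketch of fixing one lane line to 30 two-dimensional points by repeating the farthest identified point (assuming NumPy; names are illustrative):

    import numpy as np

    NUM_PTS = 30  # number of 2-D coordinates per set of road structure data

    def lane_to_fixed_points(points):
        """points: (K, 2) array of identified lane-line points, ordered
        from nearest to farthest from the host vehicle. Returns (30, 2)."""
        pts = np.asarray(points, dtype=np.float32)
        if len(pts) >= NUM_PTS:
            return pts[:NUM_PTS]
        pad = np.repeat(pts[-1:], NUM_PTS - len(pts), axis=0)  # farthest point
        return np.concatenate([pts, pad], axis=0)  # dimensions [30, 2]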
  • Step 204: Based on the multi-frame target image areas, the multiple sets of host vehicle motion data, and the multiple sets of road structure data, determine the prediction result corresponding to each frame of the multi-frame images. The prediction result is used to indicate whether the target obstacle vehicle in the corresponding image has starting behavior.
  • The electronic device determines the prediction result corresponding to each frame of the multi-frame images based on the multi-frame target image areas, the multiple sets of host vehicle motion data, and the multiple sets of road structure data.
  • One implementation is: based on the multi-frame target image areas, determine the image perception features corresponding to each frame of the multi-frame images; based on the multiple sets of
  • host vehicle motion data, determine the host vehicle perception features corresponding to each frame of the multi-frame images; based on the multiple sets of road structure data, determine the road perception features corresponding to each frame of the multi-frame images; and based on
  • the image perception features, host vehicle perception features, and road perception features corresponding to each frame, determine the prediction result corresponding to each frame of the multi-frame images. Simply put, the electronic device extracts features from the images, the host vehicle motion data, and the road structure data, and combines these three kinds of features to predict whether the obstacle vehicle has starting behavior.
  • this image perception feature is used to characterize the motion characteristics of the target obstacle vehicle and the environmental characteristics of the environment surrounding the target obstacle vehicle.
  • the environmental features are generally static features.
  • the image perception features represent the dynamic features of the target obstacle vehicle and the static features of the environment.
  • the main vehicle perception features are used to characterize the movement characteristics of the host vehicle, and the road perception features are used to characterize the structural characteristics of the road where the target obstacle vehicle is located.
  • One implementation by which the electronic device determines the image perception features corresponding to each frame of the multi-frame images based on the multi-frame target image areas is: input the image data of the multi-frame target image areas into a common feature extraction network to obtain the common features of the images; determine multiple sets of combined data corresponding one-to-one to the multi-frame target image areas, where each set of combined data includes the image data of one frame of target image area and the common features of the images; and input the multiple sets of combined data into the backbone network to obtain the image perception features corresponding to each frame of the multi-frame images.
  • the electronic device first extracts the common features of the image, and the common features of the image represent the static features of the environment surrounding the target obstacle vehicle to a certain extent.
  • the electronic device then combines the common features of the image with the image data of the target image area in each frame, and extracts the dynamic features of the target obstacle vehicle and the static features of the environment through the backbone network.
  • Both the common features of the images and the image data of the multi-frame target image areas have a channel dimension.
  • In one implementation, the electronic device splices the common features of the images with the image data of each frame of target image area in the channel dimension to obtain the corresponding set of combined data.
  • By splicing in the channel dimension in this way, it is easier for the backbone network to extract clearly distinguishable static features and motion features.
  • the dimensions of the image data of each frame of the target image area in the multi-frame target image area are [c1, w, h], c1 represents the number of channels of the image data, w and h represent the width and height of the target image area respectively.
  • c1 is equal to 3, indicating RGB three channels.
  • the dimensions of the common features of the image are [c2, w, h], c2 represents the number of channels of the common features of the image, the height and width of the common features of the image are the same as the height and width of the target image area, optionally, c2 is equal to 3 or other numerical values.
  • The dimensions of each set of combined data obtained by the electronic device through splicing in the channel dimension are therefore [c1+c2, w, h].
  • In one implementation, the common feature extraction (CFE) network is a multi-scale convolutional neural network; that is, the convolution layers of the CFE network use convolution kernels of multiple scales.
  • For example, the CFE network shown in Figure 4 includes 4 convolutional layers, and the first three convolutional layers each use 3 convolution kernels of different scales, with sizes 1×1, 3×3, and 5×5 respectively. Multi-scale convolution can effectively fuse features of different scales, which is more conducive to extracting more robust common features of the images.
  • Of course, the common feature extraction network can also be another type of neural network, and the CFE network shown in Figure 4 is not intended to limit the embodiments of the present application.
  • Next, the CFE network shown in Figure 4 is introduced in detail.
  • Assume that the target image sequence formed by the N frames of target image areas determined from the images collected at time t is, in time order, (img_{t_N-1}, ..., img_{t_0}), with dimensions [N, 3, w, h], where N represents the N frames of target image areas in the time dimension, 3 represents the three RGB channels in the channel dimension, and w and h represent the width and height of each frame of target image area respectively.
  • The intermediate feature is ComFea1 = mean(conv1×1(img_{t_N-1}, ..., img_{t_0}), conv3×3(img_{t_N-1}, ..., img_{t_0}), conv5×5(img_{t_N-1}, ..., img_{t_0})), with dimensions [32, 3, w, h].
  • Here, mean() represents the averaging operation, denoted M in Figure 4.
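  • A minimal PyTorch sketch of one such multi-scale convolutional layer (the three kernel sizes and the averaging, denoted M in Figure 4, follow the description above; channel counts are illustrative assumptions):

    import torch.nn as nn

    class MultiScaleConv(nn.Module):
        """One CFE layer: parallel 1x1 / 3x3 / 5x5 convolutions whose
        outputs are averaged, fusing features of different scales."""
        def __init__(self, in_ch=3, out_ch=3):
            super().__init__()
            # padding keeps all branches at the same spatial size
            self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
            self.conv3 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
            self.conv5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)

        def forward(self, x):  # x: [N, in_ch, w, h] target image sequence
            return (self.conv1(x) + self.conv3(x) + self.conv5(x)) / 3.0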
  • The common feature sequence R_ComFea_t and the target image sequence (img_{t_N-1}, ..., img_{t_0}) are spliced in the channel dimension to obtain multiple sets of combined data, and the resulting sequence is input into the backbone network.
  • Optionally, the backbone network is a convolutional neural network.
  • For example, the backbone network may use ResNet (such as ResNet50).
  • The embodiments of this application do not limit the network structure of the backbone network.
  • In one implementation, the backbone network includes multiple CNNs (as shown in Figure 4), and the multiple CNNs correspond one-to-one to the multiple sets of combined data.
  • The electronic device inputs the multiple sets of combined data into the multiple CNNs respectively to obtain the image perception features corresponding to each frame of the multi-frame images.
  • Optionally, the network structures and network parameters of the multiple CNNs are the same; in some other embodiments, they may be different.
  • One implementation by which the electronic device determines the host vehicle perception features corresponding to each frame of the multi-frame images based on the multiple sets of host vehicle motion data is: input the multiple sets of host vehicle motion data into the first
  • multi-layer perceptron to obtain the host vehicle perception features corresponding to each frame of the multi-frame images.
  • In one implementation, the multiple sets of host vehicle motion data are respectively input into multiple first multi-layer perceptrons (MLPs).
  • Optionally, the network structures and network parameters of the multiple first MLPs are the same; in some other embodiments, the network structures and network parameters of the multiple first MLPs may be different.
  • the multiple sets of main vehicle motion data are input into the same first MLP. For example, every time a set of main vehicle motion data is determined, the currently determined set of main vehicle motion data is input into the first MLP to ensure real-time performance.
  • the first MLP includes one or more hidden layers. In this embodiment of the present application, the first MLP includes two hidden layers.
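  • A minimal PyTorch sketch of such a first MLP with two hidden layers (the hidden and output widths are illustrative assumptions):

    import torch.nn as nn

    def make_ego_mlp(in_dim=2, hidden=64, out_dim=32):
        """Encodes one set of host vehicle motion data [speed, yaw_rate]
        into a per-frame host vehicle perception feature of size out_dim."""
        return nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),  # hidden layer 1
            nn.Linear(hidden, hidden), nn.ReLU(),  # hidden layer 2
            nn.Linear(hidden, out_dim),
        )

    # Usage: make_ego_mlp()(ego)  # ego: [N, 2] -> features: [N, 32]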
  • One implementation by which the electronic device determines the road perception features corresponding to each frame of the multi-frame images based on the multiple sets of road structure data is: input the multiple sets of road structure data into the second multi-layer perceptron to obtain the road perception features corresponding to each frame of the multi-frame images.
  • In one implementation, the multiple sets of road structure data are respectively input into multiple second MLPs.
  • Optionally, the network structures and network parameters of the multiple second MLPs are the same; in some other embodiments,
  • the network structures and network parameters of the multiple second MLPs may be different.
  • the multiple sets of road structure data are all input into the same second MLP. For example, every time a set of road structure data is determined, the currently determined set of road structure data is input into the second MLP to ensure real-time performance.
  • the second MLP includes one or more hidden layers. In this embodiment of the present application, the second MLP includes two hidden layers.
  • In the above introduction, the host vehicle motion data and road structure data are processed by MLPs.
  • The MLPs used here are equivalent to feature extraction models or encoding models, which
  • perform feature extraction on, or encode, the host vehicle motion data and road structure data.
  • the host vehicle motion data and/or road structure data can also be processed using other neural networks.
  • One implementation by which the electronic device determines the prediction result corresponding to each frame of the multi-frame images based on the image perception features, host vehicle perception features, and road perception features corresponding to each frame is:
  • input the image perception features, host vehicle perception features, and road perception features corresponding to each frame of the multi-frame images into the inter-frame feature fusion model to obtain the fusion features corresponding to each frame of the multi-frame images; and
  • input the fusion features corresponding to each frame of the multi-frame images into the third multi-layer perceptron to obtain the prediction result corresponding to each frame of the multi-frame images. It should be understood that through the inter-frame feature fusion model, the electronic device not only fuses the three different features corresponding to the same frame of image, but also fuses the features between frames.
  • the electronic device splices the image perception features, the host vehicle perception features and the road perception features corresponding to each frame image in the multi-frame images to obtain the combined perception features corresponding to the corresponding images.
  • the electronic device inputs the combined perceptual features corresponding to each frame of the multi-frame image into the inter-frame feature fusion model to obtain the fusion features corresponding to each frame of the multi-frame image.
  • Assuming the dimensions of the image perception features corresponding to the multi-frame images are [N, c1], the dimensions of the host vehicle perception features are [N, c2], and the dimensions of the road perception features are [N, c3], then the dimensions of the combined perception features corresponding to the multi-frame images are [N, c1+c2+c3].
  • Here, N represents the N frames of images in the time dimension, c1 represents the number of elements included in each image perception feature, c2 represents the number of elements included in each host vehicle perception feature, and c3 represents the number of elements included in each road perception feature.
  • In one implementation, the inter-frame feature fusion model includes a recurrent neural network (RNN) and an attention mechanism network connected in series.
  • The RNN can be a long short-term memory (LSTM) network, a gated recurrent unit (GRU), or the like.
  • In this embodiment of the present application, the RNN uses a two-layer bidirectional LSTM network.
  • the attention mechanism network can be a self-attention (SA) network or a multi-head attention (MHA) network, etc.
  • It is worth noting that the combination of the recurrent neural network and the attention mechanism network can effectively improve the calculation accuracy of long-distance dependencies among inter-frame features, that is, effectively capture the distance-dependent features between frames and effectively fuse inter-frame features, and can reduce
  • the jitter of the prediction results, effectively improving the precision and recall of the prediction results.
  • The network structure of the inter-frame feature fusion model introduced above is not intended to limit the embodiments of the present application.
  • For example, the inter-frame feature fusion model may also include only a recurrent neural network, without an attention mechanism network.
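  • A minimal PyTorch sketch of an inter-frame feature fusion model of this kind, with a two-layer bidirectional LSTM in series with multi-head self-attention (the feature widths are illustrative assumptions):

    import torch.nn as nn

    class InterFrameFusion(nn.Module):
        def __init__(self, in_dim=96, hidden=64, heads=4):
            super().__init__()
            self.rnn = nn.LSTM(in_dim, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
            self.attn = nn.MultiheadAttention(embed_dim=2 * hidden,
                                              num_heads=heads,
                                              batch_first=True)

        def forward(self, x):  # x: [B, N, c1+c2+c3] combined features
            h, _ = self.rnn(x)  # [B, N, 2*hidden] temporal features
            fused, _ = self.attn(h, h, h)  # self-attention across the N frames
            return fused  # per-frame fusion features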
  • In one implementation, the fusion features corresponding to each frame of the multi-frame images are respectively input into multiple third MLPs.
  • Optionally, the network structures and network parameters of the multiple third MLPs are the same; in some other embodiments,
  • the network structures and network parameters of the multiple third MLPs may be different.
  • the fusion features corresponding to each frame of the multi-frame images are input into the same third MLP.
  • the third MLP includes one or more hidden layers.
  • the third MLP includes two hidden layers.
  • the fusion features corresponding to each frame of the multi-frame image are processed using the third MLP to obtain the prediction result.
  • the third MLP is equivalent to a classification model, and the prediction results are divided into two categories, one is the presence of starting behavior, and the other is the absence of starting behavior.
  • the fusion features corresponding to each frame of the multi-frame image can also be processed using other neural networks to obtain prediction results.
  • FIG. 3 is a flow chart of another method for predicting vehicle starting behavior provided by an embodiment of the present application.
  • Assume the number of multi-frame images processed by the electronic device each time is N.
  • The N frames of images acquired by the electronic device at time t are marked in time order as t_0, ..., t_N-2, t_N-1 respectively.
  • The electronic device performs target detection on these N frames of images to determine the area where the target obstacle vehicle is located in the multi-frame images (the area selected by the black rectangular frame in Figure 3).
  • the electronic device also performs road structure recognition on the N frames of images to determine the road structure data of the road where the target obstacle vehicle is located in the N frames of images, that is, N sets of road structure data are obtained.
  • the electronic device also obtains N sets of main vehicle motion data corresponding to the N frames of images.
  • The electronic device crops out the area where the target obstacle vehicle is located in the multi-frame images according to the expansion ratio to obtain N frames of target image areas, and inputs the image data of the N frames of target image areas into the CFE network to extract the common features of the images.
  • the electronic device concatenates (concat, C) the common features of the image with the target image areas of each frame in the N frames of target image areas in the channel dimension to obtain N sets of combined data.
  • the electronic device inputs the N sets of combined data into a backbone network to obtain the image perception characteristics corresponding to each frame of the N frames of images.
  • the electronic device processes the N sets of host vehicle motion data respectively through MLP to obtain the host vehicle perception characteristics corresponding to each frame of the N frames of images.
  • The electronic device also processes the N sets of road structure data respectively through MLPs to obtain the road perception features corresponding to each frame of the N frames of images. Then, the electronic device splices the image perception features, the host vehicle perception features, and the road perception features corresponding to the N frames of images to obtain N sets of combined perception features corresponding to the N frames of images.
  • The electronic device inputs these N sets of combined perception features into the inter-frame feature fusion model (including the serial RNN and attention mechanism network) to obtain the fusion features corresponding to each frame of the N frames of images. Finally, the electronic device processes the fusion features corresponding to the N frames of images through MLP to obtain N prediction results corresponding to the N frames of images.
  • Figure 4 shows the specific structures of the CFE network, backbone network, recurrent neural network and attention mechanism network. The network structures of these networks have been introduced previously and will not be repeated here.
  • The image img_orig is cropped according to formula (1) to obtain the target image area img.
  • The Crop() function in formula (1) expands the target bounding box (BBOX) according to the expansion ratio ratio and then crops the image img_orig accordingly.
  • img = Crop(img_orig, ratio)   (1)
  • In time sequence, the target image sequence obtained after cropping the multi-frame images according to step 2 is (img_{t_N-1}, ..., img_{t_0}), with dimensions [N, 3, w, h].
  • The electronic device inputs the target image sequence into the CFE network and, according to formula (2), obtains the common feature sequence R_ComFea_t after N copies, with dimensions also [N, 3, w, h].
  • R_ComFea_t = CFE(img_{t_N-1}, ..., img_{t_0})   (2)
  • The electronic device splices the target image sequence and the common feature sequence in the channel dimension to obtain a combined data sequence containing multiple sets of combined data, with dimensions [N, 6, w, h], and inputs the combined data sequence into the backbone network (CNN) to perform feature extraction according to formula (3), obtaining the image perception features SFea_t corresponding to the multi-frame images, with dimensions [N, c1].
  • Here, concat() represents the splicing or concatenation operation.
  • SFea_t = CNN(concat((img_{t_N-1}, ..., img_{t_0}), R_ComFea_t))   (3)
  • in addition, the electronic device also obtains multiple sets of host vehicle motion data.
  • taking the host vehicle motion data denoted as Ego as an example, the host vehicle data sequence formed by the multiple sets of host vehicle motion data is (Ego_t_N-1, ..., Ego_t_0), with dimension [N, c2_in].
  • c2_in represents the number of elements included in each set of host vehicle motion data; for example, c2_in = 2 indicates that each set of host vehicle motion data includes the two elements of vehicle speed and yaw angular velocity.
  • the electronic device also recognizes the road structure of the multi-frame images to obtain multiple sets of road structure data.
  • taking the road structure data denoted as Lane as an example, and assuming that the road structure data includes the two-dimensional coordinates of the lane line positions, the road data sequence formed by the multiple sets of road structure data is (Lane_t_N-1, ..., Lane_t_0), with dimension [N, c3_in, 2], where c3_in indicates the number of two-dimensional coordinates included in each set of road structure data, and 2 indicates that each two-dimensional coordinate includes two coordinate values.
  • the electronic device processes the multiple sets of host vehicle motion data and the multiple sets of road structure data separately through MLPs.
  • the dimension of the host vehicle perception features corresponding to the N frames of images obtained after MLP processing of the multiple sets of host vehicle motion data is [N, c2], and the dimension of the road perception features corresponding to the N frames of images obtained after MLP processing of the multiple sets of road structure data is [N, c3].
  • the electronic device concatenates the image perception features, the host vehicle perception features and the road perception features corresponding to each frame of the multi-frame images to obtain the combined perception features C_SFea_t corresponding to the respective image.
  • the combined perception features corresponding to the multi-frame images have dimension [N, c1+c2+c3]. It should be understood that in step 6 the electronic device obtains the combined perception features C_SFea_t corresponding to the multi-frame images according to formula (4).
  • C_SFea_t = Concat(SFea_t, MLP(Ego_t_N-1, ..., Ego_t_0), MLP(Lane_t_N-1, ..., Lane_t_0))    (4)
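A hedged PyTorch sketch of formula (4) follows; the hidden sizes, the flattening of the lane points and the two-layer MLP layouts are illustrative assumptions.

```python
import torch
import torch.nn as nn

N, c2_in, c3_in = 16, 2, 30                  # frames, motion elements, lane points per frame
c1, c2, c3 = 256, 32, 64                     # feature widths (assumed)

ego = torch.randn(N, c2_in)                   # speed + yaw rate per frame
lane = torch.randn(N, c3_in, 2)               # 30 two-dimensional lane-line points per frame

mlp_ego = nn.Sequential(nn.Linear(c2_in, 64), nn.ReLU(), nn.Linear(64, c2))
mlp_lane = nn.Sequential(nn.Flatten(), nn.Linear(c3_in * 2, 64), nn.ReLU(), nn.Linear(64, c3))

sfea = torch.randn(N, c1)                     # image perception features from formula (3)
c_sfea = torch.cat([sfea, mlp_ego(ego), mlp_lane(lane)], dim=1)  # formula (4): [N, c1+c2+c3]
```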
  • the electronic device inputs the combined perception features C_SFea_t corresponding to the multi-frame images into the inter-frame feature fusion model to obtain, according to formula (5), the fusion features STARFea_t corresponding to the multi-frame images, with dimension [N, c4].
  • the inter-frame feature fusion model includes an RNN and an attention mechanism network (ATTENTION).
  • STARFea_t = ATTENTION(RNN(C_SFea_t))     (5)
  • the electronic device processes the fusion features STARFea_t corresponding to the multi-frame images through an MLP according to formula (6) to obtain the prediction results Out_t corresponding to the multi-frame images.
  • the dimension is [N, 2], where 2 represents the two possible values of the prediction result, which respectively indicate that starting behavior exists (e.g., CUTIN) and that no starting behavior exists (e.g., No-CUTIN).
  • the CUTIN behavior includes the behavior of the obstacle vehicle starting and cutting into the lane where the host vehicle is located.
  • Out_t = MLP(STARFea_t)    (6)
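The following PyTorch sketch walks through formulas (5) and (6) under the two-layer bidirectional LSTM plus attention structure described for Figure 4; the specific widths, the head count and treating the N frames as a single batch-of-one sequence are assumptions for illustration, not the patent's exact configuration.

```python
import torch
import torch.nn as nn

N, d_in, c4 = 16, 352, 128                    # 352 = c1+c2+c3 from formula (4) (assumed)
c_sfea = torch.randn(N, d_in)                 # combined perception features

# Inter-frame fusion: two-layer bidirectional LSTM followed by self-attention,
# matching the serial RNN + attention structure described for Figure 4.
lstm = nn.LSTM(d_in, c4 // 2, num_layers=2, bidirectional=True, batch_first=True)
attn = nn.MultiheadAttention(embed_dim=c4, num_heads=4, batch_first=True)
head = nn.Sequential(nn.Linear(c4, 64), nn.ReLU(), nn.Linear(64, 2))  # third MLP

x = c_sfea.unsqueeze(0)                       # [1, N, d_in]: one sequence of N frames
h, _ = lstm(x)                                # [1, N, c4] (c4 = 2 * hidden per direction)
star_fea, _ = attn(h, h, h)                   # formula (5): fused features [1, N, c4]
out = head(star_fea).squeeze(0)               # formula (6): per-frame logits [N, 2]
probs = out.softmax(dim=-1)                   # CUTIN vs. No-CUTIN per frame
```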
  • each network model used in the embodiments of the present application has been trained; the embodiments of the present application do not restrict the training methods of these network models.
  • batch processing is used to train these network models, that is, each adjustment of the network parameters is based on multiple sets of sample image sequences.
  • FIG. 5 is a flow chart of a vehicle control method in an autonomous driving or assisted driving scenario provided by an embodiment of the present application.
  • during autonomous driving of the host vehicle, multiple frames of images and the original motion data of the host vehicle are acquired, and the multi-frame images and the original motion data are processed by the perception module to obtain multi-frame target image areas, multiple sets of host vehicle motion data and multiple sets of road structure data.
  • the perception module sends the multi-frame target image areas, the multiple sets of host vehicle motion data and the multiple sets of road structure data to the prediction module.
  • the prediction module is used to predict vehicle starting behavior.
  • the prediction module determines the prediction results corresponding to each frame of the multi-frame images based on the multi-frame target image areas, the multiple sets of host vehicle motion data and the multiple sets of road structure data.
  • the prediction module sends the prediction results corresponding to each frame of the multi-frame images to the planning module.
  • the planning module determines the driving trajectory, vehicle speed, etc. of the host vehicle based on the prediction results corresponding to each frame of image.
  • the control module controls the motion of the host vehicle according to the driving trajectory, vehicle speed, etc. planned by the planning module; a rough sketch of this module flow is given below.
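As a rough illustration of the Figure 5 flow, hypothetical glue code between the four modules might look as follows; the module interfaces and names are invented for this sketch and do not come from the patent.

```python
def drive_step(frames, raw_motion, perception, prediction, planning, control):
    """One control cycle: perceive, predict starting behavior, plan, actuate."""
    areas, ego, lanes = perception(frames, raw_motion)  # crops, motion data, road data
    results = prediction(areas, ego, lanes)             # one start/no-start result per frame
    trajectory, speed = planning(results)               # plan around a starting obstacle
    control(trajectory, speed)                          # actuate the host vehicle
```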
  • in the fields of autonomous driving and assisted driving, the camera on the host vehicle captures images of the environment around the host vehicle, such as images of the environment in front of the host vehicle, and the vehicle speed sensor on the host vehicle collects host vehicle motion data such as vehicle speed and yaw angular velocity, so that the host vehicle predicts whether the obstacle vehicle exhibits starting behavior based on the image data and the host vehicle motion data, according to the vehicle starting behavior prediction method provided by the embodiments of the present application.
  • this solution can also be applied to intelligent transportation systems.
  • in an intelligent transportation system, roadside equipment captures images of the surrounding environment and sends them to the server of the traffic management center, and vehicles on the road can also report motion data.
  • the server uses the vehicle starting behavior prediction method provided by the embodiments of the present application to predict whether an obstacle vehicle exhibits starting behavior. For example, the server can obtain multiple frames of images captured by a roadside device within a period of time, as well as the motion data reported by vehicles that will pass through the road where the roadside device is located within that period, and then predict the starting behavior of obstacle vehicles according to this solution.
  • when the server predicts that the target obstacle vehicle exhibits starting behavior, it can broadcast to the vehicles that reported motion data to remind them that an obstacle vehicle nearby is about to start. In some embodiments, if the server cannot match or associate the multi-frame images captured by the roadside device with the motion data reported by the vehicles, the server can set the host vehicle motion data participating in the computation to 0, or to the average speed of the vehicles traveling on the road, etc., in order to implement this solution.
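A small sketch of the fallback just described, with hypothetical names and a simple dictionary interface assumed for illustration:

```python
def host_motion_fallback(matched_motion, road_speeds):
    """If roadside images cannot be associated with reported motion data,
    substitute zeros or the mean speed of vehicles on the road."""
    if matched_motion is not None:
        return matched_motion
    if road_speeds:                      # use the average road speed when available
        mean_v = sum(road_speeds) / len(road_speeds)
        return {"speed": mean_v, "yaw_rate": 0.0}
    return {"speed": 0.0, "yaw_rate": 0.0}
```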
  • in this solution, the starting behavior of the obstacle vehicle is predicted based on the image data and the host vehicle motion data, without requiring data collected by multiple sensors such as lidar, millimeter-wave radar and ultrasonic radar, thereby reducing the jitter, error and large-latency problems in the perception results caused by fusing multi-source sensors, and also reducing erroneous prediction results caused by inaccurate host vehicle positioning.
  • in addition, this solution does not require high-definition maps and can also be applied in scenarios where no high-definition map is available and/or positioning is poor. It can be seen that the prediction precision, accuracy and real-time performance of this solution are higher, and its generalization is also better.
  • FIG. 6 is a schematic structural diagram of a vehicle starting behavior prediction device 600 provided by an embodiment of the present application.
  • the vehicle starting behavior prediction device 600 can be implemented as part or all of an electronic device by software, hardware, or a combination of the two.
  • the electronic device may be any electronic device in the above embodiments.
  • the device 600 includes: an acquisition module 601 , a first determination module 602 , a second determination module 603 and a third determination module 604 .
  • the acquisition module 601 is used to acquire multiple frames of images and multiple sets of host vehicle motion data.
  • the multiple sets of host vehicle motion data respectively correspond to the multi-frame images.
  • the multi-frame images are obtained by photographing the environmental information around the host vehicle;
  • the first determination module 602 is used to detect the target obstacle vehicle in the multi-frame images to determine multi-frame target image areas, which are the areas where the target obstacle vehicle is located in the multi-frame images;
  • the second determination module 603 is used to recognize the road structure in the multi-frame images and determine multiple sets of road structure data in combination with the multi-frame target image areas.
  • the multiple sets of road structure data respectively represent the road structure of the road where the target obstacle vehicle is located in the multi-frame images.
  • the third determination module 604 is used to determine the prediction result corresponding to each frame of the multi-frame images based on the multi-frame target image areas, the multiple sets of host vehicle motion data and the multiple sets of road structure data.
  • the prediction result is used to indicate whether the target obstacle vehicle in the corresponding image exhibits starting behavior.
  • the third determination module 604 includes:
  • the first determination sub-module is used to determine the image perception features corresponding to each frame of the multi-frame images based on the multi-frame target image areas.
  • the image perception features are used to characterize the motion features of the target obstacle vehicle and the environmental features of the surroundings of the target obstacle vehicle;
  • the second determination sub-module is used to determine the host vehicle perception features corresponding to each frame of the multi-frame images based on the multiple sets of host vehicle motion data, and the host vehicle perception features are used to characterize the motion features of the host vehicle;
  • the third determination sub-module is used to determine the road perception features corresponding to each frame of the multi-frame images based on the multiple sets of road structure data.
  • the road perception features are used to characterize the structural features of the road where the target obstacle vehicle is located;
  • the fourth determination sub-module is used to determine the prediction result corresponding to each frame of the multi-frame images based on the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame images.
  • the first determination sub-module is used to: input the image data of the multi-frame target image areas into the common feature extraction network to obtain image common features, and determine multiple sets of combined data corresponding one-to-one to the multi-frame target image areas, each set of combined data including the image data of the corresponding frame of target image area and the image common features;
  • the multiple sets of combined data are input into the backbone network to obtain the image perception features corresponding to each frame of the multi-frame images.
  • the second determination sub-module is used to:
  • input the multiple sets of host vehicle motion data into the first multi-layer perceptron to obtain the host vehicle perception features corresponding to each frame of the multi-frame images.
  • the third determination sub-module is used to:
  • input the multiple sets of road structure data into the second multi-layer perceptron to obtain the road perception features corresponding to each frame of the multi-frame images.
  • the fourth determination sub-module is used to: input the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame images into the inter-frame feature fusion model to obtain the fusion features corresponding to each frame of the multi-frame images;
  • the fusion features corresponding to each frame of the multi-frame images are input into the third multi-layer perceptron to obtain the prediction result corresponding to each frame of the multi-frame images.
  • the inter-frame feature fusion model includes a serial recurrent neural network and an attention mechanism network.
  • the host vehicle motion data includes at least one of vehicle speed and yaw angular velocity of the host vehicle.
  • the road structure data includes at least one of the positions of the lane line and the road edge closest to the target obstacle vehicle.
  • the starting behavior of the obstacle vehicle is predicted based on image data and host vehicle motion data without requiring data collected by multiple sensors such as lidar, millimeter-wave radar and ultrasonic radar, thereby reducing the jitter, error and large-latency problems in the perception results caused by fusing multi-source sensors, and also reducing erroneous prediction results caused by inaccurate host vehicle positioning.
  • in addition, this solution does not require high-definition maps and can also be applied in scenarios where no high-definition map is available and/or positioning is poor. It can be seen that the prediction precision, accuracy and real-time performance of this solution are higher, and its generalization is also better.
  • when the vehicle starting behavior prediction device provided in the above embodiments predicts vehicle starting behavior, the division into the above functional modules is only used as an example; in practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the vehicle starting behavior prediction device provided in the above embodiments and the embodiments of the vehicle starting behavior prediction method belong to the same concept; for the specific implementation process, refer to the method embodiments, which are not repeated here.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server or data center to another website, computer, server or data center by wired means (such as coaxial cable, optical fiber or digital subscriber line (DSL)) or wireless means (such as infrared, radio or microwave).
  • the computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrated with one or more available media.
  • the available media may be magnetic media (such as floppy disks, hard disks or magnetic tapes), optical media (such as digital versatile discs (DVD)) or semiconductor media (such as solid state disks (SSD)), etc.
  • the computer-readable storage media mentioned in the embodiments of this application may be non-volatile storage media, in other words, may be non-transitory storage media.
  • the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and signals involved in the embodiments of this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
  • the images, videos, motion data, road structure data, etc. involved in the embodiments of this application are all obtained with full authorization.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Traffic Control Systems (AREA)
  • Image Analysis (AREA)

Abstract

This application discloses a prediction method and apparatus for vehicle starting behavior, a storage medium and a program product, belonging to the field of intelligent driving technology. In the method, the starting behavior of an obstacle vehicle is predicted based on image data and host vehicle motion data, without requiring data collected by multiple sensors such as lidar, millimeter-wave radar and ultrasonic radar, thereby reducing the jitter, error and large-latency problems in perception results caused by fusing multi-source sensors, and also reducing erroneous prediction results caused by inaccurate host vehicle positioning. In addition, this solution does not require a high-definition map and can also be applied in scenarios without a high-definition map and/or with poor positioning. It can be seen that this solution has higher prediction precision, accuracy and real-time performance, and better generalization.

Description

Prediction method and apparatus for vehicle starting behavior, storage medium and program product
This application claims priority to Chinese Patent Application No. 202210539732.2, filed on May 17, 2022 and entitled "Prediction method and apparatus for vehicle starting behavior, storage medium and program product", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of intelligent driving technology, and in particular to a prediction method and apparatus for vehicle starting behavior, a storage medium and a program product.
Background
Predicting the behavior of vehicles on the road helps improve driving safety. For example, in an autonomous driving scenario, the autonomous vehicle is the host vehicle, and the other vehicles on the road are obstacle vehicles. While driving, the host vehicle can perform behavior prediction on obstacle vehicles so as to automatically plan and control the host vehicle's driving trajectory according to the prediction results, thereby reducing the probability of collision with an obstacle vehicle. Behavior prediction for an obstacle vehicle includes predicting whether the obstacle vehicle exhibits starting behavior.
In the related art, the host vehicle obtains environmental data collected by multiple sensors, which include the lidar, camera, millimeter-wave radar, ultrasonic radar, etc. on the host vehicle. The host vehicle fuses these environmental data and determines perception information such as the obstacle vehicle's position, speed, heading, lane line and traffic lights; that is, the perception result is determined by fusing the environmental data collected by multi-source sensors. Then, the host vehicle predicts whether the obstacle vehicle exhibits starting behavior based on the perception result, the host vehicle positioning result and a high-definition map (HD map).
However, radar detection results often suffer from jitter and bias, which affects the accuracy of predicting the starting behavior of obstacle vehicles. Moreover, in environments with poor positioning, such as tunnels and construction sections, the host vehicle positioning result may be inaccurate, which also affects the accuracy of predicting the starting behavior of obstacle vehicles and thus the normal driving of the host vehicle. In addition, the frame rates of the multiple sensors differ, and the environmental data collected by the multi-source sensors must be fused at the minimum frame rate, so the resulting perception results have a low frame rate, which leads to poor real-time performance and large latency in prediction.
Summary
This application provides a prediction method and apparatus for vehicle starting behavior, a storage medium and a program product, which can improve the precision, accuracy and real-time performance of vehicle starting behavior prediction; this solution also generalizes better. The technical solution is as follows:
In a first aspect, a prediction method for vehicle starting behavior is provided, the method comprising:
acquiring multiple frames of images and multiple sets of host vehicle motion data, the multiple sets of host vehicle motion data respectively corresponding to the multi-frame images, the multi-frame images being obtained by photographing environmental information around the host vehicle; detecting the target obstacle vehicle in the multi-frame images to determine multi-frame target image areas, the multi-frame target image areas being the areas where the target obstacle vehicle is located in the multi-frame images; recognizing the road structure in the multi-frame images and, in combination with the multi-frame target image areas, determining multiple sets of road structure data, the multiple sets of road structure data respectively representing the road structure of the road where the target obstacle vehicle is located in the multi-frame images; and determining, based on the multi-frame target image areas, the multiple sets of host vehicle motion data and the multiple sets of road structure data, the prediction result corresponding to each frame of the multi-frame images, the prediction result being used to indicate whether the target obstacle vehicle in the corresponding image exhibits starting behavior.
In this method, the starting behavior of the obstacle vehicle is predicted based on image data and host vehicle motion data, without requiring data collected by multiple sensors such as lidar, millimeter-wave radar and ultrasonic radar, thereby reducing the jitter, error and large-latency problems in perception results caused by fusing multi-source sensors, and also reducing erroneous prediction results caused by inaccurate host vehicle positioning. In addition, this solution does not require a high-definition map and can also be applied in scenarios without a high-definition map and/or with poor positioning. It can be seen that this solution has higher prediction precision, accuracy and real-time performance, and better generalization.
Optionally, the host vehicle motion data in this solution includes at least one of the vehicle speed and the yaw angular velocity of the host vehicle.
Optionally, the road structure data in this solution includes at least one of the positions of the lane line and the road edge closest to the target obstacle vehicle.
Optionally, determining the prediction result corresponding to each frame of the multi-frame images based on the multi-frame target image areas, the multiple sets of host vehicle motion data and the multiple sets of road structure data includes: determining, based on the multi-frame target image areas, the image perception features corresponding to each frame of the multi-frame images, the image perception features being used to characterize the motion features of the target obstacle vehicle and the environmental features of its surroundings; determining, based on the multiple sets of host vehicle motion data, the host vehicle perception features corresponding to each frame of the multi-frame images, the host vehicle perception features being used to characterize the motion features of the host vehicle; determining, based on the multiple sets of road structure data, the road perception features corresponding to each frame of the multi-frame images, the road perception features being used to characterize the structural features of the road where the target obstacle vehicle is located; and determining, based on the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame images, the prediction result corresponding to each frame of the multi-frame images. It should be understood that this solution predicts whether the obstacle vehicle exhibits starting behavior by perceiving the motion features and environmental features of the target obstacle vehicle in the images, as well as the motion features of the host vehicle and the structural features of the road where the target obstacle vehicle is located.
Optionally, determining the image perception features corresponding to each frame of the multi-frame images based on the multi-frame target image areas includes: inputting the image data of the multi-frame target image areas into a common feature extraction network to obtain image common features; determining multiple sets of combined data corresponding one-to-one to the multi-frame target image areas, each set of combined data including the image data of the corresponding frame of target image area and the image common features; and inputting the multiple sets of combined data into a backbone network to obtain the image perception features corresponding to each frame of the multi-frame images. It should be understood that the image common features are extracted first and, to a certain extent, characterize the static features of the environment around the target obstacle vehicle. The image common features are then combined with the image data of each frame of target image area, and the dynamic features of the target obstacle vehicle and the static features of the environment are extracted through the backbone network.
Optionally, the common feature extraction network is a multi-scale convolutional neural network. Multi-scale convolution can effectively fuse features of different scales, which is more conducive to extracting more robust image common features.
Optionally, determining the host vehicle perception features corresponding to each frame of the multi-frame images based on the multiple sets of host vehicle motion data includes: inputting the multiple sets of host vehicle motion data into a first multi-layer perceptron to obtain the host vehicle perception features corresponding to each frame of the multi-frame images.
Optionally, determining the road perception features corresponding to each frame of the multi-frame images based on the multiple sets of road structure data includes: inputting the multiple sets of road structure data into a second multi-layer perceptron to obtain the road perception features corresponding to each frame of the multi-frame images.
Optionally, determining the prediction result corresponding to each frame of the multi-frame images based on the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame images includes: inputting the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame images into an inter-frame feature fusion model to obtain the fusion features corresponding to each frame of the multi-frame images; and inputting the fusion features corresponding to each frame of the multi-frame images into a third multi-layer perceptron to obtain the prediction result corresponding to each frame of the multi-frame images. It should be understood that, through the inter-frame feature fusion model, the electronic device fuses both the three different features corresponding to the same frame of image and the features between frames.
Optionally, the inter-frame feature fusion model includes a serial recurrent neural network and an attention mechanism network. It is worth noting that combining a recurrent neural network with an attention mechanism network can effectively improve the computation accuracy of near-far dependencies between frames, that is, effectively capture the near-far dependency features between frames and thus effectively fuse inter-frame features, which can reduce jitter in the prediction results and effectively improve the precision and recall of the prediction results.
In a second aspect, a prediction apparatus for vehicle starting behavior is provided, the apparatus having the function of implementing the prediction method for vehicle starting behavior in the first aspect. The apparatus includes one or more modules, and the one or more modules are used to implement the prediction method for vehicle starting behavior provided in the first aspect.
That is, a prediction apparatus for vehicle starting behavior is provided, the apparatus comprising:
an acquisition module, used to acquire multiple frames of images and multiple sets of host vehicle motion data, the multiple sets of host vehicle motion data respectively corresponding to the multi-frame images, the multi-frame images being obtained by photographing environmental information around the host vehicle;
a first determination module, used to detect the target obstacle vehicle in the multi-frame images to determine multi-frame target image areas, the multi-frame target image areas being the areas where the target obstacle vehicle is located in the multi-frame images;
a second determination module, used to recognize the road structure in the multi-frame images and, in combination with the multi-frame target image areas, determine multiple sets of road structure data, the multiple sets of road structure data respectively representing the road structure of the road where the target obstacle vehicle is located in the multi-frame images;
a third determination module, used to determine, based on the multi-frame target image areas, the multiple sets of host vehicle motion data and the multiple sets of road structure data, the prediction result corresponding to each frame of the multi-frame images, the prediction result being used to indicate whether the target obstacle vehicle in the corresponding image exhibits starting behavior.
Optionally, the third determination module includes:
a first determination sub-module, used to determine, based on the multi-frame target image areas, the image perception features corresponding to each frame of the multi-frame images, the image perception features being used to characterize the motion features of the target obstacle vehicle and the environmental features of its surroundings;
a second determination sub-module, used to determine, based on the multiple sets of host vehicle motion data, the host vehicle perception features corresponding to each frame of the multi-frame images, the host vehicle perception features being used to characterize the motion features of the host vehicle;
a third determination sub-module, used to determine, based on the multiple sets of road structure data, the road perception features corresponding to each frame of the multi-frame images, the road perception features being used to characterize the structural features of the road where the target obstacle vehicle is located;
a fourth determination sub-module, used to determine, based on the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame images, the prediction result corresponding to each frame of the multi-frame images.
Optionally, the first determination sub-module is used to:
input the image data of the multi-frame target image areas into the common feature extraction network to obtain image common features;
determine multiple sets of combined data corresponding one-to-one to the multi-frame target image areas, each set of combined data including the image data of the corresponding frame of target image area and the image common features;
input the multiple sets of combined data into the backbone network to obtain the image perception features corresponding to each frame of the multi-frame images.
Optionally, the second determination sub-module is used to:
input the multiple sets of host vehicle motion data into the first multi-layer perceptron to obtain the host vehicle perception features corresponding to each frame of the multi-frame images.
Optionally, the third determination sub-module is used to:
input the multiple sets of road structure data into the second multi-layer perceptron to obtain the road perception features corresponding to each frame of the multi-frame images.
Optionally, the fourth determination sub-module is used to:
input the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame images into the inter-frame feature fusion model to obtain the fusion features corresponding to each frame of the multi-frame images;
input the fusion features corresponding to each frame of the multi-frame images into the third multi-layer perceptron to obtain the prediction result corresponding to each frame of the multi-frame images.
Optionally, the inter-frame feature fusion model includes a serial recurrent neural network and an attention mechanism network.
Optionally, the host vehicle motion data includes at least one of the vehicle speed and the yaw angular velocity of the host vehicle.
Optionally, the road structure data includes at least one of the positions of the lane line and the road edge closest to the target obstacle vehicle.
In a third aspect, an electronic device is provided, the electronic device comprising a processor and a memory, the memory being used to store a program for executing the prediction method for vehicle starting behavior provided in the first aspect, and to store the data involved in implementing the prediction method for vehicle starting behavior provided in the first aspect. The processor is configured to execute the program stored in the memory. The electronic device may further comprise a communication bus used to establish a connection between the processor and the memory.
In a fourth aspect, a computer-readable storage medium is provided, the computer-readable storage medium storing a computer program which, when executed by a processor, implements the prediction method for vehicle starting behavior described in the first aspect.
In a fifth aspect, a computer program product containing instructions is provided, the instructions, when executed by a processor, implementing the prediction method for vehicle starting behavior described in the first aspect.
The technical effects obtained by the second, third, fourth and fifth aspects are similar to those obtained by the corresponding technical means in the first aspect and are not repeated here.
The technical solution provided by this application can bring at least the following beneficial effects:
The starting behavior of the obstacle vehicle is predicted based on image data and host vehicle motion data, without requiring data collected by multiple sensors such as lidar, millimeter-wave radar and ultrasonic radar, thereby reducing the jitter, error and large-latency problems in perception results caused by fusing multi-source sensors, and also reducing erroneous prediction results caused by inaccurate host vehicle positioning. In addition, this solution does not require a high-definition map and can also be applied in scenarios without a high-definition map and/or with poor positioning. It can be seen that this solution has higher prediction precision, accuracy and real-time performance, and better generalization.
Brief Description of the Drawings
Figure 1 is a schematic structural diagram of an electronic device provided by an embodiment of this application;
Figure 2 is a flow chart of a prediction method for vehicle starting behavior provided by an embodiment of this application;
Figure 3 is a flow chart of another vehicle starting behavior prediction method provided by an embodiment of this application;
Figure 4 is a flow chart of yet another vehicle starting behavior prediction method provided by an embodiment of this application;
Figure 5 is a flow chart of a vehicle control method provided by an embodiment of this application;
Figure 6 is a schematic structural diagram of a prediction apparatus for vehicle starting behavior provided by an embodiment of this application.
Detailed Description
To make the objectives, technical solutions and advantages of the embodiments of this application clearer, the implementations of this application are described in further detail below with reference to the accompanying drawings.
First, the application scenarios involved in the embodiments of this application are introduced.
With the increase in vehicle ownership, the accompanying problems of air pollution, traffic congestion and traffic accidents have attracted more and more attention. To alleviate these problems, autonomous driving technology, assisted driving technology and intelligent transportation systems have developed rapidly. Vehicle behavior prediction, supported by massive data, plays a vital role in autonomous driving, assisted driving and intelligent transportation systems, and predicting the starting behavior of vehicles is an important part of vehicle behavior prediction. For example, if it is predicted that an obstacle vehicle at the roadside is about to start and cut into the lane where the host vehicle is located, the host vehicle can be controlled or reminded, based on the prediction result, to decelerate and give way or to change heading, thereby reducing traffic accidents.
It should be noted that the system architecture and service scenarios described in the embodiments of this application are intended to explain the technical solutions of the embodiments of this application more clearly and do not constitute a limitation on them. A person of ordinary skill in the art will know that, with the evolution of system architectures and the emergence of new service scenarios, the technical solutions provided by the embodiments of this application are equally applicable to similar technical problems.
Please refer to Figure 1, which is a schematic structural diagram of an electronic device according to an embodiment of this application. Optionally, the electronic device can be part or all of an in-vehicle unit or a server. The electronic device includes one or more processors 101, a communication bus 102, a memory 103 and one or more communication interfaces 104.
The processor 101 is a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, or one or more integrated circuits for implementing the solution of this application, for example an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof. Optionally, the PLD is a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL) or any combination thereof.
The communication bus 102 is used to transfer information between the above components. Optionally, the communication bus 102 is divided into an address bus, a data bus, a control bus, etc. For ease of representation, only one thick line is used in the figure, but this does not mean there is only one bus or one type of bus.
Optionally, the memory 103 is a read-only memory (ROM), a random access memory (RAM), an electrically erasable programmable read-only memory (EEPROM), an optical disc (including a compact disc read-only memory (CD-ROM), a compact disc, a laser disc, a digital versatile disc, a Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 103 exists independently and is connected to the processor 101 through the communication bus 102, or the memory 103 is integrated with the processor 101.
The communication interface 104 uses any transceiver-type device for communicating with other devices or a communication network. The communication interface 104 includes a wired communication interface and, optionally, a wireless communication interface. The wired communication interface is, for example, an Ethernet interface; optionally, the Ethernet interface is an optical interface, an electrical interface or a combination thereof. The wireless communication interface is a wireless local area network (WLAN) interface, a cellular network communication interface or a combination thereof.
Optionally, in some embodiments, the electronic device includes multiple processors, such as the processor 101 and the processor 105 shown in Figure 1. Each of these processors is a single-core processor or a multi-core processor. Optionally, a processor here refers to one or more devices, circuits, and/or processing cores for processing data (such as computer program instructions).
In a specific implementation, as an embodiment, the electronic device further includes an output device 106 and an input device 107. The output device 106 communicates with the processor 101 and can display information in multiple ways; for example, the output device 106 is a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, a projector, etc. The input device 107 communicates with the processor 101 and can receive user input in multiple ways; for example, the input device 107 is a mouse, a keyboard, a touch screen device, a sensing device, etc.
In some embodiments, the memory 103 is used to store the program code 110 for executing the solution of this application, and the processor 101 can execute the program code 110 stored in the memory 103. The program code includes one or more software modules, and the electronic device can implement, through the processor 101 and the program code 110 in the memory 103, the prediction method for vehicle starting behavior provided by the embodiment of Figure 2 below.
Figure 2 is a flow chart of a prediction method for vehicle starting behavior provided by an embodiment of this application. The method is applied to an electronic device. Optionally, the electronic device is a device on the host vehicle, such as the in-vehicle unit; the electronic device may also be a server, such as a server of a traffic management center. Please refer to Figure 2; the method includes the following steps.
Step 201: Acquire multiple frames of images and multiple sets of host vehicle motion data, the multiple sets of host vehicle motion data respectively corresponding to the multi-frame images, the multi-frame images being obtained by photographing environmental information around the host vehicle.
In this embodiment of the application, the electronic device acquires multiple frames of images and multiple sets of host vehicle motion data. The multiple sets of host vehicle motion data correspond one-to-one to the multi-frame images; that is, they are the host vehicle motion data corresponding to each frame of the multi-frame images. The multi-frame images are obtained by photographing environmental information around the host vehicle, for example images captured by a camera mounted at the front of the host vehicle. The host vehicle motion data is data collected by motion sensors on the host vehicle, which include one or more of a vehicle speed sensor, an angular velocity sensor, etc. The host vehicle motion data can accurately characterize the real-time motion of the host vehicle. Optionally, the host vehicle motion data in this embodiment includes at least one of the vehicle speed and the yaw angular velocity of the host vehicle.
Since the host vehicle motion data used in the vehicle behavior prediction of this solution must correspond one-to-one to the images, if the frame rate of the motion sensors on the host vehicle (e.g., 100 hertz (Hz)) is higher than the frame rate of the camera (e.g., 20 Hz), the raw motion data collected by the motion sensors can be down-sampled by linear interpolation, i.e., frequency reduction, to obtain multiple sets of host vehicle motion data corresponding one-to-one to the multi-frame images, with the timestamps of the obtained motion data matching those of the multi-frame images. If the frame rate of the motion sensors on the host vehicle is lower than that of the camera, the raw motion data collected by the motion sensors can be up-sampled by linear interpolation to obtain multiple sets of host vehicle motion data that correspond one-to-one and are time-aligned with the multi-frame images.
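A minimal sketch of this time alignment with NumPy's linear interpolation follows; the frame rates, signal shapes and variable names are assumed for illustration.

```python
import numpy as np

# Hypothetical example: resample 100 Hz motion data onto 20 Hz camera timestamps
# by linear interpolation, so each image frame gets one set of motion data.
t_motion = np.arange(0.0, 1.0, 1 / 100)            # sensor timestamps (s)
speed = 10.0 + 0.5 * t_motion                       # raw vehicle speed samples
yaw_rate = 0.01 * np.ones_like(t_motion)            # raw yaw angular velocity samples

t_cam = np.arange(0.0, 1.0, 1 / 20)                 # camera timestamps (s)
ego = np.stack([np.interp(t_cam, t_motion, speed),
                np.interp(t_cam, t_motion, yaw_rate)], axis=1)  # [N, 2], time-aligned
```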
It should be noted that the frame rates of motion sensors and cameras on current vehicles are generally higher than the fusion frame rate of multi-source sensors, where the fusion frame rate is usually consistent with the minimum frame rate of the multi-source sensors; therefore, this solution has smaller latency and better real-time performance.
Optionally, the number of the multi-frame images is N. At the moment the i-th frame of image is acquired, the electronic device processes frames i-N+1 to i; at the moment the (i+1)-th frame is acquired, it processes frames i-N+2 to i+1, where i is not less than N. Optionally, N can be 16, 10 or another value; in this embodiment of the application, N = 16 is used as an example. Illustratively, after acquiring the 16th frame, the electronic device processes frames 1 to 16; after acquiring the 17th frame, it processes frames 2 to 17; and after acquiring the 18th frame, it processes frames 3 to 18.
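The sliding-window behavior described above can be sketched as follows (an illustrative helper, not from the patent):

```python
def sliding_windows(frames, N=16):
    """Yield the N-frame window ending at each newly acquired frame i (i >= N-1),
    e.g. frames 1-16, then 2-17, then 3-18, as new images arrive."""
    for i in range(N - 1, len(frames)):
        yield frames[i - N + 1 : i + 1]
```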
Illustratively, assume the number of the multi-frame images is N and the original image sequence formed by the N frames is (img_orig_t_N-1, ..., img_orig_t_0), with dimension [N, 3, w_orig, h_orig], where t denotes the moment at which the multi-frame images are acquired, N indicates there are N frames in the time dimension, 3 denotes the three RGB channels in the channel dimension, and w_orig and h_orig denote the width and height of each frame. Assuming the host vehicle motion data includes the vehicle speed and the yaw angular velocity, the dimension of the host vehicle data sequence composed of the N sets of host vehicle motion data is [N, 2], where N indicates there are N sets of motion data in the time dimension and 2 indicates that each set includes two data items.
Step 202: Detect the target obstacle vehicle in the multi-frame images to determine multi-frame target image areas, the multi-frame target image areas being the areas where the target obstacle vehicle is located in the multi-frame images.
In this embodiment of the application, the electronic device detects the target obstacle vehicle in the multi-frame images to determine the multi-frame target image areas. Optionally, the electronic device inputs the multi-frame images into a target detection network to determine the areas where the target obstacle vehicle is located in the multi-frame images.
Illustratively, after detecting the target obstacle vehicle in any of the multi-frame images, the electronic device selects the area where the target obstacle vehicle is located in that image with a rectangular box and, based on the rectangular box, crops out the target image area in the image. Optionally, the electronic device expands the rectangular box according to an expansion ratio and then crops out the target image area, where the expansion ratio is greater than or equal to 1. Illustratively, with an expansion ratio of 1.5, the electronic device enlarges the length and width of the rectangular box by 1.5 times respectively and then crops the image area within the enlarged rectangular box as the target image area. It should be noted that the center point of the expanded rectangular box is the same as that of the original rectangular box. Alternatively, the electronic device crops out the target image area directly according to the original rectangular box.
Generally, the target obstacle vehicle appears continuously in the multi-frame images. Optionally, if the target obstacle vehicle is present from the i-th frame to the j-th frame of the multi-frame images and j-i+1 is not less than a specified threshold, the subsequent steps are executed to predict whether the target obstacle vehicle exhibits starting behavior; if j-i+1 is less than the specified threshold, the subsequent steps are not executed, that is, no behavior prediction is performed for that target obstacle vehicle. Illustratively, assuming the number of the multi-frame images is 16 and the specified threshold is 8, if the target obstacle vehicle is present in 8 of the 16 frames, the subsequent steps continue; if it is present in only 5 of the 16 frames, no behavior prediction is performed for it.
Optionally, if the target obstacle vehicle is present from the i-th frame to the j-th frame of the multi-frame images and j-i+1 is not less than the specified threshold but less than the total number of the multi-frame images, the target image area corresponding to the i-th frame is used as the target image area for the frames before the i-th frame, and the target image area corresponding to the j-th frame is used as the target image area for the frames after the j-th frame. Alternatively, the rectangular box of the i-th frame is used as the rectangular box for the frames before the i-th frame, so as to crop out the target image areas in those frames, and the rectangular box of the j-th frame is used as the rectangular box for the frames after the j-th frame, so as to crop out the target image areas in the frames after the j-th frame. Here, i is not less than 1 and j is less than N, or i is greater than 1 and j is not greater than N, where N is the total number of the multi-frame images. Illustratively, assuming N is 16 and the target obstacle vehicle is detected in frames 1 to 14 of the 16 frames, the target image area of the 14th frame is used as the target image area of the 15th and 16th frames, or the rectangular box of the 14th frame is used as the rectangular box of the 15th and 16th frames, so as to crop out the target image areas of the 15th and 16th frames.
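A hedged sketch of the detection-count threshold and the padding of missing frames with the nearest detected rectangular box follows; it assumes per-frame boxes stored as a list, with None for frames without a detection and one contiguous run of detections, which are assumptions for illustration.

```python
def fill_boxes(boxes, min_frames=8):
    """boxes: per-frame BBOX or None over the N frames, assuming the target is
    detected on one contiguous run i..j. Returns None if the run is too short;
    otherwise pads both ends with the nearest detected box."""
    hits = [k for k, b in enumerate(boxes) if b is not None]
    if not hits or len(hits) < min_frames:
        return None                       # too few detections: skip prediction
    i, j = hits[0], hits[-1]
    filled = list(boxes)
    for k in range(0, i):
        filled[k] = boxes[i]              # frames before i reuse frame i's box
    for k in range(j + 1, len(boxes)):
        filled[k] = boxes[j]              # frames after j reuse frame j's box
    return filled
```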
In addition, if multiple obstacle vehicles are present in one frame of image, the electronic device can detect the multiple obstacle vehicles and treat each of them as a target obstacle vehicle, so as to perform starting behavior detection for each target obstacle vehicle. It should be noted that steps 201 to 204 describe the starting behavior detection performed for one of the target obstacle vehicles. Illustratively, assuming the number of frames among frames 1 to 16 containing obstacle vehicle A is greater than the specified threshold, steps 201 to 204 are executed for obstacle vehicle A in frames 1 to 16 to predict the starting behavior of obstacle vehicle A. Assuming the number of frames among frames 3 to 18 containing obstacle vehicle B is greater than the specified threshold, steps 201 to 204 are executed for obstacle vehicle B in frames 3 to 18 to predict the starting behavior of obstacle vehicle B. Assuming the number of frames among frames 1 to 16 containing obstacle vehicle C is greater than the specified threshold, steps 201 to 204 are also executed for obstacle vehicle C in frames 1 to 16 to predict the starting behavior of obstacle vehicle C.
In this embodiment of the application, assuming the number of the multi-frame images is N, the target image sequence formed by the N frames of target image areas is (img_t_N-1, ..., img_t_0), with dimension [N, 3, w, h], where t denotes the moment at which the multi-frame images are acquired, N indicates there are N frames of target image areas in the time dimension, 3 denotes the three RGB channels in the channel dimension, and w and h denote the width and height of each frame of target image area.
Step 203: Recognize the road structure in the multi-frame images and, in combination with the multi-frame target image areas, determine multiple sets of road structure data, the multiple sets of road structure data respectively representing the road structure of the road where the target obstacle vehicle is located in the multi-frame images.
In this embodiment of the application, the electronic device can recognize the road structure in the multi-frame images and, in combination with the multi-frame target image areas, determine the road structure data of the road where the target obstacle vehicle is located in the multi-frame images.
Illustratively, by recognizing the road structure in the multi-frame images, the electronic device can identify every road element in the multi-frame images, for example every lane line and every road edge. Based on the areas where the target obstacle vehicle is detected in the multi-frame images, the electronic device determines, from all recognized roads in the multi-frame images, the road where the target obstacle vehicle is located in each frame, and obtains the road structure data of the road where the target obstacle vehicle is located in the multi-frame images.
Optionally, the road structure data includes at least one of the positions of the lane line and the road edge closest to the target obstacle vehicle. Taking the case where the road structure data includes the position of the lane line closest to the target obstacle vehicle as an example, each set of road structure data includes the coordinates of multiple two-dimensional discrete points of the corresponding lane line. In this embodiment of the application, the number of these two-dimensional discrete points is 30, i.e., each set of road structure data includes 30 two-dimensional coordinates. Optionally, if the length of the lane line closest to the target obstacle vehicle recognized in a frame of image is not less than a preset length (e.g., 100 meters), the lane line can be represented by 30 two-dimensional coordinates. If the length of the recognized lane line closest to the target obstacle vehicle in a frame of image is less than the preset length, the number of two-dimensional discrete points representing the lane line may be less than 30; in this case, the coordinates of the two-dimensional discrete point on the recognized lane line farthest from the host vehicle can be used to pad the set up to 30 two-dimensional coordinates, or the lane line can be extended by curve fitting to make up the 30 two-dimensional coordinates.
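A minimal sketch of the first padding strategy (repeating the farthest discrete point to reach 30 coordinates) follows; the array layout and the assumption that the last point in the array is the one farthest from the host vehicle are illustrative.

```python
import numpy as np

def pad_lane_points(points: np.ndarray, target: int = 30) -> np.ndarray:
    """points: [K, 2] two-dimensional discrete points of the nearest lane line
    (K >= 1). Pads short lane lines by repeating the point farthest from the
    host vehicle, assumed here to be the last point in the array."""
    k = len(points)
    if k >= target:
        return points[:target]
    pad = np.repeat(points[-1:], target - k, axis=0)    # repeat farthest point
    return np.concatenate([points, pad], axis=0)        # -> [target, 2]
```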
In this embodiment of the application, assuming the number of the multi-frame images is N and each set of road structure data includes 30 two-dimensional coordinates representing one lane line, the dimension of the road data sequence formed by the N sets of road structure data is [N, 30, 2], where N indicates there are N sets of road structure data in the time dimension, 30 indicates each set includes 30 coordinates, and 2 indicates each coordinate includes two values, i.e., each coordinate is two-dimensional.
Step 204: Determine, based on the multi-frame target image areas, the multiple sets of host vehicle motion data and the multiple sets of road structure data, the prediction result corresponding to each frame of the multi-frame images, the prediction result being used to indicate whether the target obstacle vehicle in the corresponding image exhibits starting behavior.
In this embodiment of the application, one implementation in which the electronic device determines the prediction result corresponding to each frame of the multi-frame images based on the multi-frame target image areas, the multiple sets of host vehicle motion data and the multiple sets of road structure data is as follows: determine, based on the multi-frame target image areas, the image perception features corresponding to each frame of the multi-frame images; determine, based on the multiple sets of host vehicle motion data, the host vehicle perception features corresponding to each frame of the multi-frame images; determine, based on the multiple sets of road structure data, the road perception features corresponding to each frame of the multi-frame images; and determine, based on the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame images, the prediction result corresponding to each frame of the multi-frame images. Simply put, the electronic device separately perceives the features of the images, of the host vehicle motion data and of the road structure data, and combines these three parts of features to predict whether the obstacle vehicle exhibits starting behavior.
It should be noted that the image perception features are used to characterize the motion features of the target obstacle vehicle and the environmental features of its surroundings. The environmental features are generally static features; it should be understood that the image perception features characterize the dynamic features of the target obstacle vehicle and the static features of the environment. The host vehicle perception features are used to characterize the motion features of the host vehicle, and the road perception features are used to characterize the structural features of the road where the target obstacle vehicle is located.
In this embodiment of the application, one implementation in which the electronic device determines the image perception features corresponding to each frame of the multi-frame images based on the multi-frame target image areas is as follows: input the image data of the multi-frame target image areas into the common feature extraction network to obtain the image common features; determine multiple sets of combined data corresponding one-to-one to the multi-frame target image areas, each set of combined data including the image data of the corresponding frame of target image area and the image common features; and input the multiple sets of combined data into the backbone network to obtain the image perception features corresponding to each frame of the multi-frame images. That is, the electronic device first extracts the image common features, which to a certain extent characterize the static features of the environment around the target obstacle vehicle; it then combines the image common features with the image data of each frame of target image area and extracts the dynamic features of the target obstacle vehicle and the static features of the environment through the backbone network.
Optionally, the image common features have a channel dimension, and the image data of the multi-frame target image areas also has a channel dimension. The electronic device concatenates the image common features with the image data of each frame of target image area in the channel dimension to obtain the corresponding set of combined data. In this way, concatenation in the channel dimension facilitates the subsequent extraction, through the backbone network, of clearly distinguished static features and motion features. Illustratively, the dimension of the image data of each frame of target image area is [c1, w, h], where c1 denotes the number of channels of the image data and w and h denote the width and height of the target image area; for an RGB image, c1 equals 3, representing the three RGB channels. The dimension of the image common features is [c2, w, h], where c2 denotes the number of channels of the image common features, and their height and width are the same as those of the target image areas; optionally, c2 equals 3 or another value. The dimension of each set of combined data concatenated in the channel dimension is [c1+c2, w, h].
Optionally, the common feature extraction (CFE) network is a multi-scale convolutional neural network; that is, the convolutional layers in the common feature extraction network use multiple convolution kernels. Illustratively, the CFE network shown in Figure 4 includes 4 convolutional layers; the first three layers each use convolution kernels of 3 different scales, with sizes 1×1, 3×3 and 5×5. Multi-scale convolution can effectively fuse features of different scales, which is more conducive to extracting more robust image common features. It should be noted that the common feature extraction network can also be another type of neural network, and the CFE network shown in Figure 4 is not intended to limit the embodiments of this application.
Next, the CFE network shown in Figure 4 is introduced in detail. Assume that the target image sequence formed in chronological order by the N frames of target image areas determined from the images collected at time t is denoted (img_t_N-1, ..., img_t_0), with dimension [N, 3, w, h], where N indicates there are N frames of target image areas in the time dimension, 3 denotes the three RGB channels in the channel dimension, and w and h denote the width and height of each frame of target image area.
First, the target image sequence (img_t_N-1, ..., img_t_0) is input into the first convolutional layer of the CFE network to obtain the intermediate features output by the first layer, ComFea1 = mean(conv1×1(img_t_N-1, ..., img_t_0), conv3×3(img_t_N-1, ..., img_t_0), conv5×5(img_t_N-1, ..., img_t_0)), with dimension [32, 3, w, h]. Then, ComFea1 is input into the second convolutional layer to obtain ComFea2 = mean(conv1×1(ComFea1), conv3×3(ComFea1), conv5×5(ComFea1)), with dimension [16, 3, w, h]. Then, ComFea2 is input into the third convolutional layer to obtain ComFea3 = mean(conv1×1(ComFea2), conv3×3(ComFea2), conv5×5(ComFea2)), with dimension [8, 3, w, h]. Finally, ComFea3 is input into the fourth convolutional layer to obtain the image common features output by the fourth layer, ComFea_t = conv1×1(ComFea3), with dimension [1, 3, w, h]. Here, mean() denotes the averaging operation, denoted M in Figure 4.
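Under the shape conventions just listed, one way to realize this structure in PyTorch is sketched below; reading the leading dimension of [32, 3, w, h] as channels computed over the N frames (so that the three RGB planes act as the batch axis) is an interpretation assumed for the sketch, not confirmed by the patent.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """mean(conv1x1(x), conv3x3(x), conv5x5(x)): the M operation in Figure 4."""
    def __init__(self, cin, cout):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(cin, cout, k, padding=k // 2) for k in (1, 3, 5))
    def forward(self, x):
        return torch.stack([c(x) for c in self.convs]).mean(dim=0)

class CFE(nn.Module):
    """Sketch of the 4-layer common feature extraction network. The time
    dimension N is treated as the channel dimension so the layer outputs match
    the stated shapes [32,3,w,h] -> [16,3,w,h] -> [8,3,w,h] -> [1,3,w,h]."""
    def __init__(self, n_frames=16):
        super().__init__()
        self.l1 = MultiScaleBlock(n_frames, 32)
        self.l2 = MultiScaleBlock(32, 16)
        self.l3 = MultiScaleBlock(16, 8)
        self.l4 = nn.Conv2d(8, 1, kernel_size=1)
    def forward(self, imgs):                         # imgs: [N, 3, w, h]
        x = imgs.permute(1, 0, 2, 3)                 # [3, N, w, h]: RGB as batch axis
        x = self.l4(self.l3(self.l2(self.l1(x))))    # [3, 1, w, h]
        return x.permute(1, 0, 2, 3)                 # ComFea_t: [1, 3, w, h]

com_fea = CFE(16)(torch.randn(16, 3, 64, 64))        # -> [1, 3, 64, 64]
```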
Optionally, a repeat operation is performed on the image common features ComFea_t to obtain a common feature sequence R_ComFea_t = repeat(ComFea_t) whose dimension matches that of the target image sequence (img_t_N-1, ..., img_t_0), i.e., the dimension of R_ComFea_t is also [N, 3, w, h]. The common feature sequence R_ComFea_t is concatenated with the target image sequence (img_t_N-1, ..., img_t_0) in the channel dimension to obtain a sequence formed by the multiple sets of combined data, and this sequence is input into the backbone network.
Optionally, the backbone network is a convolutional neural network. Illustratively, the backbone network can use a ResNet (e.g., ResNet50); the embodiments of this application do not limit the network structure of the backbone network. Optionally, the backbone network includes multiple CNNs (as shown in Figure 4) corresponding one-to-one to the multiple sets of combined data, and the electronic device inputs the multiple sets of combined data into the multiple CNNs respectively to obtain the image perception features corresponding to each frame of the multi-frame images. Optionally, in this embodiment of the application, the network structures and parameters of the multiple CNNs are the same; in some other embodiments, they may differ.
In this embodiment of the application, one implementation in which the electronic device determines the host vehicle perception features corresponding to each frame of the multi-frame images based on the multiple sets of host vehicle motion data is as follows: input the multiple sets of host vehicle motion data into the first multi-layer perceptron to obtain the host vehicle perception features corresponding to each frame of the multi-frame images.
Optionally, the multiple sets of host vehicle motion data are respectively input into multiple first multi-layer perceptrons (MLP). In this embodiment of the application, the network structures and parameters of the multiple first MLPs are the same; in some other embodiments, they may differ. Alternatively, the multiple sets of host vehicle motion data are all input into the same first MLP; for example, each time a set of host vehicle motion data is determined, the currently determined set is input into the first MLP to ensure real-time performance. Optionally, the first MLP includes one or more hidden layers; in this embodiment of the application, the first MLP includes two hidden layers.
In this embodiment of the application, one implementation in which the electronic device determines the road perception features corresponding to each frame of the multi-frame images based on the multiple sets of road structure data is as follows: input the multiple sets of road structure data into the second multi-layer perceptron to obtain the road perception features corresponding to each frame of the multi-frame images.
Optionally, the multiple sets of road structure data are respectively input into multiple second MLPs. In this embodiment of the application, the network structures and parameters of the multiple second MLPs are the same; in some other embodiments, they may differ. Alternatively, the multiple sets of road structure data are all input into the same second MLP; for example, each time a set of road structure data is determined, the currently determined set is input into the second MLP to ensure real-time performance. Optionally, the second MLP includes one or more hidden layers; in this embodiment of the application, the second MLP includes two hidden layers.
It should be noted that, in this embodiment of the application, both the host vehicle motion data and the road structure data are processed by MLPs; the MLP used is equivalent to a feature extraction model, or in other words an encoding model, used to perform feature extraction on, or encode, the host vehicle motion data and the road structure data. In some other embodiments, the host vehicle motion data and/or the road structure data can also be processed by other neural networks.
In this embodiment of the application, one implementation in which the electronic device determines the prediction result corresponding to each frame of the multi-frame images based on the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame images is as follows: input the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame images into the inter-frame feature fusion model to obtain the fusion features corresponding to each frame of the multi-frame images; and input the fusion features corresponding to each frame of the multi-frame images into the third multi-layer perceptron to obtain the prediction result corresponding to each frame of the multi-frame images. It should be understood that, through the inter-frame feature fusion model, the electronic device fuses both the three different features corresponding to the same frame of image and the features between frames.
Optionally, the electronic device concatenates the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame images to obtain the combined perception features corresponding to the respective image, and inputs the combined perception features corresponding to each frame into the inter-frame feature fusion model to obtain the fusion features corresponding to each frame. Illustratively, assuming the dimension of the image perception features corresponding to the multi-frame images is [N, c1], that of the host vehicle perception features is [N, c2] and that of the road perception features is [N, c3], the dimension of the combined perception features corresponding to the multi-frame images is [N, c1+c2+c3], where N indicates there are N frames in the time dimension, c1 denotes the number of elements in each image perception feature, c2 the number of elements in each host vehicle perception feature, and c3 the number of elements in each road perception feature.
Optionally, in this embodiment of the application, the inter-frame feature fusion model includes a serial recurrent neural network (RNN) and an attention mechanism network. The RNN can be a long short-term memory (LSTM) network, a gated recurrent unit (GRU), etc. As shown in Figure 4, the RNN uses a two-layer bidirectional LSTM network. The attention mechanism network can be a self-attention (SA) network, a multi-head attention (MHA) network, etc.
It is worth noting that combining the recurrent neural network with the attention mechanism network can effectively improve the computation accuracy of near-far dependencies between frames, that is, effectively capture the near-far dependency features between frames and thus effectively fuse inter-frame features, which can reduce jitter in the prediction results and effectively improve the precision and recall of the prediction results.
It should be noted that the network structure of the inter-frame feature fusion model introduced above is not intended to limit the embodiments of this application. For example, in some other embodiments, the inter-frame feature fusion model may include a recurrent neural network without an attention mechanism network.
Optionally, the fusion features corresponding to each frame of the multi-frame images are respectively input into multiple third MLPs. In this embodiment of the application, the network structures and parameters of the multiple third MLPs are the same; in some other embodiments, they may differ. Alternatively, the fusion features corresponding to each frame of the multi-frame images are all input into the same third MLP; for example, the fusion features corresponding to each frame are input into the third MLP in the chronological order of the multi-frame images. Optionally, the third MLP includes one or more hidden layers; in this embodiment of the application, the third MLP includes two hidden layers.
It should be noted that, in this embodiment of the application, the fusion features corresponding to each frame of the multi-frame images are processed by the third MLP to obtain the prediction results. The third MLP is equivalent to a classification model, and the prediction results fall into two classes: one class indicates that starting behavior exists, and the other indicates that no starting behavior exists. In some other embodiments, the fusion features corresponding to each frame of the multi-frame images can also be processed by other neural networks to obtain the prediction results.
Figure 3 is a flow chart of another vehicle starting behavior prediction method provided by an embodiment of this application. Referring to Figure 3, the vehicle starting behavior prediction method provided by the embodiments of this application is explained again by way of example. In Figure 3, the number of frames processed by the electronic device each time is N. The N frames of images acquired by the electronic device at time t are labeled t_0, ..., t_N-2, t_N-1 in chronological order. The electronic device performs target detection on these N frames to determine the areas where the target obstacle vehicle is located in the multi-frame images (the areas framed by the black rectangular boxes in Figure 3). The electronic device also performs road structure recognition on these N frames to determine the road structure data of the road where the target obstacle vehicle is located, i.e., to obtain N sets of road structure data. In addition, the electronic device acquires N sets of host vehicle motion data corresponding one-to-one to the N frames. The electronic device crops out the areas where the target obstacle vehicle is located in the multi-frame images according to the expansion ratio to obtain N frames of target image areas, and inputs the image data of the N frames of target image areas into the CFE network to extract the image common features. The electronic device concatenates (concat, C) the image common features with each frame of the N frames of target image areas in the channel dimension to obtain N sets of combined data, and inputs these N sets of combined data into the backbone network to obtain the image perception features corresponding to each frame of the N frames. The electronic device processes the N sets of host vehicle motion data separately through an MLP to obtain the host vehicle perception features corresponding to each frame of the N frames, and also processes the N sets of road structure data separately through an MLP to obtain the road perception features corresponding to each frame of the N frames. Then, the electronic device concatenates the image perception features, host vehicle perception features and road perception features corresponding to the N frames to obtain N sets of combined perception features corresponding one-to-one to the N frames. The electronic device inputs these N sets of combined perception features into the inter-frame feature fusion model (comprising a serial recurrent neural network and an attention mechanism network) to obtain the fusion features corresponding to each frame of the N frames. Finally, the electronic device processes the fusion features corresponding to the N frames separately through an MLP to obtain the N prediction results corresponding to the N frames.
The method flow chart shown in Figure 4 is obtained by expanding the structures of the network models in Figure 3. Figure 4 shows the specific structures of the CFE network, the backbone network, the recurrent neural network and the attention mechanism network; these network structures have been introduced above and are not repeated here.
Next, this solution is explained again by way of example through the following steps 1 to 8. It should be noted that the embodiments of this application do not restrict the execution order of steps 1 to 8.
1. Assume the image captured by the camera is denoted img_orig; the electronic device acquires multiple frames of images img_orig.
2. Target detection is performed on the target obstacle to obtain the bounding box (BBOX), i.e., the rectangular box, of the area where the target obstacle vehicle is located. Based on the BBOX, the image img_orig is cropped according to formula (1) to obtain the target image area img. The Crop() function in formula (1) expands the target BBOX according to the expansion ratio ratio and then crops the image img_orig.
img = Crop(img_orig, ratio)       (1)
3. Assume the total number of frames captured by the camera is N; the target image sequence obtained after cropping the multi-frame images according to step 2 is (img_t_N-1, ..., img_t_0), with dimension [N, 3, w, h]. The electronic device inputs the target image sequence into the CFE network and obtains, according to formula (2), the common feature sequence R_ComFea_t after N copies, with dimension also [N, 3, w, h].
R_ComFea_t = CFE(img_t_N-1, ..., img_t_0)      (2)
4. The electronic device concatenates the target image sequence and the common feature sequence in the channel dimension to obtain a combined data sequence including multiple sets of combined data, with dimension [N, 6, w, h], and inputs the combined data sequence into the backbone network (CNN) to perform feature extraction according to formula (3), obtaining the image perception features SFea_t corresponding to the multi-frame images, with dimension [N, c1]. Here, concat() denotes the concatenation operation.
SFea_t = CNN(concat((img_t_N-1, ..., img_t_0), R_ComFea_t))      (3)
5. In addition, the electronic device also acquires multiple sets of host vehicle motion data. Taking the host vehicle motion data denoted as Ego as an example, the host vehicle data sequence formed by the multiple sets of host vehicle motion data is (Ego_t_N-1, ..., Ego_t_0), with dimension [N, c2_in], where c2_in denotes the number of elements in each set of host vehicle motion data; for example, c2_in = 2 indicates that each set includes the two elements of vehicle speed and yaw angular velocity. Besides, the electronic device also recognizes the road structure of the multi-frame images to obtain multiple sets of road structure data. Taking the road structure data denoted as Lane as an example, and assuming the road structure data includes the two-dimensional coordinates of the lane line positions, the road data sequence formed by the multiple sets of road structure data is (Lane_t_N-1, ..., Lane_t_0), with dimension [N, c3_in, 2], where c3_in denotes the number of two-dimensional coordinates in each set of road structure data, and 2 indicates that each two-dimensional coordinate includes two coordinate values.
6. The electronic device processes the multiple sets of host vehicle motion data and the multiple sets of road structure data separately through MLPs. The dimension of the host vehicle perception features corresponding to the N frames obtained after MLP processing of the multiple sets of host vehicle motion data is [N, c2], and the dimension of the road perception features corresponding to the N frames obtained after MLP processing of the multiple sets of road structure data is [N, c3]. The electronic device concatenates the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame images to obtain the combined perception features C_SFea_t corresponding to the respective image; the dimension of the combined perception features C_SFea_t corresponding to the multi-frame images is [N, c1+c2+c3]. It should be understood that in step 6 the electronic device obtains the combined perception features C_SFea_t corresponding to the multi-frame images according to formula (4).
C_SFea_t = Concat(SFea_t, MLP(Ego_t_N-1, ..., Ego_t_0), MLP(Lane_t_N-1, ..., Lane_t_0))    (4)
7. The electronic device inputs the combined perception features C_SFea_t corresponding to the multi-frame images into the inter-frame feature fusion model to obtain, according to formula (5), the fusion features STARFea_t corresponding to the multi-frame images, with dimension [N, c4]. The inter-frame feature fusion model includes an RNN and an attention mechanism network (ATTENTION).
STARFea_t = ATTENTION(RNN(C_SFea_t))     (5)
8. The electronic device processes the fusion features STARFea_t corresponding to the multi-frame images through an MLP according to formula (6) to obtain the prediction results Out_t corresponding to the multi-frame images, with dimension [N, 2], where 2 denotes the two possible values of the prediction result, respectively indicating that starting behavior exists (e.g., CUTIN) and that no starting behavior exists (e.g., No-CUTIN). The CUTIN behavior includes the behavior of the obstacle vehicle starting and cutting into the lane where the host vehicle is located.
Out_t = MLP(STARFea_t)    (6)
It should be noted that each network model used in the embodiments of this application has been trained, and the embodiments of this application do not restrict the training methods of these network models. In one embodiment, batch processing is used to train these network models; that is, each adjustment of the network parameters is based on multiple sets of sample image sequences.
As can be seen from the foregoing, this solution can be applied in autonomous driving and assisted driving scenarios. Figure 5 is a flow chart of a vehicle control method in an autonomous driving or assisted driving scenario provided by an embodiment of this application. During autonomous driving of the host vehicle, multiple frames of images and the original motion data of the host vehicle are acquired, and the multi-frame images and the original motion data are processed by the perception module to obtain multi-frame target image areas, multiple sets of host vehicle motion data and multiple sets of road structure data. The perception module sends the multi-frame target image areas, the multiple sets of host vehicle motion data and the multiple sets of road structure data to the prediction module, which is used to predict vehicle starting behavior. The prediction module determines the prediction results corresponding to each frame of the multi-frame images based on the multi-frame target image areas, the multiple sets of host vehicle motion data and the multiple sets of road structure data, and sends the prediction results corresponding to each frame of the multi-frame images to the planning module. The planning module determines the driving trajectory, vehicle speed, etc. of the host vehicle based on the prediction results corresponding to each frame of image, and the control module controls the motion of the host vehicle according to the driving trajectory, vehicle speed, etc. planned by the planning module.
It should be noted that, in the fields of autonomous driving and assisted driving, the camera on the host vehicle captures images of the environment around the host vehicle, such as images of the environment in front of the host vehicle, and the vehicle speed sensor on the host vehicle collects host vehicle motion data such as vehicle speed and yaw angular velocity, so that the host vehicle, based on the image data and the host vehicle motion data, predicts whether the obstacle vehicle exhibits starting behavior according to the vehicle starting behavior prediction method provided by the embodiments of this application.
As can be seen from the foregoing, this solution can also be applied in intelligent transportation systems. In an intelligent transportation system, roadside equipment captures images of the surrounding environment and sends them to the server of the traffic management center, and vehicles on the road can also report motion data; the server predicts whether an obstacle vehicle exhibits starting behavior using the vehicle starting behavior prediction method provided by the embodiments of this application. Illustratively, the server can obtain multiple frames of images captured by a roadside device within a period of time, as well as the motion data reported by vehicles that will pass through the road where the roadside device is located within that period, and then predict the starting behavior of obstacle vehicles according to this solution. When the server predicts that the target obstacle vehicle exhibits starting behavior, it can broadcast to the vehicles that reported motion data to remind them that an obstacle vehicle nearby is about to start. In some embodiments, if the server cannot match or associate the multi-frame images captured by the roadside device with the motion data reported by the vehicles, the server can set the host vehicle motion data participating in the computation to 0, or to the average speed of the vehicles traveling on the road, etc., in order to implement this solution.
In summary, in the embodiments of this application, the starting behavior of the obstacle vehicle is predicted based on image data and host vehicle motion data, without requiring data collected by multiple sensors such as lidar, millimeter-wave radar and ultrasonic radar, thereby reducing the jitter, error and large-latency problems in perception results caused by fusing multi-source sensors, and also reducing erroneous prediction results caused by inaccurate host vehicle positioning. In addition, this solution does not require a high-definition map and can also be applied in scenarios without a high-definition map and/or with poor positioning. It can be seen that this solution has higher prediction precision, accuracy and real-time performance, and better generalization.
Figure 6 is a schematic structural diagram of a prediction apparatus 600 for vehicle starting behavior provided by an embodiment of this application. The prediction apparatus 600 can be implemented as part or all of an electronic device by software, hardware or a combination of the two, and the electronic device can be any electronic device in the above embodiments. Referring to Figure 6, the apparatus 600 includes: an acquisition module 601, a first determination module 602, a second determination module 603 and a third determination module 604.
The acquisition module 601 is used to acquire multiple frames of images and multiple sets of host vehicle motion data, the multiple sets of host vehicle motion data respectively corresponding to the multi-frame images, the multi-frame images being obtained by photographing environmental information around the host vehicle;
the first determination module 602 is used to detect the target obstacle vehicle in the multi-frame images to determine multi-frame target image areas, the multi-frame target image areas being the areas where the target obstacle vehicle is located in the multi-frame images;
the second determination module 603 is used to recognize the road structure in the multi-frame images and, in combination with the multi-frame target image areas, determine multiple sets of road structure data, the multiple sets of road structure data respectively representing the road structure of the road where the target obstacle vehicle is located in the multi-frame images;
the third determination module 604 is used to determine, based on the multi-frame target image areas, the multiple sets of host vehicle motion data and the multiple sets of road structure data, the prediction result corresponding to each frame of the multi-frame images, the prediction result being used to indicate whether the target obstacle vehicle in the corresponding image exhibits starting behavior.
Optionally, the third determination module 604 includes:
a first determination sub-module, used to determine, based on the multi-frame target image areas, the image perception features corresponding to each frame of the multi-frame images, the image perception features being used to characterize the motion features of the target obstacle vehicle and the environmental features of its surroundings;
a second determination sub-module, used to determine, based on the multiple sets of host vehicle motion data, the host vehicle perception features corresponding to each frame of the multi-frame images, the host vehicle perception features being used to characterize the motion features of the host vehicle;
a third determination sub-module, used to determine, based on the multiple sets of road structure data, the road perception features corresponding to each frame of the multi-frame images, the road perception features being used to characterize the structural features of the road where the target obstacle vehicle is located;
a fourth determination sub-module, used to determine, based on the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame images, the prediction result corresponding to each frame of the multi-frame images.
Optionally, the first determination sub-module is used to:
input the image data of the multi-frame target image areas into the common feature extraction network to obtain image common features;
determine multiple sets of combined data corresponding one-to-one to the multi-frame target image areas, each set of combined data including the image data of the corresponding frame of target image area and the image common features;
input the multiple sets of combined data into the backbone network to obtain the image perception features corresponding to each frame of the multi-frame images.
Optionally, the second determination sub-module is used to:
input the multiple sets of host vehicle motion data into the first multi-layer perceptron to obtain the host vehicle perception features corresponding to each frame of the multi-frame images.
Optionally, the third determination sub-module is used to:
input the multiple sets of road structure data into the second multi-layer perceptron to obtain the road perception features corresponding to each frame of the multi-frame images.
Optionally, the fourth determination sub-module is used to:
input the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame images into the inter-frame feature fusion model to obtain the fusion features corresponding to each frame of the multi-frame images;
input the fusion features corresponding to each frame of the multi-frame images into the third multi-layer perceptron to obtain the prediction result corresponding to each frame of the multi-frame images.
Optionally, the inter-frame feature fusion model includes a serial recurrent neural network and an attention mechanism network.
Optionally, the host vehicle motion data includes at least one of the vehicle speed and the yaw angular velocity of the host vehicle.
Optionally, the road structure data includes at least one of the positions of the lane line and the road edge closest to the target obstacle vehicle.
In the embodiments of this application, the starting behavior of the obstacle vehicle is predicted based on image data and host vehicle motion data, without requiring data collected by multiple sensors such as lidar, millimeter-wave radar and ultrasonic radar, thereby reducing the jitter, error and large-latency problems in perception results caused by fusing multi-source sensors, and also reducing erroneous prediction results caused by inaccurate host vehicle positioning. In addition, this solution does not require a high-definition map and can also be applied in scenarios without a high-definition map and/or with poor positioning. It can be seen that this solution has higher prediction precision, accuracy and real-time performance, and better generalization.
It should be noted that, when the prediction apparatus for vehicle starting behavior provided in the above embodiments predicts vehicle starting behavior, the division into the above functional modules is only used as an example; in practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the apparatus is divided into different functional modules to complete all or part of the functions described above. In addition, the prediction apparatus for vehicle starting behavior provided in the above embodiments and the embodiments of the prediction method for vehicle starting behavior belong to the same concept; for the specific implementation process, refer to the method embodiments, which are not repeated here.
In the above embodiments, the implementation may be wholly or partly realized by software, hardware, firmware or any combination thereof. When implemented by software, it may be wholly or partly implemented in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions according to the embodiments of this application are wholly or partly generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server or data center to another website, computer, server or data center by wired means (such as coaxial cable, optical fiber or digital subscriber line (DSL)) or wireless means (such as infrared, radio or microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrating one or more available media. The available media may be magnetic media (such as floppy disks, hard disks or magnetic tapes), optical media (such as digital versatile discs (DVD)) or semiconductor media (such as solid state disks (SSD)), etc. It is worth noting that the computer-readable storage media mentioned in the embodiments of this application may be non-volatile storage media, in other words, non-transitory storage media.
It should be understood that "at least one" mentioned herein means one or more, and "multiple" means two or more. In the description of the embodiments of this application, unless otherwise stated, "/" means "or"; for example, A/B can mean A or B. "And/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B can mean: A alone, both A and B, or B alone. In addition, to facilitate a clear description of the technical solutions of the embodiments of this application, words such as "first" and "second" are used to distinguish identical or similar items with substantially the same functions and effects. A person skilled in the art will understand that words such as "first" and "second" do not limit the quantity or execution order, and that such words do not necessarily imply a difference.
It should be noted that the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and signals involved in the embodiments of this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions. For example, the images, videos, motion data, road structure data, etc. involved in the embodiments of this application are all obtained with full authorization.
The above are embodiments provided by this application and are not intended to limit this application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of this application shall be included within the protection scope of this application.

Claims (20)

  1. A prediction method for vehicle starting behavior, characterized in that the method comprises:
    acquiring multiple frames of images and multiple sets of host vehicle motion data, the multiple sets of host vehicle motion data respectively corresponding to the multi-frame images, the multi-frame images being obtained by photographing environmental information around the host vehicle;
    detecting a target obstacle vehicle in the multi-frame images to determine multi-frame target image areas, the multi-frame target image areas being the areas where the target obstacle vehicle is located in the multi-frame images;
    recognizing the road structure in the multi-frame images and, in combination with the multi-frame target image areas, determining multiple sets of road structure data, the multiple sets of road structure data respectively representing the road structure of the road where the target obstacle vehicle is located in the multi-frame images;
    determining, based on the multi-frame target image areas, the multiple sets of host vehicle motion data and the multiple sets of road structure data, the prediction result corresponding to each frame of the multi-frame images, the prediction result being used to indicate whether the target obstacle vehicle in the corresponding image exhibits starting behavior.
  2. The method according to claim 1, characterized in that determining the prediction result corresponding to each frame of the multi-frame images based on the multi-frame target image areas, the multiple sets of host vehicle motion data and the multiple sets of road structure data comprises:
    determining, based on the multi-frame target image areas, the image perception features corresponding to each frame of the multi-frame images, the image perception features being used to characterize the motion features of the target obstacle vehicle and the environmental features of the surroundings of the target obstacle vehicle;
    determining, based on the multiple sets of host vehicle motion data, the host vehicle perception features corresponding to each frame of the multi-frame images, the host vehicle perception features being used to characterize the motion features of the host vehicle;
    determining, based on the multiple sets of road structure data, the road perception features corresponding to each frame of the multi-frame images, the road perception features being used to characterize the structural features of the road where the target obstacle vehicle is located;
    determining, based on the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame images, the prediction result corresponding to each frame of the multi-frame images.
  3. The method according to claim 2, characterized in that determining the image perception features corresponding to each frame of the multi-frame images based on the multi-frame target image areas comprises:
    inputting the image data of the multi-frame target image areas into a common feature extraction network to obtain image common features;
    determining multiple sets of combined data corresponding one-to-one to the multi-frame target image areas, each set of combined data comprising the image data of the corresponding frame of target image area and the image common features;
    inputting the multiple sets of combined data into a backbone network to obtain the image perception features corresponding to each frame of the multi-frame images.
  4. The method according to claim 2 or 3, characterized in that determining the host vehicle perception features corresponding to each frame of the multi-frame images based on the multiple sets of host vehicle motion data comprises:
    inputting the multiple sets of host vehicle motion data into a first multi-layer perceptron to obtain the host vehicle perception features corresponding to each frame of the multi-frame images.
  5. The method according to any one of claims 2-4, characterized in that determining the road perception features corresponding to each frame of the multi-frame images based on the multiple sets of road structure data comprises:
    inputting the multiple sets of road structure data into a second multi-layer perceptron to obtain the road perception features corresponding to each frame of the multi-frame images.
  6. The method according to any one of claims 2-5, characterized in that determining the prediction result corresponding to each frame of the multi-frame images based on the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame images comprises:
    inputting the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame images into an inter-frame feature fusion model to obtain the fusion features corresponding to each frame of the multi-frame images;
    inputting the fusion features corresponding to each frame of the multi-frame images into a third multi-layer perceptron to obtain the prediction result corresponding to each frame of the multi-frame images.
  7. The method according to claim 6, characterized in that the inter-frame feature fusion model comprises a serial recurrent neural network and an attention mechanism network.
  8. The method according to any one of claims 1-7, characterized in that the host vehicle motion data comprises at least one of the vehicle speed and the yaw angular velocity of the host vehicle.
  9. The method according to any one of claims 1-8, characterized in that the road structure data comprises at least one of the positions of the lane line and the road edge closest to the target obstacle vehicle.
  10. A prediction apparatus for vehicle starting behavior, characterized in that the apparatus comprises:
    an acquisition module, used to acquire multiple frames of images and multiple sets of host vehicle motion data, the multiple sets of host vehicle motion data corresponding to the multi-frame images, the multi-frame images being obtained by photographing environmental information around the host vehicle;
    a first determination module, used to detect a target obstacle vehicle in the multi-frame images to determine multi-frame target image areas, the multi-frame target image areas being the areas where the target obstacle vehicle is located in the multi-frame images;
    a second determination module, used to recognize the road structure in the multi-frame images and, in combination with the multi-frame target image areas, determine multiple sets of road structure data, the multiple sets of road structure data respectively representing the road structure of the road where the target obstacle vehicle is located in the multi-frame images;
    a third determination module, used to determine, based on the multi-frame target image areas, the multiple sets of host vehicle motion data and the multiple sets of road structure data, the prediction result corresponding to each frame of the multi-frame images, the prediction result being used to indicate whether the target obstacle vehicle in the corresponding image exhibits starting behavior.
  11. The apparatus according to claim 10, characterized in that the third determination module comprises:
    a first determination sub-module, used to determine, based on the multi-frame target image areas, the image perception features corresponding to each frame of the multi-frame images, the image perception features being used to characterize the motion features of the target obstacle vehicle and the environmental features of the surroundings of the target obstacle vehicle;
    a second determination sub-module, used to determine, based on the multiple sets of host vehicle motion data, the host vehicle perception features corresponding to each frame of the multi-frame images, the host vehicle perception features being used to characterize the motion features of the host vehicle;
    a third determination sub-module, used to determine, based on the multiple sets of road structure data, the road perception features corresponding to each frame of the multi-frame images, the road perception features being used to characterize the structural features of the road where the target obstacle vehicle is located;
    a fourth determination sub-module, used to determine, based on the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame images, the prediction result corresponding to each frame of the multi-frame images.
  12. The apparatus according to claim 11, characterized in that the first determination sub-module is used to:
    input the image data of the multi-frame target image areas into a common feature extraction network to obtain image common features;
    determine multiple sets of combined data corresponding one-to-one to the multi-frame target image areas, each set of combined data comprising the image data of the corresponding frame of target image area and the image common features;
    input the multiple sets of combined data into a backbone network to obtain the image perception features corresponding to each frame of the multi-frame images.
  13. The apparatus according to claim 11 or 12, characterized in that the second determination sub-module is used to:
    input the multiple sets of host vehicle motion data into a first multi-layer perceptron to obtain the host vehicle perception features corresponding to each frame of the multi-frame images.
  14. The apparatus according to any one of claims 11-13, characterized in that the third determination sub-module is used to:
    input the multiple sets of road structure data into a second multi-layer perceptron to obtain the road perception features corresponding to each frame of the multi-frame images.
  15. The apparatus according to any one of claims 11-14, characterized in that the fourth determination sub-module is used to:
    input the image perception features, host vehicle perception features and road perception features corresponding to each frame of the multi-frame images into an inter-frame feature fusion model to obtain the fusion features corresponding to each frame of the multi-frame images;
    input the fusion features corresponding to each frame of the multi-frame images into a third multi-layer perceptron to obtain the prediction result corresponding to each frame of the multi-frame images.
  16. The apparatus according to claim 15, characterized in that the inter-frame feature fusion model comprises a serial recurrent neural network and an attention mechanism network.
  17. The apparatus according to any one of claims 10-16, characterized in that the host vehicle motion data comprises at least one of the vehicle speed and the yaw angular velocity of the host vehicle.
  18. The apparatus according to any one of claims 10-17, characterized in that the road structure data comprises at least one of the positions of the lane line and the road edge closest to the target obstacle vehicle.
  19. A computer-readable storage medium, characterized in that a computer program is stored in the storage medium, and when the computer program is executed by a processor, the steps of the method according to any one of claims 1-9 are implemented.
  20. A computer program product, characterized in that computer instructions are stored in the computer program product, and when the computer instructions are executed by a processor, the steps of the method according to any one of claims 1-9 are implemented.
PCT/CN2023/093436 2022-05-17 2023-05-11 Prediction method and apparatus for vehicle starting behavior, storage medium and program product WO2023221848A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210539732.2 2022-05-17
CN202210539732.2A CN117115776A (zh) 2022-05-17 2022-05-17 Prediction method and apparatus for vehicle starting behavior, storage medium and program product

Publications (1)

Publication Number Publication Date
WO2023221848A1 true WO2023221848A1 (zh) 2023-11-23

Family

ID=88811576

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/093436 WO2023221848A1 (zh) 2022-05-17 2023-05-11 Prediction method and apparatus for vehicle starting behavior, storage medium and program product

Country Status (2)

Country Link
CN (1) CN117115776A (zh)
WO (1) WO2023221848A1 (zh)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002329298A * 2001-05-02 2002-11-15 Nissan Motor Co Ltd Vehicle travel control device
JP2003331397A * 2002-05-13 2003-11-21 Mitsubishi Electric Corp Start notification device
JP2007207274A * 2007-04-23 2007-08-16 Mitsubishi Electric Corp Start notification device
CN106652517A * 2016-09-12 2017-05-10 北京易车互联信息技术有限公司 Camera-based method and system for reminding that a leading vehicle is starting
US20180336424A1 * 2017-05-16 2018-11-22 Samsung Electronics Co., Ltd. Electronic device and method of detecting driving event of vehicle
CN110717361A * 2018-07-13 2020-01-21 长沙智能驾驶研究院有限公司 Ego-vehicle stop detection method, leading-vehicle start reminding method and storage medium
CN110733509A * 2018-07-18 2020-01-31 阿里巴巴集团控股有限公司 Driving behavior analysis method, apparatus, device and storage medium
WO2020172875A1 * 2019-02-28 2020-09-03 深圳市大疆创新科技有限公司 Road structure information extraction method, unmanned aerial vehicle and autonomous driving system
JP2020144724A * 2019-03-08 2020-09-10 オムロン株式会社 Vehicle tracking device, vehicle tracking method, and vehicle tracking program
CN111489560A * 2020-04-13 2020-08-04 深圳市海圳汽车技术有限公司 Leading-vehicle start detection method and control method based on a shallow convolutional neural network
CN113255612A * 2021-07-05 2021-08-13 智道网联科技(北京)有限公司 Leading-vehicle start reminding method and system, electronic device and storage medium
CN113830085A * 2021-09-26 2021-12-24 上汽通用五菱汽车股份有限公司 Vehicle follow-stop and start method, apparatus, device and computer-readable storage medium

Also Published As

Publication number Publication date
CN117115776A (zh) 2023-11-24

Similar Documents

Publication Publication Date Title
WO2022083402A1 Obstacle detection method and apparatus, computer device and storage medium
US11455805B2 Method and apparatus for detecting parking space usage condition, electronic device, and storage medium
EP4152204A1 Lane line detection method, and related apparatus
WO2021135879A1 Vehicle data monitoring method and apparatus, computer device and storage medium
US20220373353A1 Map Updating Method and Apparatus, and Device
KR20210080459A Lane detection method, apparatus, electronic device and readable storage medium
EP3951741B1 Method for acquiring traffic state, relevant apparatus, roadside device and cloud control platform
US20220172452A1 Detecting objects non-visible in color images
WO2021253245A1 Method and apparatus for identifying a vehicle lane-change tendency
CN112686923A Target tracking method and system based on a two-stage convolutional neural network
WO2022217630A1 Vehicle speed determination method and apparatus, device and medium
JP2023536025A Target detection method and apparatus in vehicle-road cooperation, and roadside device
WO2021036243A1 Lane recognition method and apparatus, and computing device
CN111967396A Obstacle detection processing method, apparatus, device and storage medium
CN115204044A Trajectory prediction model generation and trajectory information processing method, device and medium
CN114926791A Method and apparatus for detecting abnormal lane changes of vehicles at intersections, storage medium and electronic device
CN114724063A Highway traffic event detection method based on deep learning
CN114627409A Method and apparatus for detecting abnormal vehicle lane changes
CN112817006B Vehicle-mounted intelligent road disease detection method and system
CN116912517B Method and apparatus for detecting camera field-of-view boundary
WO2023221848A1 Prediction method and apparatus for vehicle starting behavior, storage medium and program product
CN116665179A Data processing method, apparatus, domain controller and storage medium
WO2022243337A2 System for detection and management of uncertainty in perception systems, for new object detection and for situation anticipation
CN114387310A Urban trunk road traffic flow statistics method based on deep learning
CN112215042A Parking space limiter recognition method and system, and computer device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23806800

Country of ref document: EP

Kind code of ref document: A1