WO2023231991A1 - Traffic signal lamp sensing method and apparatus, and device and storage medium - Google Patents

Traffic signal lamp sensing method and apparatus, and device and storage medium

Info

Publication number
WO2023231991A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
data
feature
feature vector
initial
Prior art date
Application number
PCT/CN2023/096961
Other languages
French (fr)
Chinese (zh)
Inventor
王磊
刘挺
卿泉
Original Assignee
阿里巴巴达摩院(杭州)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴达摩院(杭州)科技有限公司
Publication of WO2023231991A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/254 Fusion techniques of classification results, e.g. of results related to same input data

Definitions

  • Embodiments of the present application relate to the field of computer technology, and in particular to a traffic light sensing method, device, equipment and storage medium.
  • Traffic light perception refers to accurately identifying the color and control direction of traffic lights at intersections. It is a very important task in fields such as autonomous driving.
  • In related technologies, a common solution for traffic light perception is to obtain image data containing traffic lights and detect the image data through a target detection model to obtain the corresponding perception result.
  • This solution depends heavily on the image content, so its stability is poor. For example, when the traffic lights are blocked by surrounding objects such as large vehicles, or are invisible in the image due to rainy weather, this solution cannot obtain a perception result.
  • In view of this, embodiments of the present application provide a traffic light sensing method, device, equipment and storage medium to at least partially solve the above problems.
  • According to a first aspect of the embodiments of the present application, a traffic light sensing method is provided, including:
  • acquiring multiple types of target data of a target position, where the multiple types of target data include at least two of the following: image data, radar data, and map data;
  • performing feature extraction on each type of target data to obtain target feature vectors corresponding to each type of target data;
  • fusing the target feature vectors based on a cross-attention mechanism to obtain a fused feature vector;
  • performing classification prediction based on the fused feature vector to obtain a traffic light perception result of the target position.
  • According to a second aspect of the embodiments of the present application, a traffic light sensing device is provided, including:
  • a target data acquisition module, used to acquire multiple types of target data of a target position, where the multiple types of target data include at least two of the following: image data, radar data, and map data;
  • a target feature vector obtaining module, used to perform feature extraction on each type of target data to obtain target feature vectors corresponding to each type of target data;
  • a fusion module, used to fuse the target feature vectors based on a cross-attention mechanism to obtain a fused feature vector;
  • a result obtaining module, used to perform classification prediction based on the fused feature vector to obtain a traffic light perception result of the target position.
  • According to a third aspect of the embodiments of the present application, an electronic device is provided, including: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with each other through the communication bus; the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the traffic light sensing method described in the first aspect.
  • According to a fourth aspect of the embodiments of the present application, a computer storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the traffic light sensing method described in the first aspect is implemented.
  • The traffic light sensing method, device, equipment and storage medium provided by the embodiments of the present application acquire multiple different types of target data of the target position, obtain the target feature vectors corresponding to each type of target data, and then perform cross-attention-based feature fusion on the target feature vectors; traffic light perception is performed based on the fused feature vector.
  • That is to say, in the embodiments of this application, cross-modal data fusion and comprehensive analysis and reasoning are performed based on multiple different modal data of the environment around the target position to obtain the final perception result. Therefore, compared with perception methods that rely only on a single modality such as image data, the perception stability and accuracy of the embodiments of the present application are higher.
  • Figure 1 is a flow chart of the steps of a traffic light sensing method according to Embodiment 1 of the present application;
  • Figure 2 is a schematic diagram of an example scenario in the embodiment shown in Figure 1;
  • Figure 3 is a flow chart of the steps of a traffic light sensing method according to Embodiment 2 of the present application;
  • Figure 4 is a schematic diagram of an example scenario in the embodiment shown in Figure 3;
  • Figure 5 is a structural block diagram of a traffic light sensing device according to Embodiment 3 of the present application;
  • Figure 6 is a schematic structural diagram of an electronic device according to Embodiment 4 of the present application.
  • Referring to Figure 1, Figure 1 is a flow chart of the steps of a traffic light sensing method according to Embodiment 1 of the present application. Specifically, the traffic light sensing method provided by this embodiment includes the following steps:
  • Step 102: Acquire multiple types of target data of the target position, where the multiple types of target data include at least two of the following: image data, radar data, and map data.
  • Specifically, the target position may be a target intersection where traffic light sensing is to be performed, or a specific location around the target intersection.
  • The image data may be images of the target position collected by a camera or the like; the radar data may be point cloud data of the target position collected by a lidar, or three-dimensional data of the target position collected by a millimeter-wave radar, and so on.
  • The map data may contain the position, shape, size, and other information of instance objects at the target position, such as lane lines, crosswalks, and green belts.
  • The multiple types of target data may include any two of image data, radar data, and map data, or all three. The more types of target data acquired, the higher the accuracy and stability of the final traffic light perception result.
  • Step 104: Perform feature extraction on each type of target data to obtain target feature vectors corresponding to each type of target data.
  • Specifically, for image data, feature extraction can be performed based on a pre-trained feature extraction model to obtain the target feature vector corresponding to the image data. For radar data, the radar data can be detected by a pre-trained three-dimensional target detection model to obtain target detection results, and the target feature vector corresponding to the radar data is obtained based on those detection results.
  • For map data, after the map data is obtained, it can be vectorized to obtain the target feature vectors corresponding to the map data. For example, for the map instance object of a specific lane line at the target position, the position information of multiple sampling points on the lane line can be obtained, and every two adjacent sampling points are then used as the start point and end point of a generated vector; that vector is a feature vector of the lane line instance object, representing the lane-line segment between the two adjacent sampling points.
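  • To make the vectorization above concrete, the following minimal Python sketch (illustrative only; the patent does not specify an implementation, and all names are hypothetical) turns ordered sampling points on a lane line into segment vectors:

```python
import numpy as np

def vectorize_lane_line(sample_points):
    """Turn ordered (x, y) sampling points on a lane line into segment
    vectors: every two adjacent points become the start and end of one
    vector [x_start, y_start, x_end, y_end] representing that segment."""
    pts = np.asarray(sample_points, dtype=np.float32)
    # Pair each point with its successor: one row per adjacent point pair.
    return np.concatenate([pts[:-1], pts[1:]], axis=1)

# Example: 3 sampling points along a lane line yield 2 segment vectors.
segments = vectorize_lane_line([(0.0, 0.0), (0.0, 50.0), (0.0, 100.0)])
print(segments.shape)  # (2, 4)
```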
  • Step 106: Based on the cross-attention mechanism, fuse the target feature vectors to obtain a fused feature vector.
  • Specifically, for the target feature vector corresponding to each type of target data, the similarity between the target feature vectors corresponding to the other types of target data and this target feature vector can be used to adjust it, so that the adjusted target feature vector emphasizes information related to the other target feature vectors and de-emphasizes information that is weakly related to them; all adjusted target feature vectors are then fused to obtain the fused feature vector.
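  • A minimal sketch of this cross-attention adjustment follows, assuming dot-product similarity, softmax attention weights, a shared feature dimension across modalities, lists of 1-D vectors, and summation as the final fusion (none of these choices is fixed by the patent):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_attention_adjust(query_vec, other_vecs):
    """Adjust one modality's target feature vector using its similarity
    (attention weights) to the target feature vectors of the other
    modalities, emphasizing information related to them."""
    sims = np.array([query_vec @ v for v in other_vecs])  # similarity scores
    weights = softmax(sims / np.sqrt(len(query_vec)))     # attention weights
    return query_vec + sum(w * v for w, v in zip(weights, other_vecs))

def fuse(target_vecs):
    """Adjust every target feature vector against the others, then fuse the
    adjusted vectors (here by summation) into one fused feature vector."""
    adjusted = [cross_attention_adjust(v, target_vecs[:i] + target_vecs[i + 1:])
                for i, v in enumerate(target_vecs)]
    return np.sum(adjusted, axis=0)
```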
  • Step 108: Perform classification prediction based on the fused feature vector to obtain the traffic light perception result of the target position.
  • Specifically, the traffic light perception result finally obtained by the embodiments of the present application may include, for the target position, the traffic light colors for the straight, left-turn, and right-turn directions.
  • Any existing classification prediction method can be used to obtain the final perception result based on the fused feature vector, for example, through a classification prediction model used for classification prediction.
  • When a classification prediction model is used, it can be a classifier structure with three branches, where each branch outputs a binary classification result to predict the traffic light color for one of the three directions: straight, left turn, or right turn.
  • The specific structure of the classification prediction model is not limited; for example, a multi-layer perceptron model with a relatively simple structure can be used, and so on.
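  • As an illustration of such a three-branch classifier, here is a PyTorch-style sketch (the MLP backbone, the dimensions, and the binary red-versus-green heads are assumptions for illustration; the patent does not fix the model structure):

```python
import torch
import torch.nn as nn

class TrafficLightClassifier(nn.Module):
    """Three-branch classifier over the fused feature vector: each branch
    outputs a binary prediction of the traffic light color for one
    controlled direction (straight, left turn, or right turn)."""
    def __init__(self, fused_dim=256, hidden_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(fused_dim, hidden_dim), nn.ReLU())
        # One binary head (e.g. red vs. green, an assumption) per direction.
        self.heads = nn.ModuleDict(
            {d: nn.Linear(hidden_dim, 2) for d in ("straight", "left", "right")})

    def forward(self, fused_vec):
        h = self.backbone(fused_vec)
        return {d: head(h) for d, head in self.heads.items()}

# Example: one fused feature vector in, three binary logit pairs out.
logits = TrafficLightClassifier()(torch.randn(1, 256))
```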
  • Referring to Figure 2, which is a schematic diagram of a scene corresponding to Embodiment 1 of the present application, the embodiment of the present application is described below using a specific scenario example:
  • Three types of target data of the target position are acquired: image data, radar point cloud data, and map data. Feature extraction is performed on the image data to obtain the corresponding target feature vector 1. 3D target detection is performed on the radar point cloud data, yielding three targets at the target position, for example a pedestrian, vehicle 1, and vehicle 2.
  • The types of detectable targets can be preset according to actual needs; the embodiments of this application do not limit the number of types or the specific content of the detectable preset targets. For example, the detectable preset targets may include three types, namely pedestrians, vehicles, and cyclists; Figure 2, in which the radar data contains three preset targets, is merely an example.
  • Each detected target corresponds to one target feature vector, which characterizes the type, position, shape, and other features of that target; target feature vectors 2, 3, and 4 in Figure 2 are the target feature vectors corresponding to the radar data.
  • The map data of the target position can be vectorized to obtain the corresponding target feature vectors, where each target feature vector represents the feature information of one instance object in the map. Assuming that the map data in Figure 2 contains four instance objects, namely lane line 1, lane line 2, lane line 3, and a crosswalk, then target feature vector 5 represents the feature information of lane line 1, target feature vector 6 represents the feature information of lane line 2, target feature vector 7 represents the feature information of lane line 3, and target feature vector 8 represents the feature information of the crosswalk; target feature vectors 5-8 are the target feature vectors corresponding to the map data.
  • After the target feature vectors corresponding to the three types of target data are obtained, they can be fused based on the cross-attention mechanism to obtain the fused feature vector, and classification prediction is then performed based on the fused feature vector to obtain the traffic light perception result: the traffic light information corresponding to the three directions of going straight, turning left, and turning right, specifically, for example, the colors of the traffic lights corresponding to those three directions.
  • With the traffic light sensing method provided by the embodiments of the present application, multiple different types of target data of the target position are acquired, the target feature vectors corresponding to each type of target data are obtained, and feature fusion based on the cross-attention mechanism is then performed on the target feature vectors; traffic light perception is performed based on the fused feature vector.
  • That is to say, in the embodiments of this application, cross-modal data fusion and comprehensive analysis and reasoning are performed based on multiple different modal data of the environment around the target position to obtain the final perception result. Therefore, compared with perception methods that rely only on a single modality such as image data, the perception stability and accuracy of the embodiments of the present application are higher.
  • The traffic light sensing method of this embodiment can be executed by any appropriate electronic device with data processing capabilities, including but not limited to servers, PCs, and the like.
  • Referring to Figure 3, Figure 3 is a flow chart of the steps of a traffic light sensing method according to Embodiment 2 of the present application. Specifically, the traffic light sensing method provided by this embodiment includes the following steps:
  • Step 302: Acquire multiple types of target data of the target position, where the multiple types of target data include at least two of the following: image data, radar data, and map data.
  • For image data, multiple consecutive frames can be acquired; likewise, for radar data, multiple consecutive frames can be acquired. For example, a preset number of consecutive frames of image data or radar data of the target position preceding the current moment can be obtained.
  • The image data may be images of the target position collected by a camera or the like; the radar data may be point cloud data of the target position collected by a lidar, or three-dimensional data of the target position collected by a millimeter-wave radar, and so on; the map data may contain the position, shape, size, and other information of instance objects at the target position, such as lane lines, crosswalks, and green belts.
  • The multiple types of target data may include any two of image data, radar data, and map data, or all three.
  • Step 304: Perform feature extraction on each type of target data to obtain a feature sequence corresponding to that target data, where the feature sequence contains multiple initial feature vectors.
  • For image data, each initial feature vector represents the feature information contained in one frame out of multiple consecutive frames of image data; for radar data, each initial feature vector represents the feature information contained in one frame out of multiple consecutive frames of radar data; for map data, the multiple initial feature vectors represent the feature information of at least one map instance object.
  • Specifically, for image data, the number of initial feature vectors contained in the feature sequence is the same as the number of frames of image data: one initial feature vector corresponds to one frame of image data and characterizes the feature information contained in that frame. For example, when there are 3 frames of image data, there are 3 corresponding initial feature vectors, each obtained by feature extraction on one frame of image data.
  • Similarly, for radar data, the number of initial feature vectors contained in the feature sequence is the same as the number of frames of radar data: one initial feature vector corresponds to one frame of radar data and characterizes the feature information contained in that frame. For example, when there are 3 frames of radar data in total, there are also 3 corresponding initial feature vectors, each obtained by feature extraction on one frame of radar data.
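  • As a simple sketch of building such a feature sequence (the extractor here is a stand-in; a real system would use a trained feature extraction model):

```python
import numpy as np

def build_feature_sequence(frames, extract_fn):
    """One initial feature vector per frame: the feature sequence contains
    exactly as many initial feature vectors as there are frames."""
    return [extract_fn(frame) for frame in frames]

# Stand-in extractor: flatten the frame and keep the first 4 values.
dummy_extract = lambda frame: np.asarray(frame, dtype=np.float32).ravel()[:4]
seq = build_feature_sequence(
    [np.ones((2, 2)), np.zeros((2, 2)), np.eye(2)], dummy_extract)
print(len(seq))  # 3 frames -> 3 initial feature vectors
```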
  • For map data, a feature sequence containing multiple initial feature vectors can be obtained by vectorization, where the initial feature vectors represent the feature information of the map instance objects (such as lane lines, crosswalks, and green belts) in the map.
  • For example, for a lane line with a length of 200 meters, the first 100 meters of the lane line can be represented by a first initial feature vector and the following 100 meters by a second initial feature vector, where the first initial feature vector is obtained by vectorizing the coordinate positions of the start and end points of the first 100 meters of the lane line, and the second initial feature vector is obtained by vectorizing the coordinate positions of the start and end points of the following 100 meters.
  • Step 306: Based on the self-attention mechanism, perform feature fusion on the initial feature vectors in the feature sequence corresponding to each type of target data to obtain the target feature vector corresponding to each type of target data.
  • When the target data is image data or radar data, the process of performing feature fusion on the initial feature vectors in the feature sequence corresponding to the target data to obtain the corresponding target feature vector includes: selecting one initial feature vector from the feature sequence as the base initial vector; calculating the attention values of the remaining initial feature vectors based on the correlation between the base initial vector and the remaining initial feature vectors; and updating the base initial vector based on those attention values to obtain the target feature vector corresponding to the target data.
  • The correlation between the base initial vector and the remaining initial feature vectors can be characterized by attention weights: the higher the correlation between the base initial vector and a remaining initial feature vector, the greater that vector's attention weight; conversely, the lower the correlation, the smaller its attention weight.
  • The above attention weights can be calculated using an existing attention mechanism (attention method).
  • Calculating the attention values of the remaining initial feature vectors may include: using the attention mechanism to calculate the attention weights of the remaining initial feature vectors, and then using the product of each attention weight and the corresponding remaining initial feature vector as that vector's attention value.
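  • A minimal sketch of this computation, assuming dot-product correlation and softmax-normalized attention weights (the patent only requires an existing attention mechanism, so these choices are assumptions):

```python
import numpy as np

def attention_values(base_vec, rest_vecs):
    """Attention value of each remaining initial feature vector: its
    attention weight (correlation with the base vector) times the vector."""
    scores = np.array([base_vec @ v for v in rest_vecs])  # correlations
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                              # softmax weights
    return [w * v for w, v in zip(weights, rest_vecs)]
```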
  • Generally, the later the timestamp of a frame of data, the more important the feature information it contains. Therefore, in order for the final target feature vector to better characterize the feature information in the target data, when the target data is image data or radar data, the initial feature vector corresponding to the last frame of image data or radar data can be used as the base initial vector, and the base initial vector is updated based on the attention values of the remaining initial feature vectors to obtain the target feature vector corresponding to the target data.
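  • Reusing the `attention_values` sketch above, the temporal fusion for an image or radar feature sequence could then look like this (assuming at least two frames and update-by-addition, which the patent leaves open):

```python
def fuse_frame_sequence(initial_vecs):
    """Self-attention fusion for an image/radar feature sequence: take the
    last frame's initial feature vector as the base (latest data matters
    most) and update it with the attention values of the earlier frames."""
    base, rest = initial_vecs[-1], list(initial_vecs[:-1])
    return base + sum(attention_values(base, rest))  # the target feature vector
```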
  • When the target data is map data, the process of performing feature fusion on the initial feature vectors in the feature sequence corresponding to the target data to obtain the corresponding target feature vector includes: for each map instance object, performing feature fusion on its initial feature vectors based on the self-attention mechanism to obtain multiple self-updated feature vectors; and performing a max pooling operation on the multiple self-updated feature vectors of each map instance object to obtain the target feature vector of the target data.
  • Specifically, the multiple self-updated feature vectors of each map instance object can be obtained in the following way. Assume that the map data contains only one map instance object, a lane line, whose corresponding initial feature vectors are initial feature vector 1 and initial feature vector 2.
  • The process of obtaining the target feature vector of the map data may then include: for initial feature vector 1, calculating the attention value of initial feature vector 2 based on the correlation (attention weight) between initial feature vector 1 and initial feature vector 2, and updating initial feature vector 1 based on that attention value to obtain self-updated feature vector 1; similarly, for initial feature vector 2, calculating the attention value of initial feature vector 1 based on the correlation (attention weight) between initial feature vector 2 and initial feature vector 1, and updating initial feature vector 2 based on that attention value to obtain self-updated feature vector 2; and then performing a max pooling operation on self-updated feature vector 1 and self-updated feature vector 2 (taking, at each position, the largest element among the updated feature vectors as the element of the target feature vector at that position) to obtain the target feature vector of the target data (that is, of the lane line).
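  • Continuing with the `attention_values` sketch above, the self-update and max pooling for one map instance object could look like this (lists of equal-length 1-D vectors are assumed):

```python
import numpy as np

def map_instance_target_vector(initial_vecs):
    """For one map instance object: self-update each initial feature vector
    against the others, then element-wise max-pool the self-updated vectors
    into the target feature vector."""
    updated = []
    for i, vec in enumerate(initial_vecs):
        rest = initial_vecs[:i] + initial_vecs[i + 1:]
        updated.append(vec + sum(attention_values(vec, rest)))
    # Max pooling: keep the largest element at each position.
    return np.max(np.stack(updated), axis=0)
```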
  • Step 308: For each type of target feature vector, calculate the attention values of the other types of target feature vectors based on the correlation between this type of target feature vector and the other types of target feature vectors.
  • The correlation between this type of target feature vector and the other types of target feature vectors can be characterized by attention weights: the higher the correlation, the greater the attention weights of the other types of target feature vectors; conversely, the lower the correlation, the smaller their attention weights.
  • These attention weights can also be calculated using an existing attention mechanism (attention method).
  • Specifically, for each other type of target feature vector, the correlation (attention weight) between that target feature vector and this type of target feature vector can be calculated, and the product of that correlation (attention weight) and that other type of target feature vector is used as the attention value of that other type of target feature vector.
  • For example, for target feature vector 1, the process of calculating the attention value of target feature vector 2 includes: first calculating the correlation (attention weight) between target feature vector 1 and target feature vector 2, and then using the product of that correlation (attention weight) and target feature vector 2 as the attention value of target feature vector 2.
  • Step 310: Update this type of target feature vector based on the attention values of the other types of target feature vectors to obtain an updated target vector. It is then determined whether a preset update stop condition is reached; if not, the updated target vector is used as the new target feature vector and the process returns to step 308; if so, step 312 is executed.
  • Specifically, the sum of this type of target feature vector and the attention values of the other types of target feature vectors can be used as the updated target vector corresponding to this type of target feature vector.
  • The update stop condition can be customized according to actual needs, and its specific content is not limited here. For example, the update stop condition can be that the number of update rounds reaches a preset number; it can also be that the correlation (attention weight) between the target vectors obtained in two successive updates is greater than a preset correlation threshold (attention weight threshold), and so on.
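  • Steps 308 and 310 can be sketched as the loop below, again reusing `attention_values`; a fixed round count stands in for the update stop condition (the threshold-based condition would work the same way):

```python
def cross_update(target_vecs, max_rounds=3):
    """Iteratively update each type's target feature vector with the
    attention values of the other types' target feature vectors, stopping
    after a preset number of update rounds."""
    vecs = list(target_vecs)
    for _ in range(max_rounds):  # stand-in for the preset stop condition
        vecs = [v + sum(attention_values(v, vecs[:i] + vecs[i + 1:]))
                for i, v in enumerate(vecs)]
    return vecs
```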
  • Step 312: Perform fusion processing on the various updated target vectors to obtain a fused feature vector.
  • Specifically, the sum of the various updated target vectors can be used directly as the fused feature vector; alternatively, a weight value can be set separately for each updated target vector and a weighted sum of the various updated target vectors computed according to those weight values; or a max pooling operation can be performed on the various updated target vectors to obtain the fused feature vector; and so on.
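  • The three fusion options just described can be sketched as follows (the mode names are illustrative only):

```python
import numpy as np

def fuse_updated_vectors(updated_vecs, mode="sum", weights=None):
    """Fuse the updated target vectors into one fused feature vector by plain
    sum, weighted sum with per-vector weights, or element-wise max pooling."""
    stacked = np.stack(updated_vecs)
    if mode == "sum":
        return stacked.sum(axis=0)
    if mode == "weighted":
        return (np.asarray(weights)[:, None] * stacked).sum(axis=0)
    if mode == "max":
        return stacked.max(axis=0)
    raise ValueError(f"unknown fusion mode: {mode}")
```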
  • Step 314: Perform classification prediction based on the fused feature vector to obtain the traffic light perception result of the target position.
  • Specifically, the traffic light perception result finally obtained by the embodiments of the present application may include, for the target position, the traffic light information for the straight, left-turn, and right-turn directions, specifically, for example, the traffic light colors for those three directions.
  • Any existing classification prediction method can be used to obtain the final perception result based on the fused feature vector, for example, through a classification prediction model used for classification prediction.
  • When a classification prediction model is used, it can be a classifier structure with three branches, where each branch outputs a binary classification result to predict the traffic light color for one of the three directions: straight, left turn, or right turn.
  • The specific structure of the classification prediction model is not limited; for example, a multi-layer perceptron model with a relatively simple structure can be used, and so on.
  • In a feasible manner, the feature extraction on each type of target data in step 304 can be performed based on a feature extraction model; the self-attention-based feature fusion on the initial feature vectors in each feature sequence in step 306 can be performed based on a self-attention model (for example, a transformer model based on the self-attention mechanism); steps 308 to 310 can be performed based on a cross-attention model (for example, a transformer model based on the cross-attention mechanism); and step 314 can be performed based on the classification prediction model.
  • It can be seen that, after the target data is obtained, the traffic light sensing method provided by the embodiments of the present application can output the final sensing result based on a series of machine learning models. In other words, the embodiments of the present application provide an end-to-end traffic light sensing solution that requires no complex post-processing operations, so the solution is simpler and has a wider scope of application.
  • Referring to Figure 4, which is a schematic diagram of a scene corresponding to Embodiment 2 of the present application, the embodiment of the present application is described below using a specific scenario example:
  • Assume the image data consists of 3 consecutive frames: the first, second, and third frames of image data; the radar data likewise consists of 3 consecutive frames: the first, second, and third frames of radar data. Feature extraction is performed on each of the three frames of image data to obtain the feature sequence corresponding to the image data (in the feature sequence corresponding to the image data in Figure 4, each open circle represents the initial feature vector corresponding to one frame of image data). Feature extraction is performed separately on each of the three frames of radar data to obtain the feature sequence composed of the initial feature vectors corresponding to each frame of radar data (assuming the radar data contains 3 targets in total, a pedestrian, vehicle 1, and vehicle 2, then in the feature sequence corresponding to the radar data in Figure 4, the 3 solid circles in each column represent the initial feature vectors of one frame of radar data, one solid circle represents the initial feature vector of one target in that frame, and the 3 solid circles in each row represent the initial feature vectors of the same target in different radar frames). Feature extraction (vectorized representation) is performed on the map data to obtain a feature sequence composed of multiple initial feature vectors corresponding to the map data (assuming the map data contains 4 map instance objects in total, lane line 1, lane line 2, lane line 3, and a crosswalk, then in the map data of Figure 4 each arrowed line segment represents one initial feature vector, where lane line 1 corresponds to 2 initial feature vectors, lane line 2 corresponds to 2 initial feature vectors, lane line 3 corresponds to 2 initial feature vectors, and the crosswalk corresponds to 4 initial feature vectors).
  • Then, based on the self-attention mechanism, feature fusion is performed on the initial feature vectors in each feature sequence to obtain the corresponding target feature vectors. Specifically: feature fusion is performed on the initial feature vectors in the feature sequence corresponding to the image data to obtain target feature vector 1 corresponding to the image data; feature fusion is performed on the initial feature vectors in the feature sequence corresponding to the radar data (fusing the initial feature vectors in each row separately) to obtain target feature vectors 2, 3, and 4 corresponding to the radar data; and feature fusion is performed on the initial feature vectors in the feature sequence corresponding to the map data (fusing the initial feature vectors of the same map instance object separately) to obtain target feature vectors 5, 6, 7, and 8 corresponding to the map data. Finally, based on the cross-attention mechanism, target feature vectors 1-8 are fused to obtain the fused feature vector, and classification prediction is then performed based on the fused feature vector to obtain the traffic light perception result: the traffic light colors corresponding to the three directions of going straight, turning left, and turning right.
  • The traffic light sensing method, device, equipment and storage medium provided by the embodiments of the present application acquire multiple different types of target data of the target position, obtain the target feature vectors corresponding to each type of target data, and then perform cross-attention-based feature fusion on the target feature vectors; traffic light perception is performed based on the fused feature vector. In other words, in the embodiments of this application, cross-modal data fusion and comprehensive analysis and reasoning are performed based on multiple different modal data of the environment around the target position to obtain the final perception result. Therefore, compared with perception methods that rely only on a single modality such as image data, the perception stability and accuracy of the embodiments of the present application are higher.
  • In addition, in the embodiments of the present application, feature fusion based on the self-attention mechanism is first performed on the initial feature vectors of multiple consecutive image frames or radar frames, and on the initial feature vectors of the different map instance objects in the map data, to obtain the target feature vectors corresponding to the different target data.
  • This self-attention-based feature fusion correlates and fuses the image or radar sequence with the historical states of the traffic participants in the surrounding environment, so the resulting target feature vectors contain richer and more important information. The subsequent logical reasoning based on these target feature vectors therefore makes the final traffic light perception result more accurate and stable.
  • The traffic light sensing method of this embodiment can be executed by any appropriate electronic device with data processing capabilities, including but not limited to servers, PCs, and the like.
  • Referring to Figure 5, FIG. 5 is a structural block diagram of a traffic light sensing device according to Embodiment 3 of the present application. The traffic light sensing device provided by the embodiment of the present application includes:
  • the target data acquisition module 502, used to acquire multiple types of target data of the target position, where the multiple types of target data include at least two of the following: image data, radar data, and map data;
  • the target feature vector obtaining module 504, used to perform feature extraction on each type of target data to obtain the target feature vectors corresponding to each type of target data;
  • the fusion module 506, used to fuse the target feature vectors based on the cross-attention mechanism to obtain a fused feature vector;
  • the result obtaining module 508, used to perform classification prediction based on the fused feature vector to obtain the traffic light perception result of the target position.
  • Optionally, the fusion module 506 is specifically used to: for each type of target feature vector, calculate the attention values of the other types of target feature vectors based on the correlation between this type of target feature vector and the other types of target feature vectors; update this type of target feature vector based on those attention values to obtain an updated target vector; and perform fusion processing on the various updated target vectors to obtain the fused feature vector.
  • Optionally, before performing fusion processing on the various updated target vectors to obtain the fused feature vector, the fusion module 506 is further used to: determine whether a preset update stop condition is reached; if not, use the updated target vectors as new target feature vectors and return to the step of calculating the attention values; if so, perform the step of fusing the various updated target vectors.
  • Optionally, when performing the step of fusing the various updated target vectors to obtain the fused feature vector, the fusion module 506 is specifically used to: use the sum of the various updated target vectors as the fused feature vector; or perform a weighted sum of the various updated target vectors according to separately set weight values; or perform a max pooling operation on the various updated target vectors to obtain the fused feature vector.
  • Optionally, the target feature vector obtaining module 504 is specifically used to: perform feature extraction on each type of target data to obtain a feature sequence corresponding to that target data, where the feature sequence contains multiple initial feature vectors; and, based on the self-attention mechanism, perform feature fusion on the initial feature vectors in the feature sequence corresponding to each type of target data to obtain the target feature vector corresponding to each type of target data.
  • For image data, each initial feature vector represents the feature information contained in one frame out of multiple consecutive frames of image data; for radar data, each initial feature vector represents the feature information contained in one frame out of multiple consecutive frames of radar data; for map data, the multiple initial feature vectors represent the feature information of at least one map instance object.
  • Optionally, when the target data is image data or radar data and the target feature vector obtaining module 504 performs feature fusion on the initial feature vectors in the feature sequence corresponding to the target data based on the self-attention mechanism to obtain the target feature vector corresponding to the target data, it is specifically used to: select one initial feature vector from the feature sequence as the base initial vector; calculate the attention values of the remaining initial feature vectors based on the correlation between the base initial vector and the remaining initial feature vectors; and update the base initial vector based on those attention values to obtain the target feature vector corresponding to the target data.
  • Optionally, when the target feature vector obtaining module 504 performs the step of selecting an initial feature vector from the initial feature vectors in the feature sequence corresponding to the target data as the base initial vector, it is specifically used to: use the initial feature vector corresponding to the last frame of image data or radar data as the base initial vector.
  • Optionally, when the target data is map data and the target feature vector obtaining module 504 performs feature fusion on the initial feature vectors in the feature sequence corresponding to the target data based on the self-attention mechanism to obtain the target feature vector corresponding to the target data, it is specifically used to: for each map instance object, perform feature fusion on its initial feature vectors based on the self-attention mechanism to obtain multiple self-updated feature vectors; and perform a max pooling operation on the multiple self-updated feature vectors of each map instance object to obtain the target feature vector of the target data.
  • Optionally, when obtaining the target feature vector corresponding to the image data, the target feature vector obtaining module 504 is specifically used to: perform feature extraction on the image data based on a pre-trained feature extraction model to obtain the target feature vector corresponding to the image data.
  • Optionally, when obtaining the target feature vector corresponding to the radar data, the target feature vector obtaining module 504 is specifically used to: detect the radar data through a pre-trained three-dimensional target detection model to obtain target detection results, and obtain the target feature vector corresponding to the radar data based on the target detection results.
  • Optionally, when obtaining the target feature vector corresponding to the map data, the target feature vector obtaining module 504 is specifically used to: vectorize the map data to obtain the target feature vectors corresponding to the map data.
  • The traffic light sensing device of the embodiments of the present application is used to implement the corresponding traffic light sensing method in the foregoing first or second method embodiment and has the beneficial effects of the corresponding method embodiment, which will not be described again here.
  • In addition, for the functional implementation of each module in the traffic light sensing device, reference can be made to the description of the corresponding parts in the first or second method embodiment, which is likewise not repeated here.
  • Referring to FIG. 6, a schematic structural diagram of an electronic device according to Embodiment 4 of the present application is shown. The specific embodiments of the present application do not limit the specific implementation of the electronic device.
  • As shown in FIG. 6, the electronic device may include: a processor 602, a communication interface 604, a memory 606, and a communication bus 608.
  • The processor 602, the communication interface 604, and the memory 606 communicate with each other through the communication bus 608.
  • Communication interface 604 is used to communicate with other electronic devices or servers.
  • The processor 602 is used to execute the program 610, and may specifically perform the relevant steps in the above traffic light sensing method embodiments.
  • The program 610 may include program code, and the program code may include computer operating instructions.
  • The processor 602 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
  • The one or more processors included in the smart device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
  • The memory 606 is used to store the program 610.
  • The memory 606 may include high-speed RAM memory, and may also include non-volatile memory, such as at least one disk memory.
  • The program 610 can specifically be used to cause the processor 602 to perform the following operations: acquire multiple types of target data of the target position, where the multiple types of target data include at least two of the following: image data, radar data, and map data; perform feature extraction on each type of target data to obtain the target feature vectors corresponding to each type of target data; fuse the target feature vectors based on the cross-attention mechanism to obtain a fused feature vector; and perform classification prediction based on the fused feature vector to obtain the traffic light perception result of the target position.
  • For the specific implementation of each step in the program 610, reference can be made to the corresponding steps and the corresponding descriptions in the units of the above traffic light sensing method embodiments, which will not be repeated here.
  • Those skilled in the art can clearly understand that for the convenience and simplicity of description, the specific working processes of the above-described devices and modules can be referred to the corresponding process descriptions in the foregoing method embodiments, and will not be described again here.
  • Through the electronic device of this embodiment, multiple different types of target data of the target position are acquired, the target feature vectors corresponding to each type of target data are obtained, and feature fusion based on the cross-attention mechanism is then performed on the target feature vectors; traffic light perception is performed based on the fused feature vector. That is to say, in the embodiments of this application, cross-modal data fusion and comprehensive analysis and reasoning are performed based on multiple different modal data of the environment around the target position to obtain the final perception result. Therefore, compared with perception methods that rely only on a single modality such as image data, the perception stability and accuracy of the embodiments of the present application are higher.
  • Embodiments of the present application also provide a computer program product, including computer instructions that instruct a computing device to perform operations corresponding to any one of the traffic light sensing methods in the above method embodiments.
  • It should be noted that each component/step described in the embodiments of this application can be split into more components/steps, and two or more components/steps or partial operations of components/steps can also be combined into new components/steps to achieve the purpose of the embodiments of this application.
  • The above methods according to the embodiments of the present application can be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium (such as a CD ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or as computer code downloaded over a network that was originally stored in a remote recording medium or a non-transitory machine-readable medium and will be stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware such as an ASIC or FPGA.
  • It can be understood that a computer, a processor, a microprocessor controller, or programmable hardware includes storage components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the traffic light sensing methods described herein are implemented. Furthermore, when a general-purpose computer accesses code for implementing the traffic light sensing methods shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for executing these methods.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Traffic Control Systems (AREA)

Abstract

Provided in the embodiments of the present application are a traffic signal lamp sensing method and apparatus, and a device and a storage medium. The traffic signal lamp sensing method comprises: acquiring a plurality of types of target data of a target position, wherein the plurality of types of target data comprise at least two types of the following: image data, radar data and map data; respectively performing feature extraction on the various types of target data to obtain target feature vectors corresponding to the various types of target data; performing fusion processing on the various target feature vectors on the basis of a cross attention mechanism to obtain a fused feature vector; and performing classification prediction on the basis of the fused feature vector to obtain a traffic signal lamp sensing result of the target position. According to the embodiments of the present application, cross-modal data fusion and comprehensive analysis reasoning are performed on the basis of a plurality of types of different modal data of a surrounding environment of a target position, so as to obtain a final sensing result, such that the sensing stability and accuracy are relatively high.

Description

Traffic light sensing method, device, equipment and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on May 30, 2022, with application number 202210599282.6 and entitled "Traffic light sensing method, device, equipment and storage medium", the entire content of which is incorporated into this application by reference.
Technical Field
Embodiments of the present application relate to the field of computer technology, and in particular to a traffic light sensing method, device, equipment and storage medium.
Background
Traffic light perception refers to accurately identifying the color and control direction of traffic lights at an intersection; it is a very important task in fields such as autonomous driving.
In the related art, a common solution for traffic light perception is to obtain image data containing traffic lights and detect the image data through a target detection model to obtain the corresponding perception result.
This solution depends heavily on the image content, so its stability is poor. For example, when the traffic lights are blocked by surrounding objects such as large vehicles, or are invisible in the image due to rainy weather, this solution cannot obtain a perception result.
Summary of the Invention
In view of this, embodiments of the present application provide a traffic light sensing method, device, equipment and storage medium to at least partially solve the above problems.
According to a first aspect of the embodiments of the present application, a traffic light sensing method is provided, including:
acquiring multiple types of target data of a target position, where the multiple types of target data include at least two of the following: image data, radar data, and map data;
performing feature extraction on each type of target data to obtain target feature vectors corresponding to each type of target data;
fusing the target feature vectors based on a cross-attention mechanism to obtain a fused feature vector;
performing classification prediction based on the fused feature vector to obtain a traffic light perception result of the target position.
According to a second aspect of the embodiments of the present application, a traffic light sensing device is provided, including:
a target data acquisition module, used to acquire multiple types of target data of a target position, where the multiple types of target data include at least two of the following: image data, radar data, and map data;
a target feature vector obtaining module, used to perform feature extraction on each type of target data to obtain target feature vectors corresponding to each type of target data;
a fusion module, used to fuse the target feature vectors based on a cross-attention mechanism to obtain a fused feature vector;
a result obtaining module, used to perform classification prediction based on the fused feature vector to obtain a traffic light perception result of the target position.
According to a third aspect of the embodiments of the present application, an electronic device is provided, including: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with each other through the communication bus; the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the traffic light sensing method described in the first aspect.
According to a fourth aspect of the embodiments of the present application, a computer storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the traffic light sensing method described in the first aspect is implemented.
The traffic light sensing method, device, equipment and storage medium provided by the embodiments of the present application acquire multiple different types of target data of the target position, obtain the target feature vectors corresponding to each type of target data, and then perform cross-attention-based feature fusion on the target feature vectors; traffic light perception is performed based on the fused feature vector. That is to say, in the embodiments of this application, cross-modal data fusion and comprehensive analysis and reasoning are performed based on multiple different modal data of the environment around the target position to obtain the final perception result. Therefore, compared with perception methods that rely only on a single modality such as image data, the perception stability and accuracy of the embodiments of the present application are higher.
Brief Description of the Drawings
In order to explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments recorded in the embodiments of this application; for those of ordinary skill in the art, other drawings can also be obtained based on these drawings.
Figure 1 is a flow chart of the steps of a traffic light sensing method according to Embodiment 1 of the present application;
Figure 2 is a schematic diagram of an example scenario in the embodiment shown in Figure 1;
Figure 3 is a flow chart of the steps of a traffic light sensing method according to Embodiment 2 of the present application;
Figure 4 is a schematic diagram of an example scenario in the embodiment shown in Figure 3;
Figure 5 is a structural block diagram of a traffic light sensing device according to Embodiment 3 of the present application;
Figure 6 is a schematic structural diagram of an electronic device according to Embodiment 4 of the present application.
具体实施方式Detailed ways
为了使本领域的人员更好地理解本申请实施例中的技术方案,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅是本申请实施例一部分实施例,而不是全部的实施例。基于本申请实施例中的实施例,本领域普通技术人员所获得的所有其他实施例,都应当属于本申请实施例保护的范围。In order to enable those in the art to better understand the technical solutions in the embodiments of the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below in conjunction with the drawings in the embodiments of the present application. Obviously, the description The embodiments are only part of the embodiments of the present application, rather than all the embodiments. Based on the examples in the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art should fall within the scope of protection of the embodiments of this application.
下面结合本申请实施例附图进一步说明本申请实施例具体实现。The specific implementation of the embodiments of the present application will be further described below with reference to the accompanying drawings of the embodiments of the present application.
实施例一Embodiment 1
参照图1,图1为根据本申请实施例一的一种交通信号灯感知方法的步骤流程图。具体地,本实施例提供的交通信号灯感知方法包括以下步骤:Referring to Figure 1, Figure 1 is a flow chart of a traffic light sensing method according to Embodiment 1 of the present application. Specifically, the traffic light sensing method provided by this embodiment includes the following steps:
步骤102,获取目标位置的多种目标数据,多种目标数据包括以下至少两种:图像数据、雷达数据、地图数据。Step 102: Acquire multiple target data at the target location. The multiple target data include at least two of the following: image data, radar data, and map data.
具体地,目标位置可以为待进行交通信号灯感知的目标路口或者目标路口周围的某一具体位置。图像数据可以为通过相机采集到的目标位置的图像等;雷达数据可以为通过激光雷达采集到的目标位置的点云数据,或者通过毫米波雷达等采集到的目标位置的三维数据,等等;地图数据可以为包含有目标位置车道线、人行横道、绿化带等实例对象的位置、形状以及尺寸等信息的数据。Specifically, the target location may be a target intersection where traffic light sensing is to be performed or a specific location around the target intersection. The image data can be images of the target position collected by cameras, etc.; radar data can be point cloud data of the target position collected by lidar, or three-dimensional data of the target position collected by millimeter wave radar, etc.; The map data may be data containing information such as the location, shape, and size of instance objects such as lane lines, crosswalks, and green belts at the target location.
本申请实施例中,多种目标数据具体可以包括:图像数据、雷达数据,或者,地图数据中的任意两种,也可以上述三种数据均包含。本领域技术人员可以理解,获取的目标数据的种类越多,最终得到的交通信号灯感知结果的准确性以及稳定性也越高。In the embodiment of the present application, the multiple target data may specifically include: image data, radar data, or any two of the map data, or may include all three of the above data. Those skilled in the art can understand that the more types of target data are obtained, the higher the accuracy and stability of the final traffic light sensing result will be.
步骤104,分别对各种目标数据进行特征提取,得到各种目标数据对应的目标特征向量。Step 104: Perform feature extraction on various target data respectively to obtain target feature vectors corresponding to various target data.
具体地,就图像数据而言,可以基于预先训练完成的特征提取模型,对图像数据进行特征提取,得到图像数据对应的目标特征向量;就雷达数据而言,可以通过预先训练完成的三维目标检测模型对雷达数据进行检测,得到目标检测结果;基于目标检测结果,得到雷达数据对应的目标特征向量;就地图数据而言,可以在获取到地图数据之后,对地图数据进行矢量化表示,得到地图数据对应的目标特征向量,例如:就 目标位置的某个具体车道线这一地图实例对象而言,可以获取该车道线上的多个采样点的位置信息,然后以每两个相邻采样点分别作为起点和终点生成向量,该向量即为该车道线实例对象的特征向量,表征该相邻两采样点之间的车道线实例对象。Specifically, as far as image data is concerned, feature extraction can be performed on the image data based on a pre-trained feature extraction model to obtain the target feature vector corresponding to the image data; as far as radar data is concerned, three-dimensional target detection can be accomplished through pre-training. The model detects the radar data and obtains the target detection results; based on the target detection results, the target feature vector corresponding to the radar data is obtained; as far as the map data is concerned, after the map data is obtained, the map data can be vectorized to obtain the map The target feature vector corresponding to the data, for example: For the map instance object of a specific lane line at the target location, the position information of multiple sampling points on the lane line can be obtained, and then each two adjacent sampling points are used as the starting point and end point respectively to generate a vector. The vector It is the feature vector of the lane line instance object, which represents the lane line instance object between the two adjacent sampling points.
Step 106: Fuse the target feature vectors based on a cross-attention mechanism to obtain a fused feature vector.
Specifically, for the target feature vector corresponding to each type of target data, the target feature vector may be adjusted with reference to the degree of similarity between it and the target feature vectors corresponding to the other types of target data, so that the adjusted target feature vector focuses on information related to the other target feature vectors and ignores information weakly correlated with them. All of the adjusted target feature vectors are then fused to obtain the fused feature vector.
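A minimal numpy sketch of this adjustment is given below. The scaled dot product followed by a softmax is an assumed choice of similarity measure; the embodiment only requires some similarity-based weighting, and the residual-style update mirrors the update rule described in Embodiment 2:

```python
import numpy as np

def cross_adjust(query: np.ndarray, others: np.ndarray) -> np.ndarray:
    """query: (d,) target feature vector of one modality.
    others: (m, d) target feature vectors of the remaining modalities."""
    scores = others @ query / np.sqrt(query.shape[0])  # similarity to each other vector
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # attention weights
    return query + weights @ others                    # query plus weighted contributions
```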
Step 108: Perform classification prediction based on the fused feature vector to obtain the traffic light sensing result for the target location.
Specifically, the traffic light sensing result finally obtained in the embodiments of the present application may include the traffic light colors of the target location in the straight-ahead direction, the left-turn direction, and the right-turn direction.
Any existing classification prediction method may be used to obtain the final sensing result based on the fused feature vector, for example, by means of a classification prediction model used for classification prediction. When a classification prediction model is used, the model may be a classifier structure with three branches, each branch outputting a binary classification result to predict the traffic light color in one of the three directions: straight ahead, left turn, or right turn. In addition, the embodiments of the present application do not limit the specific structure of the classification prediction model; for example, a multi-layer perceptron model with a relatively simple structure may be used.
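The sketch below shows one possible shape of such a three-branch head in PyTorch; the shared trunk, the layer sizes, and the two-class output per branch are illustrative assumptions consistent with the description above:

```python
import torch
import torch.nn as nn

class TrafficLightHead(nn.Module):
    """Three branches, one per direction: straight ahead, left turn, right turn."""
    def __init__(self, fused_dim: int = 256, num_classes: int = 2):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(fused_dim, 128), nn.ReLU())
        self.branches = nn.ModuleList([nn.Linear(128, num_classes) for _ in range(3)])

    def forward(self, fused: torch.Tensor):
        h = self.trunk(fused)
        return [branch(h) for branch in self.branches]  # per-direction logits

head = TrafficLightHead()
straight, left, right = head(torch.randn(1, 256))  # one logit pair per direction
```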
Referring to Figure 2, Figure 2 is a schematic diagram of a scene corresponding to Embodiment 1 of the present application. The embodiments of the present application are described below with a specific scene example, with reference to the schematic diagram shown in Figure 2:
Three types of target data of the target location are acquired: image data, radar point cloud data, and map data. Feature extraction is performed on the image data of the target location to obtain the corresponding target feature vector 1. 3D object detection is performed on the radar point cloud data of the target location to obtain three targets at the target location, for example a pedestrian, vehicle 1, and vehicle 2 (in the embodiments of the present application, the types of detectable targets may be preset according to actual needs, and neither the number of types nor the specific content of the detectable preset targets is limited; for example, the detectable preset targets may include three types, namely pedestrians, vehicles, and cyclists. Figure 2 merely takes radar data containing three preset targets as an example, which does not constitute a limitation on the embodiments of the present application). Each target corresponds to one target feature vector (characterizing features of the target such as its type, position, and shape); target feature vector 2, target feature vector 3, and target feature vector 4 in Figure 2 are the target feature vectors corresponding to the radar data. The map data of the target location (such as high-precision map data mainly used for autonomous driving) may be vectorized to obtain the corresponding target feature vectors, where each target feature vector characterizes the feature information of one instance object in the map. Assuming that the map data in Figure 2 contains four instance objects, namely lane line 1, lane line 2, lane line 3, and a crosswalk, then correspondingly target feature vector 5 characterizes the feature information of lane line 1, target feature vector 6 that of lane line 2, target feature vector 7 that of lane line 3, and target feature vector 8 that of the crosswalk; target feature vectors 5 to 8 are the target feature vectors corresponding to the map data. After the target feature vectors corresponding to the three types of target data are obtained, they may be fused based on the cross-attention mechanism to obtain a fused feature vector, and classification prediction is then performed based on the fused feature vector to obtain the traffic light sensing result: the traffic light information corresponding to each of the straight-ahead, left-turn, and right-turn directions, specifically, for example, the colors of the traffic lights in these three directions.
According to the traffic light sensing method provided by the embodiments of the present application, multiple different types of target data of the target location are acquired, the target feature vectors corresponding to the various types of target data are obtained, feature fusion based on the cross-attention mechanism is performed on the target feature vectors, and traffic light sensing is carried out based on the fused feature vector. That is, in the embodiments of the present application, cross-modal data fusion and comprehensive analytical reasoning are performed based on multiple different modalities of data about the environment around the target location to obtain the final sensing result. Therefore, compared with a sensing approach that relies only on the single modality of image data, the embodiments of the present application provide higher sensing stability and accuracy.
The traffic light sensing method of this embodiment may be executed by any appropriate electronic device with data processing capability, including but not limited to a server, a PC, and the like.
Embodiment 2
Referring to Figure 3, Figure 3 is a flowchart of the steps of a traffic light sensing method according to Embodiment 2 of the present application. Specifically, the traffic light sensing method provided by this embodiment includes the following steps:
Step 302: Acquire multiple types of target data of the target location, the multiple types of target data including at least two of the following: image data, radar data, and map data.
Specifically, in the embodiments of the present application, for image data, multiple frames of continuous image data may be acquired at a time; for radar data, multiple frames of continuous radar data may likewise be acquired at a time. For example, continuous image data or radar data of a preset number of frames of the target location before the current moment may be acquired.
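A minimal sketch of keeping the preset number of most recent frames before the current moment follows; the buffer size of 3 is an illustrative assumption:

```python
from collections import deque

NUM_FRAMES = 3  # preset number of frames (assumption)
frame_buffer: deque = deque(maxlen=NUM_FRAMES)

def on_new_frame(frame):
    """Append the latest frame; the deque drops the oldest one automatically."""
    frame_buffer.append(frame)
    return list(frame_buffer)  # frames ordered from oldest to newest
```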
The image data may be images of the target location captured by a camera; the radar data may be point cloud data of the target location collected by a lidar, three-dimensional data of the target location collected by a millimeter-wave radar, or the like; the map data may be data containing information such as the positions, shapes, and sizes of instance objects at the target location, for example lane lines, crosswalks, and green belts.
In the embodiments of the present application, the multiple types of target data may specifically include any two of image data, radar data, and map data, or may include all three. Those skilled in the art can understand that the more types of target data are acquired, the higher the accuracy and stability of the final traffic light sensing result.
Step 304: Perform feature extraction on each type of target data to obtain a feature sequence corresponding to that type of target data, the feature sequence containing multiple initial feature vectors.
For image data, each initial feature vector characterizes the feature information contained in one frame of the multiple frames of continuous image data; for radar data, each initial feature vector characterizes the feature information contained in one frame of the multiple frames of continuous radar data; for map data, the multiple initial feature vectors characterize the feature information of at least one map instance object.
Specifically, for image data, the number of initial feature vectors contained in the feature sequence equals the number of frames of image data; one initial feature vector corresponds to one frame of image data and characterizes the feature information contained in that frame. For example, when there are three frames of image data, there are correspondingly three initial feature vectors, each obtained by performing feature extraction on one frame of image data. Similarly, for radar data, the number of initial feature vectors contained in the feature sequence equals the number of frames of radar data; one initial feature vector corresponds to one frame of radar data and characterizes the feature information contained in that frame. For example, when there are three frames of radar data, there are correspondingly three initial feature vectors, each obtained by performing feature extraction on one frame of radar data.
For map data, after vectorization, a feature sequence containing multiple initial feature vectors can be obtained. These initial feature vectors characterize the feature information of the map instance objects in the map (such as lane lines, crosswalks, and green belts). For example, for a lane line with a length of 200 meters, the first 100 meters of the lane line may be represented by a first initial feature vector, and the last 100 meters by a second initial feature vector, where the first initial feature vector is obtained by vectorizing the coordinate positions of the start point and end point of the first 100-meter segment, and the second initial feature vector is obtained by vectorizing the coordinate positions of the start point and end point of the last 100-meter segment.
Step 306: Based on a self-attention mechanism, perform feature fusion on the initial feature vectors in the feature sequence corresponding to each type of target data to obtain the target feature vector corresponding to each type of target data.
Specifically, if the target data is image data or radar data, the process of performing feature fusion, based on the self-attention mechanism, on the initial feature vectors in the feature sequence corresponding to the target data to obtain the corresponding target feature vector includes:
selecting one initial feature vector from the initial feature vectors in the feature sequence corresponding to the target data as a base initial vector; calculating the attention values of the remaining initial feature vectors based on the correlation between the base initial vector and the remaining initial feature vectors; and updating the base initial vector based on the attention values of the remaining initial feature vectors to obtain the target feature vector corresponding to the target data.
The correlation between the base initial vector and the remaining initial feature vectors characterizes the degree of association between them. When the base initial vector is updated, this degree of association may be expressed by attention weights: the higher the degree of association between the base initial vector and a remaining initial feature vector, the larger the attention weight of that remaining initial feature vector when the base initial vector is updated; conversely, the lower the degree of association, the smaller the attention weight. The attention weights may be calculated using an existing attention mechanism.
Specifically, calculating the attention values of the remaining initial feature vectors based on the correlation between the base initial vector and the remaining initial feature vectors may include:
using an attention mechanism to calculate the attention weights of the remaining initial feature vectors, and then taking the product of each attention weight and the corresponding remaining initial feature vector as the attention value of that remaining initial feature vector.
Further, in multiple frames of continuous image data or radar data, the later the timestamp of a frame, the more important the feature information it contains. Therefore, in order for the finally obtained target feature vector to better characterize the feature information in the target data, in some embodiments, if the target data is image data or radar data, the initial feature vector corresponding to the frame of image data or radar data with the latest timestamp may be selected from the initial feature vectors in the feature sequence corresponding to the target data as the base initial vector, and the base initial vector is then updated based on the attention values of the remaining initial feature vectors to obtain the target feature vector corresponding to the target data.
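A minimal numpy sketch of this temporal fusion follows: the latest frame's initial feature vector serves as the base vector and is updated with the attention values of the remaining frames. The dot-product score with a softmax is an assumed realization of the correlation-based weights:

```python
import numpy as np

def fuse_temporal(frame_features: np.ndarray) -> np.ndarray:
    """frame_features: (T, d) initial feature vectors, oldest to newest; assumes T >= 2."""
    base, rest = frame_features[-1], frame_features[:-1]  # latest frame as base
    scores = rest @ base / np.sqrt(base.shape[0])         # correlation with the base
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                              # attention weights
    attention_values = weights[:, None] * rest            # weight times vector
    return base + attention_values.sum(axis=0)            # updated base vector
```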
If the target data is map data, the process of performing feature fusion, based on the self-attention mechanism, on the initial feature vectors in the feature sequence corresponding to the target data to obtain the corresponding target feature vector includes:
performing feature fusion, based on the self-attention mechanism, on the multiple initial feature vectors characterizing the feature information of each map instance object to obtain multiple self-updated feature vectors of each map instance object; and performing a max pooling operation on the multiple self-updated feature vectors of each map instance object to obtain the target feature vector of the target data.
Specifically, the multiple self-updated feature vectors of each map instance object may be obtained in the following manner:
for each initial feature vector of the map instance object, calculating the attention values of the remaining initial feature vectors based on the correlation between this initial feature vector and the remaining initial feature vectors, and updating this initial feature vector based on the attention values of the remaining initial feature vectors to obtain a self-updated feature vector.
For example, suppose that certain map data contains only one map instance object, a lane line, and that the initial feature vectors corresponding to the lane line are initial feature vector 1 and initial feature vector 2. The process of obtaining the target feature vector of this map data may then include: for initial feature vector 1, calculating the attention value of initial feature vector 2 based on the correlation (attention weight) between initial feature vector 1 and initial feature vector 2, and updating initial feature vector 1 based on the attention value of initial feature vector 2 to obtain self-updated feature vector 1 corresponding to initial feature vector 1; similarly, for initial feature vector 2, calculating the attention value of initial feature vector 1 based on the correlation (attention weight) between initial feature vector 2 and initial feature vector 1, and updating initial feature vector 2 based on the attention value of initial feature vector 1 to obtain self-updated feature vector 2 corresponding to initial feature vector 2; and then performing a max pooling operation on self-updated feature vector 1 and self-updated feature vector 2 (taking the largest element at each position across the self-updated feature vectors as the element value at the corresponding position of the target feature vector) to obtain the target feature vector of the target data, that is, of the lane line.
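The sketch below mirrors this worked example for an arbitrary number of initial feature vectors of one map instance: each vector is self-updated with the attention values of the others, and the self-updated vectors are then max-pooled elementwise. The softmax over normalized dot products is an assumed realization of the attention weights:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_map_instance(instance_vectors: np.ndarray) -> np.ndarray:
    """instance_vectors: (K, d) initial feature vectors of one map instance; assumes K >= 2."""
    updated = []
    for i, v in enumerate(instance_vectors):
        rest = np.delete(instance_vectors, i, axis=0)
        weights = softmax(rest @ v / np.sqrt(v.shape[0]))  # correlation-based weights
        updated.append(v + weights @ rest)                 # self-updated feature vector
    # Max pooling: largest element at each position across the self-updated vectors.
    return np.max(np.stack(updated), axis=0)
```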
Step 308: For each type of target feature vector, calculate the attention values of the other types of target feature vectors based on the correlation between this type of target feature vector and the other types of target feature vectors.
The correlation between this type of target feature vector and the other types of target feature vectors characterizes the degree of association between them. When calculating the attention values of the other types of target feature vectors, this degree of association may be expressed by attention weights: the higher the degree of association between this target feature vector and another target feature vector, the larger the attention weight of that other target feature vector; conversely, the lower the degree of association, the smaller the attention weight. These attention weights may likewise be calculated using an existing attention mechanism.
Specifically, for each of the other types of target feature vectors, the correlation (attention weight) between that target feature vector and the present target feature vector may be calculated, and the product of this correlation (attention weight) and that target feature vector is then taken as the attention value of that target feature vector.
For example, for a target feature vector 1 and another target feature vector 2, the process of calculating the attention value of target feature vector 2 includes: first calculating the correlation (attention weight) between target feature vector 1 and target feature vector 2, and then taking the product of this correlation (attention weight) and target feature vector 2 as the attention value of target feature vector 2.
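A minimal sketch of this per-pair computation follows; using the normalized dot product as the correlation (attention weight) is an assumption, and in practice the weights of all remaining vectors would typically be normalized together, as in the other sketches here:

```python
import numpy as np

def attention_value(v1: np.ndarray, v2: np.ndarray) -> np.ndarray:
    """Attention value of v2 with respect to v1: correlation times v2."""
    weight = float(v1 @ v2) / np.sqrt(v1.shape[0])  # correlation (attention weight)
    return weight * v2
```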
Step 310: Update this type of target feature vector based on the attention values of the other types of target feature vectors to obtain an updated target vector. Then determine whether a preset update stop condition is reached; if not, take the updated target vector as a new target feature vector and return to step 308; if so, execute step 312.
Specifically, for each type of target feature vector, after the attention values of the other types of target feature vectors are obtained, the sum of this target feature vector and the attention values of the other types of target feature vectors may be taken as the updated target vector corresponding to this target feature vector.
In addition, in the embodiments of the present application, the update stop condition may be customized according to actual needs, and its specific content is not limited here. For example, the update stop condition may be that the number of times an updated target vector has been obtained reaches a preset number; the update stop condition may also be that the correlation (attention weight) between the target vectors of two successive updates is greater than a preset correlation threshold (attention weight threshold), and so on.
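Below is a minimal numpy sketch of the iterative update of steps 308 and 310 using the fixed-round stop condition (one of the options named above); the softmax-normalized dot-product weights are an assumption:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_update(vectors: np.ndarray, rounds: int = 2) -> np.ndarray:
    """vectors: (M, d) target feature vectors, one per modality or instance."""
    for _ in range(rounds):                        # preset update count (assumption)
        updated = np.empty_like(vectors)
        for i, v in enumerate(vectors):
            rest = np.delete(vectors, i, axis=0)
            weights = softmax(rest @ v / np.sqrt(v.shape[0]))
            updated[i] = v + weights @ rest        # vector plus others' attention values
        vectors = updated
    return vectors
```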
Step 312: Perform fusion processing on the updated target vectors to obtain a fused feature vector.
The embodiments of the present application do not limit the specific fusion processing method. For example, the sum of the updated target vectors may be used directly as the fused feature vector; alternatively, a weight value may be set for each updated target vector, and a weighted sum of the updated target vectors is computed based on the set weight values to obtain the fused feature vector; a max pooling operation may also be performed on the updated target vectors to obtain the fused feature vector; and so on.
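The three fusion options named above can be sketched as follows; the choice among them, and any weight values, are left open by the embodiment:

```python
import numpy as np

def fuse_sum(updated: np.ndarray) -> np.ndarray:
    return updated.sum(axis=0)                 # plain sum of the updated vectors

def fuse_weighted(updated: np.ndarray, w: np.ndarray) -> np.ndarray:
    return (w[:, None] * updated).sum(axis=0)  # weighted sum with per-vector weights

def fuse_maxpool(updated: np.ndarray) -> np.ndarray:
    return updated.max(axis=0)                 # elementwise max pooling
```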
Step 314: Perform classification prediction based on the fused feature vector to obtain the traffic light sensing result for the target location.
Specifically, the traffic light sensing result finally obtained in the embodiments of the present application may include the traffic light information of the target location in the straight-ahead, left-turn, and right-turn directions, specifically, for example, the traffic light colors in these three directions.
Any existing classification prediction method may be used to obtain the final sensing result based on the fused feature vector, for example, by means of a classification prediction model used for classification prediction. When a classification prediction model is used, the model may be a classifier structure with three branches, each branch outputting a binary classification result to predict the traffic light color in one of the three directions: straight ahead, left turn, or right turn. In addition, the embodiments of the present application do not limit the specific structure of the classification prediction model; for example, a multi-layer perceptron model with a relatively simple structure may be used.
In the embodiments of the present application, the feature extraction for each type of target data in step 304 may be performed based on a feature extraction model; in step 306, the feature fusion, based on the self-attention mechanism, of the initial feature vectors in the feature sequence corresponding to each type of target data may be performed based on a self-attention model (for example, a transformer model based on the self-attention mechanism); steps 308 to 310 may be performed based on a cross-attention model (for example, an attention-mechanism-based transformer model); and step 314 may be performed based on a classification prediction model. Therefore, in the traffic light sensing method provided by the embodiments of the present application, once the target data is acquired, the final sensing result can be output based on a series of machine learning models. In other words, the embodiments of the present application provide an end-to-end traffic light sensing solution that requires no complex post-processing operations; the solution is therefore simpler and more widely applicable.
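To make the end-to-end flow concrete, the sketch below chains simplified stand-ins for these stages on random features; all dimensions are illustrative, the map branch is reduced to max pooling for brevity, and the real pipeline would use the trained models named above:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # shared feature dimension (assumption)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def self_fuse(seq):
    """Temporal self-attention over one modality, latest frame as base vector."""
    base, rest = seq[-1], seq[:-1]
    return base + softmax(rest @ base / np.sqrt(D)) @ rest

image_seq = rng.normal(size=(3, D))  # stand-in: 3 image-frame initial vectors
radar_seq = rng.normal(size=(3, D))  # stand-in: 3 radar-frame vectors of one target
map_seq = rng.normal(size=(2, D))    # stand-in: 2 vectors of one map instance

targets = np.stack([self_fuse(image_seq), self_fuse(radar_seq), map_seq.max(axis=0)])

for _ in range(2):                   # cross-attention update rounds (assumption)
    new_targets = []
    for i, v in enumerate(targets):
        rest = np.delete(targets, i, axis=0)
        new_targets.append(v + softmax(rest @ v / np.sqrt(D)) @ rest)
    targets = np.stack(new_targets)

fused = targets.max(axis=0)          # fusion by max pooling (one of the options)
# `fused` would then be passed to the classification prediction model.
```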
Referring to Figure 4, Figure 4 is a schematic diagram of a scene corresponding to Embodiment 2 of the present application. The embodiments of the present application are described below with a specific scene example, with reference to the schematic diagram shown in Figure 4:

Three types of target data of the target location are acquired: image data, radar point cloud data, and map data. The image data consists of three consecutive frames: a first frame, a second frame, and a third frame of image data; the radar data likewise consists of three consecutive frames: a first frame, a second frame, and a third frame of radar data. Feature extraction is performed on each of the three frames of image data to obtain the feature sequence corresponding to the image data (in the feature sequence corresponding to the image data in Figure 4, each open circle represents one initial feature vector corresponding to one frame of image data). Feature extraction is performed on each of the three frames of radar data to obtain a feature sequence composed of the initial feature vectors corresponding to the individual radar frames (assuming the radar data contains three targets in total, a pedestrian, vehicle 1, and vehicle 2, then in the feature sequence corresponding to the radar data in Figure 4, the three solid circles in each column represent the initial feature vectors of one frame of radar data, with one solid circle representing the initial feature vector of one target in that frame, and the three solid circles in each row represent the initial feature vectors of the same target across different radar frames). Feature extraction (vectorized representation) is performed on the map data to obtain a feature sequence composed of the multiple initial feature vectors corresponding to the map data (assuming the map data contains four map instance objects in total, lane line 1, lane line 2, lane line 3, and a crosswalk, then in the feature sequence corresponding to the map data in Figure 4, each arrowed line, whether solid or dashed, represents one initial feature vector, with lane line 1 corresponding to two initial feature vectors, lane line 2 to two, lane line 3 to two, and the crosswalk to four). Based on the self-attention mechanism, feature fusion is performed on the initial feature vectors in the feature sequence of each type of target data to obtain the target feature vector corresponding to each type of target data. Specifically, feature fusion is performed on the initial feature vectors in the feature sequence corresponding to the image data to obtain target feature vector 1 corresponding to the image data; feature fusion is performed on the initial feature vectors in the feature sequence corresponding to the radar data (the initial feature vectors in each row are fused separately) to obtain target feature vector 2, target feature vector 3, and target feature vector 4 corresponding to the radar data; feature fusion is performed on the initial feature vectors in the feature sequence corresponding to the map data (the initial feature vectors corresponding to each map instance object are fused separately) to obtain target feature vector 5, target feature vector 6, target feature vector 7, and target feature vector 8 corresponding to the map data. Finally, based on the cross-attention mechanism, target feature vectors 1 to 8 are fused to obtain a fused feature vector, and classification prediction is then performed based on the fused feature vector to obtain the traffic light sensing result: the traffic light colors corresponding to each of the straight-ahead, left-turn, and right-turn directions.
According to the traffic light sensing method, apparatus, device, and storage medium provided by the embodiments of the present application, multiple different types of target data of the target location are acquired, the target feature vectors corresponding to the various types of target data are obtained, feature fusion based on the cross-attention mechanism is performed on the target feature vectors, and traffic light sensing is carried out based on the fused feature vector. That is, in the embodiments of the present application, cross-modal data fusion and comprehensive analytical reasoning are performed based on multiple different modalities of data about the environment around the target location to obtain the final sensing result. Therefore, compared with a sensing approach that relies only on the single modality of image data, the embodiments of the present application provide higher sensing stability and accuracy.
In addition, before the target feature vectors corresponding to the different types of target data are fused based on the cross-attention mechanism, feature fusion is first performed, based on the self-attention mechanism, on the initial feature vectors of multiple consecutive image frames or radar frames, and on the initial feature vectors of the different map instance objects in the map data, so as to obtain the target feature vectors corresponding to the different types of target data. This self-attention-based feature fusion associates and fuses the image or radar sequence with the historical states of the traffic participants in the surrounding environment. Compared with directly obtaining a target feature vector by feature extraction from only a single frame of image or radar data, the target feature vectors thus contain richer and more important information. Therefore, the subsequent logical reasoning based on these target feature vectors yields a traffic light sensing result with higher accuracy and stability.
The traffic light sensing method of this embodiment may be executed by any appropriate electronic device with data processing capability, including but not limited to a server, a PC, and the like.
Embodiment 3
Referring to Figure 5, Figure 5 is a structural block diagram of a traffic light sensing apparatus according to Embodiment 3 of the present application. The traffic light sensing apparatus provided by the embodiments of the present application includes:
a target data acquisition module 502, configured to acquire multiple types of target data of a target location, the multiple types of target data including at least two of the following: image data, radar data, and map data;

a target feature vector obtaining module 504, configured to perform feature extraction on each type of target data to obtain the target feature vectors corresponding to each type of target data;

a fusion module 506, configured to fuse the target feature vectors based on a cross-attention mechanism to obtain a fused feature vector; and

a result obtaining module 508, configured to perform classification prediction based on the fused feature vector to obtain the traffic light sensing result of the target location.
Optionally, in some embodiments, the fusion module 506 is specifically configured to:

for each type of target feature vector, calculate the attention values of the other types of target feature vectors based on the correlation between this type of target feature vector and the other types of target feature vectors;

update this type of target feature vector based on the attention values of the other types of target feature vectors to obtain an updated target vector; and

perform fusion processing on the updated target vectors to obtain the fused feature vector.
Optionally, in some embodiments, before performing fusion processing on the updated target vectors to obtain the fused feature vector, the fusion module 506 is further configured to:

determine whether a preset update stop condition is reached;

if not, take the updated target vector as a new target feature vector and return to the step of calculating, for each type of target feature vector, the attention values of the other types of target feature vectors based on the correlation between this type of target feature vector and the other types of target feature vectors, until the update stop condition is met.
Optionally, in some embodiments, when performing the step of fusing the updated target vectors to obtain the fused feature vector, the fusion module 506 is specifically configured to:

perform a max pooling operation on the updated target vectors to obtain the fused feature vector.
Optionally, in some embodiments, the target feature vector obtaining module 504 is specifically configured to:

perform feature extraction on each type of target data to obtain a feature sequence corresponding to that type of target data, the feature sequence containing multiple initial feature vectors;

based on a self-attention mechanism, perform feature fusion on the initial feature vectors in the feature sequence corresponding to each type of target data to obtain the target feature vector corresponding to each type of target data;

wherein, for image data, each initial feature vector characterizes the feature information contained in one frame of the multiple frames of continuous image data; for radar data, each initial feature vector characterizes the feature information contained in one frame of the multiple frames of continuous radar data; and for map data, the multiple initial feature vectors characterize the feature information of at least one map instance object.
Optionally, in some embodiments, if the target data is image data or radar data, when performing the step of feature fusion, based on the self-attention mechanism, on the initial feature vectors in the feature sequence corresponding to the target data to obtain the target feature vector corresponding to the target data, the target feature vector obtaining module 504 is specifically configured to:

select one initial feature vector from the initial feature vectors in the feature sequence corresponding to the target data as a base initial vector;

calculate the attention values of the remaining initial feature vectors based on the correlation between the base initial vector and the remaining initial feature vectors; and

update the base initial vector based on the attention values of the remaining initial feature vectors to obtain the target feature vector corresponding to the target data.
Optionally, in some embodiments, when performing the step of selecting one initial feature vector from the initial feature vectors in the feature sequence corresponding to the target data as the base initial vector, the target feature vector obtaining module 504 is specifically configured to:

select, from the initial feature vectors in the feature sequence corresponding to the target data, the initial feature vector corresponding to the frame of image data or radar data with the latest timestamp as the base initial vector.
Optionally, in some embodiments, if the target data is map data, when performing the step of feature fusion, based on the self-attention mechanism, on the initial feature vectors in the feature sequence corresponding to the target data to obtain the target feature vector corresponding to the target data, the target feature vector obtaining module 504 is specifically configured to:

perform feature fusion, based on the self-attention mechanism, on the multiple initial feature vectors characterizing the feature information of each map instance object to obtain multiple self-updated feature vectors of each map instance object; and

perform a max pooling operation on the multiple self-updated feature vectors of each map instance object to obtain the target feature vector of the target data.
Optionally, in some embodiments, when obtaining the target feature vector corresponding to image data, the target feature vector obtaining module 504 is specifically configured to:

perform feature extraction on the image data based on a pre-trained feature extraction model to obtain the target feature vector corresponding to the image data;

when obtaining the target feature vector corresponding to radar data, the target feature vector obtaining module 504 is specifically configured to:

detect the radar data by means of a pre-trained three-dimensional object detection model to obtain an object detection result, and obtain the target feature vector corresponding to the radar data based on the object detection result;

when obtaining the target feature vector corresponding to map data, the target feature vector obtaining module 504 is specifically configured to:

vectorize the map data to obtain the target feature vector corresponding to the map data.
The traffic light sensing apparatus of the embodiments of the present application is used to implement the corresponding traffic light sensing method in the foregoing method Embodiment 1 or Embodiment 2, and has the beneficial effects of the corresponding method embodiment, which are not repeated here. In addition, for the functional implementation of each module in the traffic light sensing apparatus of the embodiments of the present application, reference may be made to the description of the corresponding parts in the foregoing method Embodiment 1 or Embodiment 2, which is likewise not repeated here.
Embodiment 4
Referring to Figure 6, a schematic structural diagram of an electronic device according to Embodiment 4 of the present application is shown. The specific embodiments of the present application do not limit the specific implementation of the electronic device.
As shown in Figure 6, the electronic device may include: a processor 602, a communications interface 604, a memory 606, and a communication bus 608.
Wherein:
the processor 602, the communications interface 604, and the memory 606 communicate with one another via the communication bus 608;

the communications interface 604 is used to communicate with other electronic devices or servers.
The processor 602 is configured to execute the program 610, and may specifically execute the relevant steps in the foregoing traffic light sensing method embodiments.
Specifically, the program 610 may include program code, and the program code includes computer operating instructions.
The processor 602 may be a CPU, an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The one or more processors included in a smart device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 606 is used to store the program 610. The memory 606 may include a high-speed RAM memory, and may also include a non-volatile memory, for example at least one disk memory.
The program 610 may specifically be used to cause the processor 602 to perform the following operations: acquiring multiple types of target data of a target location, the multiple types of target data including at least two of the following: image data, radar data, and map data; performing feature extraction on each type of target data to obtain the target feature vectors corresponding to each type of target data; fusing the target feature vectors based on a cross-attention mechanism to obtain a fused feature vector; and performing classification prediction based on the fused feature vector to obtain the traffic light sensing result of the target location.
For the specific implementation of each step in the program 610, reference may be made to the corresponding steps and the corresponding descriptions of the units in the foregoing traffic light sensing method embodiments, which are not repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices and modules described above may refer to the corresponding process descriptions in the foregoing method embodiments, which are likewise not repeated here.
With the electronic device of this embodiment, multiple different types of target data of the target location are acquired, the target feature vectors corresponding to the various types of target data are obtained, feature fusion based on the cross-attention mechanism is performed on the target feature vectors, and traffic light sensing is carried out based on the fused feature vector. That is, in the embodiments of the present application, cross-modal data fusion and comprehensive analytical reasoning are performed based on multiple different modalities of data about the environment around the target location to obtain the final sensing result. Therefore, compared with a sensing approach that relies only on the single modality of image data, the embodiments of the present application provide higher sensing stability and accuracy.
The embodiments of the present application further provide a computer program product, including computer instructions, the computer instructions instructing a computing device to perform the operations corresponding to any one of the traffic light sensing methods in the foregoing method embodiments.
It should be noted that, according to implementation needs, each component/step described in the embodiments of the present application may be split into more components/steps, and two or more components/steps or partial operations of components/steps may also be combined into new components/steps to achieve the purpose of the embodiments of the present application.
The above methods according to the embodiments of the present application may be implemented in hardware or firmware, or may be implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or as computer code that is originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded via a network, and then stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It can be understood that a computer, processor, microprocessor controller, or programmable hardware includes storage components (for example, RAM, ROM, flash memory, and the like) that can store or receive software or computer code, and when the software or computer code is accessed and executed by the computer, processor, or hardware, the traffic light sensing methods described herein are implemented. In addition, when a general-purpose computer accesses code for implementing the traffic light sensing methods shown herein, the execution of the code converts the general-purpose computer into a dedicated computer for executing the traffic light sensing methods shown herein.
Those of ordinary skill in the art will appreciate that the units and method steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled professionals may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the embodiments of the present application.
The above embodiments are only used to illustrate the embodiments of the present application and do not limit them. Those of ordinary skill in the relevant technical fields can also make various changes and variations without departing from the spirit and scope of the embodiments of the present application; therefore, all equivalent technical solutions also fall within the scope of the embodiments of the present application, and the patent protection scope of the embodiments of the present application shall be defined by the claims.

Claims (13)

  1. A traffic light sensing method, comprising:
    acquiring multiple types of target data of a target location, the multiple types of target data including at least two of the following: image data, radar data, and map data;
    performing feature extraction on each type of target data to obtain target feature vectors corresponding to each type of target data;
    fusing the target feature vectors based on a cross-attention mechanism to obtain a fused feature vector; and
    performing classification prediction based on the fused feature vector to obtain a traffic light sensing result of the target location.
  2. The method according to claim 1, wherein fusing the target feature vectors based on the cross-attention mechanism to obtain the fused feature vector comprises:
    for each type of target feature vector, calculating attention values of the other types of target feature vectors based on the correlation between this type of target feature vector and the other types of target feature vectors;
    updating this type of target feature vector based on the attention values of the other types of target feature vectors to obtain an updated target vector; and
    performing fusion processing on the updated target vectors to obtain the fused feature vector.
  3. The method according to claim 2, wherein before performing fusion processing on the updated target vectors to obtain the fused feature vector, the method further comprises:
    determining whether a preset update stop condition is reached;
    if not, taking the updated target vector as a new target feature vector and returning to the step of calculating, for each type of target feature vector, the attention values of the other types of target feature vectors based on the correlation between this type of target feature vector and the other types of target feature vectors, until the update stop condition is met.
  4. The method according to claim 2 or 3, wherein performing fusion processing on the updated target vectors to obtain the fused feature vector comprises:
    performing a max pooling operation on the updated target vectors to obtain the fused feature vector.
  5. The method according to claim 1, wherein performing feature extraction on each type of target data separately to obtain the target feature vectors corresponding to the various types of target data comprises:
    performing feature extraction on each type of target data to obtain a feature sequence corresponding to that type of target data, the feature sequence containing multiple initial feature vectors;
    performing, based on a self-attention mechanism, feature fusion on the initial feature vectors in the feature sequence corresponding to each type of target data to obtain the target feature vector corresponding to that type of target data;
    wherein, for image data, each initial feature vector represents the feature information contained in one frame of image data among multiple frames of consecutive image data; for radar data, each initial feature vector represents the feature information contained in one frame of radar data among multiple frames of consecutive radar data; and for map data, the multiple initial feature vectors represent feature information of at least one map instance object.
  6. The method according to claim 5, wherein, if the target data is image data or radar data, the process of performing feature fusion, based on the self-attention mechanism, on the initial feature vectors in the feature sequence corresponding to the target data to obtain the target feature vector corresponding to the target data comprises:
    selecting one initial feature vector from the initial feature vectors in the feature sequence corresponding to the target data as a reference initial vector;
    calculating attention values of the remaining initial feature vectors based on the degree of correlation between the reference initial vector and the remaining initial feature vectors;
    updating the reference initial vector based on the attention values of the remaining initial feature vectors to obtain the target feature vector corresponding to the target data.
  7. The method according to claim 6, wherein selecting one initial feature vector from the initial feature vectors in the feature sequence corresponding to the target data as the reference initial vector comprises:
    selecting, from the initial feature vectors in the feature sequence corresponding to the target data, the initial feature vector corresponding to the frame of image data or radar data with the latest timestamp as the reference initial vector.
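    A minimal sketch of claims 6 and 7 together for a frame sequence (image or radar): the vector of the most recent frame serves as the reference, and the older frames contribute through attention values computed from their correlation with it. Shapes, names, and the dot-product correlation are assumptions; at least two frames are assumed:

```python
# Temporal fusion sketch for claims 6-7 (dot-product correlation assumed).
import torch

def fuse_frame_sequence(frame_vecs, timestamps):
    """frame_vecs: (T, dim) tensor, one initial feature vector per frame."""
    ref_idx = int(torch.tensor(timestamps).argmax())   # latest frame (claim 7)
    ref = frame_vecs[ref_idx]                          # reference initial vector
    rest = torch.cat([frame_vecs[:ref_idx],
                      frame_vecs[ref_idx + 1:]])       # remaining vectors, (T-1, dim)
    attn = torch.softmax(rest @ ref, dim=0)            # attention values (claim 6)
    return ref + attn @ rest                           # updated reference vector
```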
  8. The method according to claim 5, wherein, if the target data is map data, the process of performing feature fusion, based on the self-attention mechanism, on the initial feature vectors in the feature sequence corresponding to the target data to obtain the target feature vector corresponding to the target data comprises:
    performing feature fusion, based on the self-attention mechanism, on the multiple initial feature vectors representing the feature information of each map instance object, to obtain multiple self-updated feature vectors for each map instance object;
    performing a max pooling operation on the multiple self-updated feature vectors of each map instance object to obtain the target feature vector of the target data.
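    A minimal sketch of claim 8, assuming plain dot-product self-attention within each map instance object; how the per-instance pooled vectors are combined is left open by the claim, so a further max pool across instances is assumed here:

```python
# Map-data fusion sketch for claim 8 (self-attention form is assumed).
import torch

def self_update(x):
    """x: (N, dim) initial feature vectors of one map instance object."""
    attn = torch.softmax(x @ x.T, dim=-1)   # (N, N) self-attention weights
    return attn @ x                         # (N, dim) self-updated vectors

def map_target_vector(instances):
    """instances: list of (N_i, dim) tensors, one per map instance object."""
    pooled = [self_update(inst).max(dim=0).values for inst in instances]
    # Cross-instance combination is an assumption, not specified by the claim.
    return torch.stack(pooled).max(dim=0).values
```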
  9. The method according to claim 1, wherein the target feature vector corresponding to the image data is obtained by:
    performing feature extraction on the image data based on a pre-trained feature extraction model to obtain the target feature vector corresponding to the image data;
    the target feature vector corresponding to the radar data is obtained by:
    detecting the radar data through a pre-trained three-dimensional object detection model to obtain an object detection result, and obtaining the target feature vector corresponding to the radar data based on the object detection result;
    the target feature vector corresponding to the map data is obtained by:
    vectorizing the map data to obtain the target feature vector corresponding to the map data.
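    A hedged sketch of claim 9's three extraction paths. The claim names no specific models; the tiny CNN, the mean-pooled stand-in for a 3D detector (e.g. a PointPillars-style model), and the segment-wise map vectorization (VectorNet-style) below are all placeholder assumptions:

```python
# Per-modality extraction stand-ins for claim 9 (all models are assumed).
import torch
import torch.nn as nn

# Image path: a pre-trained feature extraction model; a tiny CNN stands in.
# Expects a (B, 3, H, W) batch and returns (B, 256) features.
image_model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                            nn.Linear(16, 256))

def radar_vector(points):
    """points: (N, 4) point cloud. A real system would pool features from a
    pre-trained 3D detector's detection result; a global mean stands in."""
    return points.mean(dim=0).repeat(64)        # (256,) placeholder vector

def map_vector(polyline):
    """polyline: (P, 2) lane/stop-line vertices. Vectorized representation:
    one (start, end) vector per segment, padded to a fixed size."""
    segs = torch.cat([polyline[:-1], polyline[1:]], dim=1)   # (P-1, 4)
    out = torch.zeros(256)
    flat = segs.flatten()[:256]
    out[:flat.numel()] = flat
    return out
```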
  10. A traffic signal lamp sensing apparatus, comprising:
    a target data acquisition module, configured to acquire multiple types of target data for a target position, the multiple types of target data comprising at least two of the following: image data, radar data, and map data;
    a target feature vector obtaining module, configured to perform feature extraction on each type of target data separately to obtain target feature vectors corresponding to the various types of target data;
    a fusion module, configured to fuse the various target feature vectors based on a cross-attention mechanism to obtain a fused feature vector;
    a result obtaining module, configured to perform classification prediction based on the fused feature vector to obtain a traffic signal lamp sensing result for the target position.
  11. An electronic device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;
    the memory is configured to store at least one executable instruction that causes the processor to perform operations corresponding to the traffic signal lamp sensing method according to any one of claims 1-9.
  12. A computer storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the traffic signal lamp sensing method according to any one of claims 1-9.
  13. A computer program product, comprising computer instructions that instruct a computing device to perform operations corresponding to the traffic signal lamp sensing method according to any one of claims 1-9.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210599282.6A CN114694123B (en) 2022-05-30 2022-05-30 Traffic signal lamp sensing method, device, equipment and storage medium
CN202210599282.6 2022-05-30

Publications (1)

Publication Number
WO2023231991A1 (en)

Family ID: 82144742

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/096961 WO2023231991A1 (en) 2022-05-30 2023-05-29 Traffic signal lamp sensing method and apparatus, and device and storage medium

Country Status (2)

Country Link
CN (1) CN114694123B (en)
WO (1) WO2023231991A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102652486B1 (en) * 2021-09-24 2024-03-29 (주)오토노머스에이투지 Method for predicting traffic light information by using lidar and server using the same
CN114694123B (en) * 2022-05-30 2022-09-27 阿里巴巴达摩院(杭州)科技有限公司 Traffic signal lamp sensing method, device, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109583415A (en) * 2018-12-11 2019-04-05 兰州大学 A kind of traffic lights detection and recognition methods merged based on laser radar with video camera
CN111563551A (en) * 2020-04-30 2020-08-21 支付宝(杭州)信息技术有限公司 Multi-mode information fusion method and device and electronic equipment
KR20200102907A (en) * 2019-11-12 2020-09-01 써모아이 주식회사 Method and apparatus for object recognition based on visible light and infrared fusion image
CN114254696A (en) * 2021-11-30 2022-03-29 上海西虹桥导航技术有限公司 Visible light, infrared and radar fusion target detection method based on deep learning
US20220101087A1 (en) * 2020-09-30 2022-03-31 Qualcomm Incorporated Multi-modal representation based event localization
CN114419412A (en) * 2022-03-31 2022-04-29 江西财经大学 Multi-modal feature fusion method and system for point cloud registration
CN114694123A (en) * 2022-05-30 2022-07-01 阿里巴巴达摩院(杭州)科技有限公司 Traffic signal lamp sensing method, device, equipment and storage medium

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102015214743A1 (en) * 2015-08-03 2017-02-09 Audi Ag Method and device in a motor vehicle for improved data fusion in an environment detection
CN107316488B (en) * 2017-08-23 2021-01-12 苏州豪米波技术有限公司 Signal lamp identification method, device and system
DE102019215440B4 (en) * 2019-10-09 2021-04-29 Zf Friedrichshafen Ag Recognition of traffic signs
CN111507210B (en) * 2020-03-31 2023-11-21 华为技术有限公司 Traffic signal lamp identification method, system, computing equipment and intelligent vehicle
CN111652050B (en) * 2020-04-20 2024-04-02 宁波吉利汽车研究开发有限公司 Traffic sign positioning method, device, equipment and medium
CN111582189B (en) * 2020-05-11 2023-06-23 腾讯科技(深圳)有限公司 Traffic signal lamp identification method and device, vehicle-mounted control terminal and motor vehicle
CN111950467B (en) * 2020-08-14 2021-06-25 清华大学 Fusion network lane line detection method based on attention mechanism and terminal equipment
CN112580460A (en) * 2020-12-11 2021-03-30 西人马帝言(北京)科技有限公司 Traffic signal lamp identification method, device, equipment and storage medium
CN112507947A (en) * 2020-12-18 2021-03-16 宜通世纪物联网研究院(广州)有限公司 Gesture recognition method, device, equipment and medium based on multi-mode fusion
CN112488083B (en) * 2020-12-24 2024-04-05 杭州电子科技大学 Identification method, device and medium of traffic signal lamp based on key point extraction of hetmap
CN112861748B (en) * 2021-02-22 2022-07-12 奥特酷智能科技(南京)有限公司 Traffic light detection system and method in automatic driving
CN113065590B (en) * 2021-03-26 2021-10-08 清华大学 Vision and laser radar multi-mode data fusion method based on attention mechanism
CN113343849A (en) * 2021-06-07 2021-09-03 西安恒盛安信智能技术有限公司 Fusion sensing equipment based on radar and video
CN113421305B (en) * 2021-06-29 2023-06-02 上海高德威智能交通系统有限公司 Target detection method, device, system, electronic equipment and storage medium
CN113269156B (en) * 2021-07-02 2023-04-18 昆明理工大学 Signal lamp detection and identification method and system based on multi-scale feature fusion
CN114398937B (en) * 2021-12-01 2022-12-27 北京航空航天大学 Image-laser radar data fusion method based on mixed attention mechanism
CN113879339A (en) * 2021-12-07 2022-01-04 阿里巴巴达摩院(杭州)科技有限公司 Decision planning method for automatic driving, electronic device and computer storage medium
CN114549542A (en) * 2021-12-24 2022-05-27 阿里巴巴达摩院(杭州)科技有限公司 Visual semantic segmentation method, device and equipment


Also Published As

Publication number Publication date
CN114694123B (en) 2022-09-27
CN114694123A (en) 2022-07-01


Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 23815170
    Country of ref document: EP
    Kind code of ref document: A1