WO2023231991A1 - Traffic signal lamp sensing method and apparatus, and device and storage medium - Google Patents

Traffic signal lamp sensing method and apparatus, and device and storage medium

Info

Publication number
WO2023231991A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
data
feature
feature vector
initial
Prior art date
Application number
PCT/CN2023/096961
Other languages
French (fr)
Chinese (zh)
Inventor
王磊
刘挺
卿泉
Original Assignee
阿里巴巴达摩院(杭州)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴达摩院(杭州)科技有限公司
Publication of WO2023231991A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/254 Fusion techniques of classification results, e.g. of results related to same input data

Definitions

  • Embodiments of the present application relate to the field of computer technology, and in particular to a traffic light sensing method, device, equipment and storage medium.
  • Traffic light perception refers to accurately identifying the color and control direction of traffic lights at intersections. It is a very important task in fields such as autonomous driving.
  • In related technologies, a common solution for traffic light perception is to obtain image data containing traffic lights and detect the image data through a target detection model to obtain the corresponding perception result.
  • This solution depends heavily on the image content, so its stability is poor. For example, when the traffic lights are blocked by surrounding objects such as large vehicles, or are invisible in the image due to rainy weather, this solution cannot obtain a perception result.
  • In view of this, embodiments of the present application provide a traffic light sensing method, device, equipment and storage medium to at least partially solve the above problems.
  • According to a first aspect of the embodiments of the present application, a traffic light sensing method is provided, including:
  • acquiring multiple types of target data of a target position, where the multiple types of target data include at least two of the following: image data, radar data, and map data;
  • performing feature extraction on each type of target data to obtain target feature vectors corresponding to each type of target data;
  • fusing the target feature vectors based on a cross-attention mechanism to obtain a fused feature vector;
  • performing classification prediction based on the fused feature vector to obtain a traffic light perception result of the target position.
  • According to a second aspect of the embodiments of the present application, a traffic light sensing device is provided, including:
  • a target data acquisition module, used to acquire multiple types of target data of a target position, where the multiple types of target data include at least two of the following: image data, radar data, and map data;
  • a target feature vector obtaining module, used to perform feature extraction on each type of target data to obtain target feature vectors corresponding to each type of target data;
  • a fusion module, used to fuse the target feature vectors based on a cross-attention mechanism to obtain a fused feature vector;
  • a result obtaining module, used to perform classification prediction based on the fused feature vector to obtain a traffic light perception result of the target position.
  • According to a third aspect of the embodiments of the present application, an electronic device is provided, including: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with each other through the communication bus; the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the traffic light sensing method described in the first aspect.
  • According to a fourth aspect of the embodiments of the present application, a computer storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the traffic light sensing method described in the first aspect is implemented.
  • The traffic light sensing method, device, equipment and storage medium provided by the embodiments of the present application acquire multiple different types of target data of the target position, obtain the target feature vectors corresponding to each type of target data, and then perform cross-attention-based feature fusion on the target feature vectors; traffic light perception is performed based on the fused feature vector.
  • That is to say, in the embodiments of this application, cross-modal data fusion and comprehensive analysis and reasoning are performed based on multiple different modal data of the environment around the target position to obtain the final perception result. Therefore, compared with perception methods that rely only on a single modality such as image data, the perception stability and accuracy of the embodiments of the present application are higher.
  • Figure 1 is a flow chart of the steps of a traffic light sensing method according to Embodiment 1 of the present application;
  • Figure 2 is a schematic diagram of an example scenario in the embodiment shown in Figure 1;
  • Figure 3 is a flow chart of the steps of a traffic light sensing method according to Embodiment 2 of the present application;
  • Figure 4 is a schematic diagram of an example scenario in the embodiment shown in Figure 3;
  • Figure 5 is a structural block diagram of a traffic light sensing device according to Embodiment 3 of the present application;
  • Figure 6 is a schematic structural diagram of an electronic device according to Embodiment 4 of the present application.
  • Referring to Figure 1, Figure 1 is a flow chart of the steps of a traffic light sensing method according to Embodiment 1 of the present application. Specifically, the traffic light sensing method provided by this embodiment includes the following steps:
  • Step 102: Acquire multiple types of target data of the target position, where the multiple types of target data include at least two of the following: image data, radar data, and map data.
  • Specifically, the target position may be a target intersection where traffic light sensing is to be performed, or a specific location around the target intersection.
  • The image data may be images of the target position collected by a camera or the like; the radar data may be point cloud data of the target position collected by a lidar, or three-dimensional data of the target position collected by a millimeter-wave radar, and so on.
  • The map data may contain the position, shape, size, and other information of instance objects at the target position, such as lane lines, crosswalks, and green belts.
  • The multiple types of target data may include any two of image data, radar data, and map data, or all three. The more types of target data acquired, the higher the accuracy and stability of the final traffic light perception result.
  • Step 104: Perform feature extraction on each type of target data to obtain target feature vectors corresponding to each type of target data.
  • Specifically, for image data, feature extraction can be performed based on a pre-trained feature extraction model to obtain the target feature vector corresponding to the image data. For radar data, the radar data can be detected by a pre-trained three-dimensional target detection model to obtain target detection results, and the target feature vector corresponding to the radar data is obtained based on those detection results.
  • For map data, after the map data is obtained, it can be vectorized to obtain the target feature vectors corresponding to the map data. For example, for the map instance object of a specific lane line at the target position, the position information of multiple sampling points on the lane line can be obtained, and every two adjacent sampling points are then used as the start point and end point of a generated vector; that vector is a feature vector of the lane line instance object, representing the lane-line segment between the two adjacent sampling points.
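  • To make the vectorization above concrete, the following minimal Python sketch (illustrative only; the patent does not specify an implementation, and all names are hypothetical) turns ordered sampling points on a lane line into segment vectors:

```python
import numpy as np

def vectorize_lane_line(sample_points):
    """Turn ordered (x, y) sampling points on a lane line into segment
    vectors: every two adjacent points become the start and end of one
    vector [x_start, y_start, x_end, y_end] representing that segment."""
    pts = np.asarray(sample_points, dtype=np.float32)
    # Pair each point with its successor: one row per adjacent point pair.
    return np.concatenate([pts[:-1], pts[1:]], axis=1)

# Example: 3 sampling points along a lane line yield 2 segment vectors.
segments = vectorize_lane_line([(0.0, 0.0), (0.0, 50.0), (0.0, 100.0)])
print(segments.shape)  # (2, 4)
```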
  • Step 106: Based on the cross-attention mechanism, fuse the target feature vectors to obtain a fused feature vector.
  • Specifically, for the target feature vector corresponding to each type of target data, the similarity between the target feature vectors corresponding to the other types of target data and this target feature vector can be used to adjust it, so that the adjusted target feature vector emphasizes information related to the other target feature vectors and de-emphasizes information that is weakly related to them; all adjusted target feature vectors are then fused to obtain the fused feature vector.
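  • A minimal sketch of this cross-attention adjustment follows, assuming dot-product similarity, softmax attention weights, a shared feature dimension across modalities, lists of 1-D vectors, and summation as the final fusion (none of these choices is fixed by the patent):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_attention_adjust(query_vec, other_vecs):
    """Adjust one modality's target feature vector using its similarity
    (attention weights) to the target feature vectors of the other
    modalities, emphasizing information related to them."""
    sims = np.array([query_vec @ v for v in other_vecs])  # similarity scores
    weights = softmax(sims / np.sqrt(len(query_vec)))     # attention weights
    return query_vec + sum(w * v for w, v in zip(weights, other_vecs))

def fuse(target_vecs):
    """Adjust every target feature vector against the others, then fuse the
    adjusted vectors (here by summation) into one fused feature vector."""
    adjusted = [cross_attention_adjust(v, target_vecs[:i] + target_vecs[i + 1:])
                for i, v in enumerate(target_vecs)]
    return np.sum(adjusted, axis=0)
```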
  • Step 108: Perform classification prediction based on the fused feature vector to obtain the traffic light perception result of the target position.
  • Specifically, the traffic light perception result finally obtained by the embodiments of the present application may include, for the target position, the traffic light colors for the straight, left-turn, and right-turn directions.
  • Any existing classification prediction method can be used to obtain the final perception result based on the fused feature vector, for example, through a classification prediction model used for classification prediction.
  • When a classification prediction model is used, it can be a classifier structure with three branches, where each branch outputs a binary classification result to predict the traffic light color for one of the three directions: straight, left turn, or right turn.
  • The specific structure of the classification prediction model is not limited; for example, a multi-layer perceptron model with a relatively simple structure can be used, and so on.
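  • As an illustration of such a three-branch classifier, here is a PyTorch-style sketch (the MLP backbone, the dimensions, and the binary red-versus-green heads are assumptions for illustration; the patent does not fix the model structure):

```python
import torch
import torch.nn as nn

class TrafficLightClassifier(nn.Module):
    """Three-branch classifier over the fused feature vector: each branch
    outputs a binary prediction of the traffic light color for one
    controlled direction (straight, left turn, or right turn)."""
    def __init__(self, fused_dim=256, hidden_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(fused_dim, hidden_dim), nn.ReLU())
        # One binary head (e.g. red vs. green, an assumption) per direction.
        self.heads = nn.ModuleDict(
            {d: nn.Linear(hidden_dim, 2) for d in ("straight", "left", "right")})

    def forward(self, fused_vec):
        h = self.backbone(fused_vec)
        return {d: head(h) for d, head in self.heads.items()}

# Example: one fused feature vector in, three binary logit pairs out.
logits = TrafficLightClassifier()(torch.randn(1, 256))
```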
  • Referring to Figure 2, which is a schematic diagram of a scene corresponding to Embodiment 1 of the present application, the embodiment of the present application is described below using a specific scenario example:
  • Three types of target data of the target position are acquired: image data, radar point cloud data, and map data. Feature extraction is performed on the image data to obtain the corresponding target feature vector 1. 3D target detection is performed on the radar point cloud data, yielding three targets at the target position, for example a pedestrian, vehicle 1, and vehicle 2.
  • The types of detectable targets can be preset according to actual needs; the embodiments of this application do not limit the number of types or the specific content of the detectable preset targets. For example, the detectable preset targets may include three types, namely pedestrians, vehicles, and cyclists; Figure 2, in which the radar data contains three preset targets, is merely an example.
  • Each detected target corresponds to one target feature vector, which characterizes the type, position, shape, and other features of that target; target feature vectors 2, 3, and 4 in Figure 2 are the target feature vectors corresponding to the radar data.
  • The map data of the target position can be vectorized to obtain the corresponding target feature vectors, where each target feature vector represents the feature information of one instance object in the map. Assuming that the map data in Figure 2 contains four instance objects, namely lane line 1, lane line 2, lane line 3, and a crosswalk, then target feature vector 5 represents the feature information of lane line 1, target feature vector 6 represents the feature information of lane line 2, target feature vector 7 represents the feature information of lane line 3, and target feature vector 8 represents the feature information of the crosswalk; target feature vectors 5-8 are the target feature vectors corresponding to the map data.
  • After the target feature vectors corresponding to the three types of target data are obtained, they can be fused based on the cross-attention mechanism to obtain the fused feature vector, and classification prediction is then performed based on the fused feature vector to obtain the traffic light perception result: the traffic light information corresponding to the three directions of going straight, turning left, and turning right, specifically, for example, the colors of the traffic lights corresponding to those three directions.
  • With the traffic light sensing method provided by the embodiments of the present application, multiple different types of target data of the target position are acquired, the target feature vectors corresponding to each type of target data are obtained, and feature fusion based on the cross-attention mechanism is then performed on the target feature vectors; traffic light perception is performed based on the fused feature vector.
  • That is to say, in the embodiments of this application, cross-modal data fusion and comprehensive analysis and reasoning are performed based on multiple different modal data of the environment around the target position to obtain the final perception result. Therefore, compared with perception methods that rely only on a single modality such as image data, the perception stability and accuracy of the embodiments of the present application are higher.
  • The traffic light sensing method of this embodiment can be executed by any appropriate electronic device with data processing capabilities, including but not limited to servers, PCs, and the like.
  • Referring to Figure 3, Figure 3 is a flow chart of the steps of a traffic light sensing method according to Embodiment 2 of the present application. Specifically, the traffic light sensing method provided by this embodiment includes the following steps:
  • Step 302: Acquire multiple types of target data of the target position, where the multiple types of target data include at least two of the following: image data, radar data, and map data.
  • For image data, multiple consecutive frames can be acquired; likewise, for radar data, multiple consecutive frames can be acquired. For example, a preset number of consecutive frames of image data or radar data of the target position preceding the current moment can be obtained.
  • The image data may be images of the target position collected by a camera or the like; the radar data may be point cloud data of the target position collected by a lidar, or three-dimensional data of the target position collected by a millimeter-wave radar, and so on; the map data may contain the position, shape, size, and other information of instance objects at the target position, such as lane lines, crosswalks, and green belts.
  • The multiple types of target data may include any two of image data, radar data, and map data, or all three.
  • Step 304: Perform feature extraction on each type of target data to obtain a feature sequence corresponding to that target data, where the feature sequence contains multiple initial feature vectors.
  • For image data, each initial feature vector represents the feature information contained in one frame out of multiple consecutive frames of image data; for radar data, each initial feature vector represents the feature information contained in one frame out of multiple consecutive frames of radar data; for map data, the multiple initial feature vectors represent the feature information of at least one map instance object.
  • Specifically, for image data, the number of initial feature vectors contained in the feature sequence is the same as the number of frames of image data: one initial feature vector corresponds to one frame of image data and characterizes the feature information contained in that frame. For example, when there are 3 frames of image data, there are 3 corresponding initial feature vectors, each obtained by feature extraction on one frame of image data.
  • Similarly, for radar data, the number of initial feature vectors contained in the feature sequence is the same as the number of frames of radar data: one initial feature vector corresponds to one frame of radar data and characterizes the feature information contained in that frame. For example, when there are 3 frames of radar data in total, there are also 3 corresponding initial feature vectors, each obtained by feature extraction on one frame of radar data.
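  • As a simple sketch of building such a feature sequence (the extractor here is a stand-in; a real system would use a trained feature extraction model):

```python
import numpy as np

def build_feature_sequence(frames, extract_fn):
    """One initial feature vector per frame: the feature sequence contains
    exactly as many initial feature vectors as there are frames."""
    return [extract_fn(frame) for frame in frames]

# Stand-in extractor: flatten the frame and keep the first 4 values.
dummy_extract = lambda frame: np.asarray(frame, dtype=np.float32).ravel()[:4]
seq = build_feature_sequence(
    [np.ones((2, 2)), np.zeros((2, 2)), np.eye(2)], dummy_extract)
print(len(seq))  # 3 frames -> 3 initial feature vectors
```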
  • For map data, a feature sequence containing multiple initial feature vectors can be obtained by vectorization, where the initial feature vectors represent the feature information of the map instance objects (such as lane lines, crosswalks, and green belts) in the map.
  • For example, for a lane line with a length of 200 meters, the first 100 meters of the lane line can be represented by a first initial feature vector and the following 100 meters by a second initial feature vector, where the first initial feature vector is obtained by vectorizing the coordinate positions of the start and end points of the first 100 meters of the lane line, and the second initial feature vector is obtained by vectorizing the coordinate positions of the start and end points of the following 100 meters.
  • Step 306: Based on the self-attention mechanism, perform feature fusion on the initial feature vectors in the feature sequence corresponding to each type of target data to obtain the target feature vector corresponding to each type of target data.
  • When the target data is image data or radar data, the process of performing feature fusion on the initial feature vectors in the feature sequence corresponding to the target data to obtain the corresponding target feature vector includes: selecting one initial feature vector from the feature sequence as the base initial vector; calculating the attention values of the remaining initial feature vectors based on the correlation between the base initial vector and the remaining initial feature vectors; and updating the base initial vector based on those attention values to obtain the target feature vector corresponding to the target data.
  • The correlation between the base initial vector and the remaining initial feature vectors can be characterized by attention weights: the higher the correlation between the base initial vector and a remaining initial feature vector, the greater that vector's attention weight; conversely, the lower the correlation, the smaller its attention weight.
  • The above attention weights can be calculated using an existing attention mechanism (attention method).
  • Calculating the attention values of the remaining initial feature vectors may include: using the attention mechanism to calculate the attention weights of the remaining initial feature vectors, and then using the product of each attention weight and the corresponding remaining initial feature vector as that vector's attention value.
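  • A minimal sketch of this computation, assuming dot-product correlation and softmax-normalized attention weights (the patent only requires an existing attention mechanism, so these choices are assumptions):

```python
import numpy as np

def attention_values(base_vec, rest_vecs):
    """Attention value of each remaining initial feature vector: its
    attention weight (correlation with the base vector) times the vector."""
    scores = np.array([base_vec @ v for v in rest_vecs])  # correlations
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                              # softmax weights
    return [w * v for w, v in zip(weights, rest_vecs)]
```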
  • Generally, the later the timestamp of a frame of data, the more important the feature information it contains. Therefore, in order for the final target feature vector to better characterize the feature information in the target data, when the target data is image data or radar data, the initial feature vector corresponding to the last frame of image data or radar data can be used as the base initial vector, and the base initial vector is updated based on the attention values of the remaining initial feature vectors to obtain the target feature vector corresponding to the target data.
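  • Reusing the `attention_values` sketch above, the temporal fusion for an image or radar feature sequence could then look like this (assuming at least two frames and update-by-addition, which the patent leaves open):

```python
def fuse_frame_sequence(initial_vecs):
    """Self-attention fusion for an image/radar feature sequence: take the
    last frame's initial feature vector as the base (latest data matters
    most) and update it with the attention values of the earlier frames."""
    base, rest = initial_vecs[-1], list(initial_vecs[:-1])
    return base + sum(attention_values(base, rest))  # the target feature vector
```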
  • When the target data is map data, the process of performing feature fusion on the initial feature vectors in the feature sequence corresponding to the target data to obtain the corresponding target feature vector includes: for each map instance object, performing feature fusion on its initial feature vectors based on the self-attention mechanism to obtain multiple self-updated feature vectors; and performing a max pooling operation on the multiple self-updated feature vectors of each map instance object to obtain the target feature vector of the target data.
  • Specifically, the multiple self-updated feature vectors of each map instance object can be obtained in the following way. Assume that the map data contains only one map instance object, a lane line, whose corresponding initial feature vectors are initial feature vector 1 and initial feature vector 2.
  • The process of obtaining the target feature vector of the map data may then include: for initial feature vector 1, calculating the attention value of initial feature vector 2 based on the correlation (attention weight) between initial feature vector 1 and initial feature vector 2, and updating initial feature vector 1 based on that attention value to obtain self-updated feature vector 1; similarly, for initial feature vector 2, calculating the attention value of initial feature vector 1 based on the correlation (attention weight) between initial feature vector 2 and initial feature vector 1, and updating initial feature vector 2 based on that attention value to obtain self-updated feature vector 2; and then performing a max pooling operation on self-updated feature vector 1 and self-updated feature vector 2 (taking, at each position, the largest element among the updated feature vectors as the element of the target feature vector at that position) to obtain the target feature vector of the target data (that is, of the lane line).
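  • Continuing with the `attention_values` sketch above, the self-update and max pooling for one map instance object could look like this (lists of equal-length 1-D vectors are assumed):

```python
import numpy as np

def map_instance_target_vector(initial_vecs):
    """For one map instance object: self-update each initial feature vector
    against the others, then element-wise max-pool the self-updated vectors
    into the target feature vector."""
    updated = []
    for i, vec in enumerate(initial_vecs):
        rest = initial_vecs[:i] + initial_vecs[i + 1:]
        updated.append(vec + sum(attention_values(vec, rest)))
    # Max pooling: keep the largest element at each position.
    return np.max(np.stack(updated), axis=0)
```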
  • Step 308: For each type of target feature vector, calculate the attention values of the other types of target feature vectors based on the correlation between this type of target feature vector and the other types of target feature vectors.
  • The correlation between this type of target feature vector and the other types of target feature vectors can be characterized by attention weights: the higher the correlation, the greater the attention weights of the other types of target feature vectors; conversely, the lower the correlation, the smaller their attention weights.
  • These attention weights can also be calculated using an existing attention mechanism (attention method).
  • Specifically, for each other type of target feature vector, the correlation (attention weight) between that target feature vector and this type of target feature vector can be calculated, and the product of that correlation (attention weight) and that other type of target feature vector is used as the attention value of that other type of target feature vector.
  • For example, for target feature vector 1, the process of calculating the attention value of target feature vector 2 includes: first calculating the correlation (attention weight) between target feature vector 1 and target feature vector 2, and then using the product of that correlation (attention weight) and target feature vector 2 as the attention value of target feature vector 2.
  • Step 310: Update this type of target feature vector based on the attention values of the other types of target feature vectors to obtain an updated target vector. It is then determined whether a preset update stop condition is reached; if not, the updated target vector is used as the new target feature vector and the process returns to step 308; if so, step 312 is executed.
  • Specifically, the sum of this type of target feature vector and the attention values of the other types of target feature vectors can be used as the updated target vector corresponding to this type of target feature vector.
  • The update stop condition can be customized according to actual needs, and its specific content is not limited here. For example, the update stop condition can be that the number of update rounds reaches a preset number; it can also be that the correlation (attention weight) between the target vectors obtained in two successive updates is greater than a preset correlation threshold (attention weight threshold), and so on.
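  • Steps 308 and 310 can be sketched as the loop below, again reusing `attention_values`; a fixed round count stands in for the update stop condition (the threshold-based condition would work the same way):

```python
def cross_update(target_vecs, max_rounds=3):
    """Iteratively update each type's target feature vector with the
    attention values of the other types' target feature vectors, stopping
    after a preset number of update rounds."""
    vecs = list(target_vecs)
    for _ in range(max_rounds):  # stand-in for the preset stop condition
        vecs = [v + sum(attention_values(v, vecs[:i] + vecs[i + 1:]))
                for i, v in enumerate(vecs)]
    return vecs
```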
  • Step 312: Perform fusion processing on the various updated target vectors to obtain a fused feature vector.
  • Specifically, the sum of the various updated target vectors can be used directly as the fused feature vector; alternatively, a weight value can be set separately for each updated target vector and a weighted sum of the various updated target vectors computed according to those weight values; or a max pooling operation can be performed on the various updated target vectors to obtain the fused feature vector; and so on.
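  • The three fusion options just described can be sketched as follows (the mode names are illustrative only):

```python
import numpy as np

def fuse_updated_vectors(updated_vecs, mode="sum", weights=None):
    """Fuse the updated target vectors into one fused feature vector by plain
    sum, weighted sum with per-vector weights, or element-wise max pooling."""
    stacked = np.stack(updated_vecs)
    if mode == "sum":
        return stacked.sum(axis=0)
    if mode == "weighted":
        return (np.asarray(weights)[:, None] * stacked).sum(axis=0)
    if mode == "max":
        return stacked.max(axis=0)
    raise ValueError(f"unknown fusion mode: {mode}")
```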
  • Step 314: Perform classification prediction based on the fused feature vector to obtain the traffic light perception result of the target position.
  • Specifically, the traffic light perception result finally obtained by the embodiments of the present application may include, for the target position, the traffic light information for the straight, left-turn, and right-turn directions, specifically, for example, the traffic light colors for those three directions.
  • Any existing classification prediction method can be used to obtain the final perception result based on the fused feature vector, for example, through a classification prediction model used for classification prediction.
  • When a classification prediction model is used, it can be a classifier structure with three branches, where each branch outputs a binary classification result to predict the traffic light color for one of the three directions: straight, left turn, or right turn.
  • The specific structure of the classification prediction model is not limited; for example, a multi-layer perceptron model with a relatively simple structure can be used, and so on.
  • In a feasible manner, the feature extraction on each type of target data in step 304 can be performed based on a feature extraction model; the self-attention-based feature fusion on the initial feature vectors in each feature sequence in step 306 can be performed based on a self-attention model (for example, a transformer model based on the self-attention mechanism); steps 308 to 310 can be performed based on a cross-attention model (for example, a transformer model based on the cross-attention mechanism); and step 314 can be performed based on the classification prediction model.
  • It can be seen that, after the target data is obtained, the traffic light sensing method provided by the embodiments of the present application can output the final sensing result based on a series of machine learning models. In other words, the embodiments of the present application provide an end-to-end traffic light sensing solution that requires no complex post-processing operations, so the solution is simpler and has a wider scope of application.
  • Referring to Figure 4, which is a schematic diagram of a scene corresponding to Embodiment 2 of the present application, the embodiment of the present application is described below using a specific scenario example:
  • Assume the image data consists of 3 consecutive frames: the first, second, and third frames of image data; the radar data likewise consists of 3 consecutive frames: the first, second, and third frames of radar data. Feature extraction is performed on each of the three frames of image data to obtain the feature sequence corresponding to the image data (in the feature sequence corresponding to the image data in Figure 4, each open circle represents the initial feature vector corresponding to one frame of image data). Feature extraction is performed separately on each of the three frames of radar data to obtain the feature sequence composed of the initial feature vectors corresponding to each frame of radar data (assuming the radar data contains 3 targets in total, a pedestrian, vehicle 1, and vehicle 2, then in the feature sequence corresponding to the radar data in Figure 4, the 3 solid circles in each column represent the initial feature vectors of one frame of radar data, one solid circle represents the initial feature vector of one target in that frame, and the 3 solid circles in each row represent the initial feature vectors of the same target in different radar frames). Feature extraction (vectorized representation) is performed on the map data to obtain a feature sequence composed of multiple initial feature vectors corresponding to the map data (assuming the map data contains 4 map instance objects in total, lane line 1, lane line 2, lane line 3, and a crosswalk, then in the map data of Figure 4 each arrowed line segment represents one initial feature vector, where lane line 1 corresponds to 2 initial feature vectors, lane line 2 corresponds to 2 initial feature vectors, lane line 3 corresponds to 2 initial feature vectors, and the crosswalk corresponds to 4 initial feature vectors).
  • Then, based on the self-attention mechanism, feature fusion is performed on the initial feature vectors in each feature sequence to obtain the corresponding target feature vectors. Specifically: feature fusion is performed on the initial feature vectors in the feature sequence corresponding to the image data to obtain target feature vector 1 corresponding to the image data; feature fusion is performed on the initial feature vectors in the feature sequence corresponding to the radar data (fusing the initial feature vectors in each row separately) to obtain target feature vectors 2, 3, and 4 corresponding to the radar data; and feature fusion is performed on the initial feature vectors in the feature sequence corresponding to the map data (fusing the initial feature vectors of the same map instance object separately) to obtain target feature vectors 5, 6, 7, and 8 corresponding to the map data. Finally, based on the cross-attention mechanism, target feature vectors 1-8 are fused to obtain the fused feature vector, and classification prediction is then performed based on the fused feature vector to obtain the traffic light perception result: the traffic light colors corresponding to the three directions of going straight, turning left, and turning right.
  • The traffic light sensing method, device, equipment and storage medium provided by the embodiments of the present application acquire multiple different types of target data of the target position, obtain the target feature vectors corresponding to each type of target data, and then perform cross-attention-based feature fusion on the target feature vectors; traffic light perception is performed based on the fused feature vector. In other words, in the embodiments of this application, cross-modal data fusion and comprehensive analysis and reasoning are performed based on multiple different modal data of the environment around the target position to obtain the final perception result. Therefore, compared with perception methods that rely only on a single modality such as image data, the perception stability and accuracy of the embodiments of the present application are higher.
  • In addition, in the embodiments of the present application, feature fusion based on the self-attention mechanism is first performed on the initial feature vectors of multiple consecutive image frames or radar frames, and on the initial feature vectors of the different map instance objects in the map data, to obtain the target feature vectors corresponding to the different target data.
  • This self-attention-based feature fusion correlates and fuses the image or radar sequence with the historical states of the traffic participants in the surrounding environment, so the resulting target feature vectors contain richer and more important information. The subsequent logical reasoning based on these target feature vectors therefore makes the final traffic light perception result more accurate and stable.
  • The traffic light sensing method of this embodiment can be executed by any appropriate electronic device with data processing capabilities, including but not limited to servers, PCs, and the like.
  • Referring to Figure 5, FIG. 5 is a structural block diagram of a traffic light sensing device according to Embodiment 3 of the present application. The traffic light sensing device provided by the embodiment of the present application includes:
  • the target data acquisition module 502, used to acquire multiple types of target data of the target position, where the multiple types of target data include at least two of the following: image data, radar data, and map data;
  • the target feature vector obtaining module 504, used to perform feature extraction on each type of target data to obtain the target feature vectors corresponding to each type of target data;
  • the fusion module 506, used to fuse the target feature vectors based on the cross-attention mechanism to obtain a fused feature vector;
  • the result obtaining module 508, used to perform classification prediction based on the fused feature vector to obtain the traffic light perception result of the target position.
  • Optionally, the fusion module 506 is specifically used to: for each type of target feature vector, calculate the attention values of the other types of target feature vectors based on the correlation between this type of target feature vector and the other types of target feature vectors; update this type of target feature vector based on those attention values to obtain an updated target vector; and perform fusion processing on the various updated target vectors to obtain the fused feature vector.
  • Optionally, before performing fusion processing on the various updated target vectors to obtain the fused feature vector, the fusion module 506 is further used to: determine whether a preset update stop condition is reached; if not, use the updated target vectors as new target feature vectors and return to the step of calculating the attention values; if so, perform the step of fusing the various updated target vectors.
  • Optionally, when performing the step of fusing the various updated target vectors to obtain the fused feature vector, the fusion module 506 is specifically used to: use the sum of the various updated target vectors as the fused feature vector; or perform a weighted sum of the various updated target vectors according to separately set weight values; or perform a max pooling operation on the various updated target vectors to obtain the fused feature vector.
  • Optionally, the target feature vector obtaining module 504 is specifically used to: perform feature extraction on each type of target data to obtain a feature sequence corresponding to that target data, where the feature sequence contains multiple initial feature vectors; and, based on the self-attention mechanism, perform feature fusion on the initial feature vectors in the feature sequence corresponding to each type of target data to obtain the target feature vector corresponding to each type of target data.
  • For image data, each initial feature vector represents the feature information contained in one frame out of multiple consecutive frames of image data; for radar data, each initial feature vector represents the feature information contained in one frame out of multiple consecutive frames of radar data; for map data, the multiple initial feature vectors represent the feature information of at least one map instance object.
  • Optionally, when the target data is image data or radar data and the target feature vector obtaining module 504 performs feature fusion on the initial feature vectors in the feature sequence corresponding to the target data based on the self-attention mechanism to obtain the target feature vector corresponding to the target data, it is specifically used to: select one initial feature vector from the feature sequence as the base initial vector; calculate the attention values of the remaining initial feature vectors based on the correlation between the base initial vector and the remaining initial feature vectors; and update the base initial vector based on those attention values to obtain the target feature vector corresponding to the target data.
  • Optionally, when the target feature vector obtaining module 504 performs the step of selecting an initial feature vector from the initial feature vectors in the feature sequence corresponding to the target data as the base initial vector, it is specifically used to: use the initial feature vector corresponding to the last frame of image data or radar data as the base initial vector.
  • Optionally, when the target data is map data and the target feature vector obtaining module 504 performs feature fusion on the initial feature vectors in the feature sequence corresponding to the target data based on the self-attention mechanism to obtain the target feature vector corresponding to the target data, it is specifically used to: for each map instance object, perform feature fusion on its initial feature vectors based on the self-attention mechanism to obtain multiple self-updated feature vectors; and perform a max pooling operation on the multiple self-updated feature vectors of each map instance object to obtain the target feature vector of the target data.
  • Optionally, when obtaining the target feature vector corresponding to the image data, the target feature vector obtaining module 504 is specifically used to: perform feature extraction on the image data based on a pre-trained feature extraction model to obtain the target feature vector corresponding to the image data.
  • Optionally, when obtaining the target feature vector corresponding to the radar data, the target feature vector obtaining module 504 is specifically used to: detect the radar data through a pre-trained three-dimensional target detection model to obtain target detection results, and obtain the target feature vector corresponding to the radar data based on the target detection results.
  • Optionally, when obtaining the target feature vector corresponding to the map data, the target feature vector obtaining module 504 is specifically used to: vectorize the map data to obtain the target feature vectors corresponding to the map data.
  • The traffic light sensing device of the embodiments of the present application is used to implement the corresponding traffic light sensing method in the foregoing first or second method embodiment and has the beneficial effects of the corresponding method embodiment, which will not be described again here.
  • In addition, for the functional implementation of each module in the traffic light sensing device, reference can be made to the description of the corresponding parts in the first or second method embodiment, which is likewise not repeated here.
  • Referring to FIG. 6, a schematic structural diagram of an electronic device according to Embodiment 4 of the present application is shown. The specific embodiments of the present application do not limit the specific implementation of the electronic device.
  • As shown in FIG. 6, the electronic device may include: a processor 602, a communication interface 604, a memory 606, and a communication bus 608.
  • The processor 602, the communication interface 604, and the memory 606 communicate with each other through the communication bus 608.
  • Communication interface 604 is used to communicate with other electronic devices or servers.
  • The processor 602 is used to execute the program 610, and may specifically perform the relevant steps in the above traffic light sensing method embodiments.
  • The program 610 may include program code, and the program code may include computer operating instructions.
  • The processor 602 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
  • The one or more processors included in the smart device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
  • The memory 606 is used to store the program 610.
  • The memory 606 may include high-speed RAM memory, and may also include non-volatile memory, such as at least one disk memory.
  • The program 610 can specifically be used to cause the processor 602 to perform the following operations: acquire multiple types of target data of the target position, where the multiple types of target data include at least two of the following: image data, radar data, and map data; perform feature extraction on each type of target data to obtain the target feature vectors corresponding to each type of target data; fuse the target feature vectors based on the cross-attention mechanism to obtain a fused feature vector; and perform classification prediction based on the fused feature vector to obtain the traffic light perception result of the target position.
  • For the specific implementation of each step in the program 610, reference can be made to the corresponding steps and the corresponding descriptions in the units of the above traffic light sensing method embodiments, which will not be repeated here.
  • Those skilled in the art can clearly understand that for the convenience and simplicity of description, the specific working processes of the above-described devices and modules can be referred to the corresponding process descriptions in the foregoing method embodiments, and will not be described again here.
  • Through the electronic device of this embodiment, multiple different types of target data of the target position are acquired, the target feature vectors corresponding to each type of target data are obtained, and feature fusion based on the cross-attention mechanism is then performed on the target feature vectors; traffic light perception is performed based on the fused feature vector. That is to say, in the embodiments of this application, cross-modal data fusion and comprehensive analysis and reasoning are performed based on multiple different modal data of the environment around the target position to obtain the final perception result. Therefore, compared with perception methods that rely only on a single modality such as image data, the perception stability and accuracy of the embodiments of the present application are higher.
  • Embodiments of the present application also provide a computer program product, including computer instructions that instruct a computing device to perform operations corresponding to any one of the traffic light sensing methods in the above method embodiments.
  • It should be noted that each component/step described in the embodiments of this application can be split into more components/steps, and two or more components/steps or partial operations of components/steps can also be combined into new components/steps to achieve the purpose of the embodiments of this application.
  • The above methods according to the embodiments of the present application can be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium (such as a CD ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or as computer code downloaded over a network that was originally stored in a remote recording medium or a non-transitory machine-readable medium and will be stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware such as an ASIC or FPGA.
  • It can be understood that a computer, a processor, a microprocessor controller, or programmable hardware includes storage components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the traffic light sensing methods described herein are implemented. Furthermore, when a general-purpose computer accesses code for implementing the traffic light sensing methods shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for executing these methods.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Traffic Control Systems (AREA)

Abstract

Provided in the embodiments of the present application are a traffic signal lamp sensing method and apparatus, and a device and a storage medium. The traffic signal lamp sensing method comprises: acquiring a plurality of types of target data of a target position, wherein the plurality of types of target data comprise at least two types of the following: image data, radar data and map data; respectively performing feature extraction on the various types of target data to obtain target feature vectors corresponding to the various types of target data; performing fusion processing on the various target feature vectors on the basis of a cross attention mechanism to obtain a fused feature vector; and performing classification prediction on the basis of the fused feature vector to obtain a traffic signal lamp sensing result of the target position. According to the embodiments of the present application, cross-modal data fusion and comprehensive analysis reasoning are performed on the basis of a plurality of types of different modal data of a surrounding environment of a target position, so as to obtain a final sensing result, such that the sensing stability and accuracy are relatively high.

Description

Traffic light sensing method, device, equipment and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on May 30, 2022, with application number 202210599282.6 and entitled "Traffic light sensing method, device, equipment and storage medium", the entire content of which is incorporated into this application by reference.
Technical Field
Embodiments of the present application relate to the field of computer technology, and in particular to a traffic light sensing method, device, equipment and storage medium.
Background
Traffic light perception refers to accurately identifying the color and control direction of traffic lights at an intersection; it is a very important task in fields such as autonomous driving.
In the related art, a common solution for traffic light perception is to obtain image data containing traffic lights and detect the image data through a target detection model to obtain the corresponding perception result.
This solution depends heavily on the image content, so its stability is poor. For example, when the traffic lights are blocked by surrounding objects such as large vehicles, or are invisible in the image due to rainy weather, this solution cannot obtain a perception result.
Summary of the Invention
In view of this, embodiments of the present application provide a traffic light sensing method, device, equipment and storage medium to at least partially solve the above problems.
According to a first aspect of the embodiments of the present application, a traffic light sensing method is provided, including:
acquiring multiple types of target data of a target position, where the multiple types of target data include at least two of the following: image data, radar data, and map data;
performing feature extraction on each type of target data to obtain target feature vectors corresponding to each type of target data;
fusing the target feature vectors based on a cross-attention mechanism to obtain a fused feature vector;
performing classification prediction based on the fused feature vector to obtain a traffic light perception result of the target position.
According to a second aspect of the embodiments of the present application, a traffic light sensing device is provided, including:
a target data acquisition module, used to acquire multiple types of target data of a target position, where the multiple types of target data include at least two of the following: image data, radar data, and map data;
a target feature vector obtaining module, used to perform feature extraction on each type of target data to obtain target feature vectors corresponding to each type of target data;
a fusion module, used to fuse the target feature vectors based on a cross-attention mechanism to obtain a fused feature vector;
a result obtaining module, used to perform classification prediction based on the fused feature vector to obtain a traffic light perception result of the target position.
According to a third aspect of the embodiments of the present application, an electronic device is provided, including: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with each other through the communication bus; the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the traffic light sensing method described in the first aspect.
According to a fourth aspect of the embodiments of the present application, a computer storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the traffic light sensing method described in the first aspect is implemented.
The traffic light sensing method, device, equipment and storage medium provided by the embodiments of the present application acquire multiple different types of target data of the target position, obtain the target feature vectors corresponding to each type of target data, and then perform cross-attention-based feature fusion on the target feature vectors; traffic light perception is performed based on the fused feature vector. That is to say, in the embodiments of this application, cross-modal data fusion and comprehensive analysis and reasoning are performed based on multiple different modal data of the environment around the target position to obtain the final perception result. Therefore, compared with perception methods that rely only on a single modality such as image data, the perception stability and accuracy of the embodiments of the present application are higher.
Brief Description of the Drawings
In order to explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments recorded in the embodiments of this application; for those of ordinary skill in the art, other drawings can also be obtained based on these drawings.
Figure 1 is a flow chart of the steps of a traffic light sensing method according to Embodiment 1 of the present application;
Figure 2 is a schematic diagram of an example scenario in the embodiment shown in Figure 1;
Figure 3 is a flow chart of the steps of a traffic light sensing method according to Embodiment 2 of the present application;
Figure 4 is a schematic diagram of an example scenario in the embodiment shown in Figure 3;
Figure 5 is a structural block diagram of a traffic light sensing device according to Embodiment 3 of the present application;
Figure 6 is a schematic structural diagram of an electronic device according to Embodiment 4 of the present application.
具体实施方式Detailed ways
为了使本领域的人员更好地理解本申请实施例中的技术方案,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅是本申请实施例一部分实施例,而不是全部的实施例。基于本申请实施例中的实施例,本领域普通技术人员所获得的所有其他实施例,都应当属于本申请实施例保护的范围。In order to enable those in the art to better understand the technical solutions in the embodiments of the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below in conjunction with the drawings in the embodiments of the present application. Obviously, the description The embodiments are only part of the embodiments of the present application, rather than all the embodiments. Based on the examples in the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art should fall within the scope of protection of the embodiments of this application.
下面结合本申请实施例附图进一步说明本申请实施例具体实现。The specific implementation of the embodiments of the present application will be further described below with reference to the accompanying drawings of the embodiments of the present application.
实施例一Embodiment 1
参照图1,图1为根据本申请实施例一的一种交通信号灯感知方法的步骤流程图。具体地,本实施例提供的交通信号灯感知方法包括以下步骤:Referring to Figure 1, Figure 1 is a flow chart of a traffic light sensing method according to Embodiment 1 of the present application. Specifically, the traffic light sensing method provided by this embodiment includes the following steps:
步骤102,获取目标位置的多种目标数据,多种目标数据包括以下至少两种:图像数据、雷达数据、地图数据。Step 102: Acquire multiple target data at the target location. The multiple target data include at least two of the following: image data, radar data, and map data.
具体地,目标位置可以为待进行交通信号灯感知的目标路口或者目标路口周围的某一具体位置。图像数据可以为通过相机采集到的目标位置的图像等;雷达数据可以为通过激光雷达采集到的目标位置的点云数据,或者通过毫米波雷达等采集到的目标位置的三维数据,等等;地图数据可以为包含有目标位置车道线、人行横道、绿化带等实例对象的位置、形状以及尺寸等信息的数据。Specifically, the target location may be a target intersection where traffic light sensing is to be performed or a specific location around the target intersection. The image data can be images of the target position collected by cameras, etc.; radar data can be point cloud data of the target position collected by lidar, or three-dimensional data of the target position collected by millimeter wave radar, etc.; The map data may be data containing information such as the location, shape, and size of instance objects such as lane lines, crosswalks, and green belts at the target location.
本申请实施例中,多种目标数据具体可以包括:图像数据、雷达数据,或者,地图数据中的任意两种,也可以上述三种数据均包含。本领域技术人员可以理解,获取的目标数据的种类越多,最终得到的交通信号灯感知结果的准确性以及稳定性也越高。In the embodiment of the present application, the multiple target data may specifically include: image data, radar data, or any two of the map data, or may include all three of the above data. Those skilled in the art can understand that the more types of target data are obtained, the higher the accuracy and stability of the final traffic light sensing result will be.
步骤104,分别对各种目标数据进行特征提取,得到各种目标数据对应的目标特征向量。Step 104: Perform feature extraction on various target data respectively to obtain target feature vectors corresponding to various target data.
具体地,就图像数据而言,可以基于预先训练完成的特征提取模型,对图像数据进行特征提取,得到图像数据对应的目标特征向量;就雷达数据而言,可以通过预先训练完成的三维目标检测模型对雷达数据进行检测,得到目标检测结果;基于目标检测结果,得到雷达数据对应的目标特征向量;就地图数据而言,可以在获取到地图数据之后,对地图数据进行矢量化表示,得到地图数据对应的目标特征向量,例如:就 目标位置的某个具体车道线这一地图实例对象而言,可以获取该车道线上的多个采样点的位置信息,然后以每两个相邻采样点分别作为起点和终点生成向量,该向量即为该车道线实例对象的特征向量,表征该相邻两采样点之间的车道线实例对象。Specifically, as far as image data is concerned, feature extraction can be performed on the image data based on a pre-trained feature extraction model to obtain the target feature vector corresponding to the image data; as far as radar data is concerned, three-dimensional target detection can be accomplished through pre-training. The model detects the radar data and obtains the target detection results; based on the target detection results, the target feature vector corresponding to the radar data is obtained; as far as the map data is concerned, after the map data is obtained, the map data can be vectorized to obtain the map The target feature vector corresponding to the data, for example: For the map instance object of a specific lane line at the target location, the position information of multiple sampling points on the lane line can be obtained, and then each two adjacent sampling points are used as the starting point and end point respectively to generate a vector. The vector It is the feature vector of the lane line instance object, which represents the lane line instance object between the two adjacent sampling points.
Step 106: Fuse the target feature vectors based on a cross-attention mechanism to obtain a fused feature vector.
Specifically, for the target feature vector corresponding to each type of target data, the target feature vector may be adjusted with reference to the degree of similarity between it and the target feature vectors corresponding to the other types of target data, so that the adjusted target feature vector focuses on information related to the other target feature vectors and ignores information weakly correlated with them. All of the adjusted target feature vectors are then fused to obtain the fused feature vector.
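A minimal numpy sketch of this adjustment is given below. The scaled dot product followed by a softmax is an assumed choice of similarity measure; the embodiment only requires some similarity-based weighting, and the residual-style update mirrors the update rule described in Embodiment 2:

```python
import numpy as np

def cross_adjust(query: np.ndarray, others: np.ndarray) -> np.ndarray:
    """query: (d,) target feature vector of one modality.
    others: (m, d) target feature vectors of the remaining modalities."""
    scores = others @ query / np.sqrt(query.shape[0])  # similarity to each other vector
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # attention weights
    return query + weights @ others                    # query plus weighted contributions
```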
Step 108: Perform classification prediction based on the fused feature vector to obtain the traffic light sensing result for the target location.
Specifically, the traffic light sensing result finally obtained in the embodiments of the present application may include the traffic light colors of the target location in the straight-ahead direction, the left-turn direction, and the right-turn direction.
Any existing classification prediction method may be used to obtain the final sensing result based on the fused feature vector, for example, by means of a classification prediction model used for classification prediction. When a classification prediction model is used, the model may be a classifier structure with three branches, each branch outputting a binary classification result to predict the traffic light color in one of the three directions: straight ahead, left turn, or right turn. In addition, the embodiments of the present application do not limit the specific structure of the classification prediction model; for example, a multi-layer perceptron model with a relatively simple structure may be used.
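The sketch below shows one possible shape of such a three-branch head in PyTorch; the shared trunk, the layer sizes, and the two-class output per branch are illustrative assumptions consistent with the description above:

```python
import torch
import torch.nn as nn

class TrafficLightHead(nn.Module):
    """Three branches, one per direction: straight ahead, left turn, right turn."""
    def __init__(self, fused_dim: int = 256, num_classes: int = 2):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(fused_dim, 128), nn.ReLU())
        self.branches = nn.ModuleList([nn.Linear(128, num_classes) for _ in range(3)])

    def forward(self, fused: torch.Tensor):
        h = self.trunk(fused)
        return [branch(h) for branch in self.branches]  # per-direction logits

head = TrafficLightHead()
straight, left, right = head(torch.randn(1, 256))  # one logit pair per direction
```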
Referring to Figure 2, Figure 2 is a schematic diagram of a scene corresponding to Embodiment 1 of the present application. The embodiments of the present application are described below with a specific scene example, with reference to the schematic diagram shown in Figure 2:
Three types of target data of the target location are acquired: image data, radar point cloud data, and map data. Feature extraction is performed on the image data of the target location to obtain the corresponding target feature vector 1. 3D object detection is performed on the radar point cloud data of the target location to obtain three targets at the target location, for example a pedestrian, vehicle 1, and vehicle 2 (in the embodiments of the present application, the types of detectable targets may be preset according to actual needs, and neither the number of types nor the specific content of the detectable preset targets is limited; for example, the detectable preset targets may include three types, namely pedestrians, vehicles, and cyclists. Figure 2 merely takes radar data containing three preset targets as an example, which does not constitute a limitation on the embodiments of the present application). Each target corresponds to one target feature vector (characterizing features of the target such as its type, position, and shape); target feature vector 2, target feature vector 3, and target feature vector 4 in Figure 2 are the target feature vectors corresponding to the radar data. The map data of the target location (such as high-precision map data mainly used for autonomous driving) may be vectorized to obtain the corresponding target feature vectors, where each target feature vector characterizes the feature information of one instance object in the map. Assuming that the map data in Figure 2 contains four instance objects, namely lane line 1, lane line 2, lane line 3, and a crosswalk, then correspondingly target feature vector 5 characterizes the feature information of lane line 1, target feature vector 6 that of lane line 2, target feature vector 7 that of lane line 3, and target feature vector 8 that of the crosswalk; target feature vectors 5 to 8 are the target feature vectors corresponding to the map data. After the target feature vectors corresponding to the three types of target data are obtained, they may be fused based on the cross-attention mechanism to obtain a fused feature vector, and classification prediction is then performed based on the fused feature vector to obtain the traffic light sensing result: the traffic light information corresponding to each of the straight-ahead, left-turn, and right-turn directions, specifically, for example, the colors of the traffic lights in these three directions.
According to the traffic light sensing method provided by the embodiments of the present application, multiple different types of target data of the target location are acquired, the target feature vectors corresponding to the various types of target data are obtained, feature fusion based on the cross-attention mechanism is performed on the target feature vectors, and traffic light sensing is carried out based on the fused feature vector. That is, in the embodiments of the present application, cross-modal data fusion and comprehensive analytical reasoning are performed based on multiple different modalities of data about the environment around the target location to obtain the final sensing result. Therefore, compared with a sensing approach that relies only on the single modality of image data, the embodiments of the present application provide higher sensing stability and accuracy.
The traffic light sensing method of this embodiment may be executed by any appropriate electronic device with data processing capability, including but not limited to a server, a PC, and the like.
Embodiment 2
Referring to Figure 3, Figure 3 is a flowchart of the steps of a traffic light sensing method according to Embodiment 2 of the present application. Specifically, the traffic light sensing method provided by this embodiment includes the following steps:
Step 302: Acquire multiple types of target data of the target location, the multiple types of target data including at least two of the following: image data, radar data, and map data.
Specifically, in the embodiments of the present application, for image data, multiple frames of continuous image data may be acquired at a time; for radar data, multiple frames of continuous radar data may likewise be acquired at a time. For example, continuous image data or radar data of a preset number of frames of the target location before the current moment may be acquired.
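A minimal sketch of keeping the preset number of most recent frames before the current moment follows; the buffer size of 3 is an illustrative assumption:

```python
from collections import deque

NUM_FRAMES = 3  # preset number of frames (assumption)
frame_buffer: deque = deque(maxlen=NUM_FRAMES)

def on_new_frame(frame):
    """Append the latest frame; the deque drops the oldest one automatically."""
    frame_buffer.append(frame)
    return list(frame_buffer)  # frames ordered from oldest to newest
```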
The image data may be images of the target location captured by a camera; the radar data may be point cloud data of the target location collected by a lidar, three-dimensional data of the target location collected by a millimeter-wave radar, or the like; the map data may be data containing information such as the positions, shapes, and sizes of instance objects at the target location, for example lane lines, crosswalks, and green belts.
In the embodiments of the present application, the multiple types of target data may specifically include any two of image data, radar data, and map data, or may include all three. Those skilled in the art can understand that the more types of target data are acquired, the higher the accuracy and stability of the final traffic light sensing result.
Step 304: Perform feature extraction on each type of target data to obtain a feature sequence corresponding to that type of target data, the feature sequence containing multiple initial feature vectors.
For image data, each initial feature vector characterizes the feature information contained in one frame of the multiple frames of continuous image data; for radar data, each initial feature vector characterizes the feature information contained in one frame of the multiple frames of continuous radar data; for map data, the multiple initial feature vectors characterize the feature information of at least one map instance object.
Specifically, for image data, the number of initial feature vectors contained in the feature sequence equals the number of frames of image data; one initial feature vector corresponds to one frame of image data and characterizes the feature information contained in that frame. For example, when there are three frames of image data, there are correspondingly three initial feature vectors, each obtained by performing feature extraction on one frame of image data. Similarly, for radar data, the number of initial feature vectors contained in the feature sequence equals the number of frames of radar data; one initial feature vector corresponds to one frame of radar data and characterizes the feature information contained in that frame. For example, when there are three frames of radar data, there are correspondingly three initial feature vectors, each obtained by performing feature extraction on one frame of radar data.
For map data, after vectorization, a feature sequence containing multiple initial feature vectors can be obtained. These initial feature vectors characterize the feature information of the map instance objects in the map (such as lane lines, crosswalks, and green belts). For example, for a lane line with a length of 200 meters, the first 100 meters of the lane line may be represented by a first initial feature vector, and the last 100 meters by a second initial feature vector, where the first initial feature vector is obtained by vectorizing the coordinate positions of the start point and end point of the first 100-meter segment, and the second initial feature vector is obtained by vectorizing the coordinate positions of the start point and end point of the last 100-meter segment.
Step 306: Based on a self-attention mechanism, perform feature fusion on the initial feature vectors in the feature sequence corresponding to each type of target data to obtain the target feature vector corresponding to each type of target data.
Specifically, if the target data is image data or radar data, the process of performing feature fusion, based on the self-attention mechanism, on the initial feature vectors in the feature sequence corresponding to the target data to obtain the corresponding target feature vector includes:
selecting one initial feature vector from the initial feature vectors in the feature sequence corresponding to the target data as a base initial vector; calculating the attention values of the remaining initial feature vectors based on the correlation between the base initial vector and the remaining initial feature vectors; and updating the base initial vector based on the attention values of the remaining initial feature vectors to obtain the target feature vector corresponding to the target data.
The correlation between the base initial vector and the remaining initial feature vectors characterizes the degree of association between them. When the base initial vector is updated, this degree of association may be expressed by attention weights: the higher the degree of association between the base initial vector and a remaining initial feature vector, the larger the attention weight of that remaining initial feature vector when the base initial vector is updated; conversely, the lower the degree of association, the smaller the attention weight. The attention weights may be calculated using an existing attention mechanism.
Specifically, calculating the attention values of the remaining initial feature vectors based on the correlation between the base initial vector and the remaining initial feature vectors may include:
using an attention mechanism to calculate the attention weights of the remaining initial feature vectors, and then taking the product of each attention weight and the corresponding remaining initial feature vector as the attention value of that remaining initial feature vector.
Further, in multiple frames of continuous image data or radar data, the later the timestamp of a frame, the more important the feature information it contains. Therefore, in order for the finally obtained target feature vector to better characterize the feature information in the target data, in some embodiments, if the target data is image data or radar data, the initial feature vector corresponding to the frame of image data or radar data with the latest timestamp may be selected from the initial feature vectors in the feature sequence corresponding to the target data as the base initial vector, and the base initial vector is then updated based on the attention values of the remaining initial feature vectors to obtain the target feature vector corresponding to the target data.
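A minimal numpy sketch of this temporal fusion follows: the latest frame's initial feature vector serves as the base vector and is updated with the attention values of the remaining frames. The dot-product score with a softmax is an assumed realization of the correlation-based weights:

```python
import numpy as np

def fuse_temporal(frame_features: np.ndarray) -> np.ndarray:
    """frame_features: (T, d) initial feature vectors, oldest to newest; assumes T >= 2."""
    base, rest = frame_features[-1], frame_features[:-1]  # latest frame as base
    scores = rest @ base / np.sqrt(base.shape[0])         # correlation with the base
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                              # attention weights
    attention_values = weights[:, None] * rest            # weight times vector
    return base + attention_values.sum(axis=0)            # updated base vector
```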
If the target data is map data, the process of performing feature fusion, based on the self-attention mechanism, on the initial feature vectors in the feature sequence corresponding to the target data to obtain the corresponding target feature vector includes:
performing feature fusion, based on the self-attention mechanism, on the multiple initial feature vectors characterizing the feature information of each map instance object to obtain multiple self-updated feature vectors of each map instance object; and performing a max pooling operation on the multiple self-updated feature vectors of each map instance object to obtain the target feature vector of the target data.
Specifically, the multiple self-updated feature vectors of each map instance object may be obtained in the following manner:
for each initial feature vector of the map instance object, calculating the attention values of the remaining initial feature vectors based on the correlation between this initial feature vector and the remaining initial feature vectors, and updating this initial feature vector based on the attention values of the remaining initial feature vectors to obtain a self-updated feature vector.
For example, suppose that certain map data contains only one map instance object, a lane line, and that the initial feature vectors corresponding to the lane line are initial feature vector 1 and initial feature vector 2. The process of obtaining the target feature vector of this map data may then include: for initial feature vector 1, calculating the attention value of initial feature vector 2 based on the correlation (attention weight) between initial feature vector 1 and initial feature vector 2, and updating initial feature vector 1 based on the attention value of initial feature vector 2 to obtain self-updated feature vector 1 corresponding to initial feature vector 1; similarly, for initial feature vector 2, calculating the attention value of initial feature vector 1 based on the correlation (attention weight) between initial feature vector 2 and initial feature vector 1, and updating initial feature vector 2 based on the attention value of initial feature vector 1 to obtain self-updated feature vector 2 corresponding to initial feature vector 2; and then performing a max pooling operation on self-updated feature vector 1 and self-updated feature vector 2 (taking the largest element at each position across the self-updated feature vectors as the element value at the corresponding position of the target feature vector) to obtain the target feature vector of the target data, that is, of the lane line.
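The sketch below mirrors this worked example for an arbitrary number of initial feature vectors of one map instance: each vector is self-updated with the attention values of the others, and the self-updated vectors are then max-pooled elementwise. The softmax over normalized dot products is an assumed realization of the attention weights:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_map_instance(instance_vectors: np.ndarray) -> np.ndarray:
    """instance_vectors: (K, d) initial feature vectors of one map instance; assumes K >= 2."""
    updated = []
    for i, v in enumerate(instance_vectors):
        rest = np.delete(instance_vectors, i, axis=0)
        weights = softmax(rest @ v / np.sqrt(v.shape[0]))  # correlation-based weights
        updated.append(v + weights @ rest)                 # self-updated feature vector
    # Max pooling: largest element at each position across the self-updated vectors.
    return np.max(np.stack(updated), axis=0)
```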
Step 308: For each type of target feature vector, calculate the attention values of the other types of target feature vectors based on the correlation between this type of target feature vector and the other types of target feature vectors.
The correlation between this type of target feature vector and the other types of target feature vectors characterizes the degree of association between them. When calculating the attention values of the other types of target feature vectors, this degree of association may be expressed by attention weights: the higher the degree of association between this target feature vector and another target feature vector, the larger the attention weight of that other target feature vector; conversely, the lower the degree of association, the smaller the attention weight. These attention weights may likewise be calculated using an existing attention mechanism.
Specifically, for each of the other types of target feature vectors, the correlation (attention weight) between that target feature vector and the present target feature vector may be calculated, and the product of this correlation (attention weight) and that target feature vector is then taken as the attention value of that target feature vector.
For example, for a target feature vector 1 and another target feature vector 2, the process of calculating the attention value of target feature vector 2 includes: first calculating the correlation (attention weight) between target feature vector 1 and target feature vector 2, and then taking the product of this correlation (attention weight) and target feature vector 2 as the attention value of target feature vector 2.
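A minimal sketch of this per-pair computation follows; using the normalized dot product as the correlation (attention weight) is an assumption, and in practice the weights of all remaining vectors would typically be normalized together, as in the other sketches here:

```python
import numpy as np

def attention_value(v1: np.ndarray, v2: np.ndarray) -> np.ndarray:
    """Attention value of v2 with respect to v1: correlation times v2."""
    weight = float(v1 @ v2) / np.sqrt(v1.shape[0])  # correlation (attention weight)
    return weight * v2
```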
Step 310: Update this type of target feature vector based on the attention values of the other types of target feature vectors to obtain an updated target vector. Then determine whether a preset update stop condition is reached; if not, take the updated target vector as a new target feature vector and return to step 308; if so, execute step 312.
Specifically, for each type of target feature vector, after the attention values of the other types of target feature vectors are obtained, the sum of this target feature vector and the attention values of the other types of target feature vectors may be taken as the updated target vector corresponding to this target feature vector.
In addition, in the embodiments of the present application, the update stop condition may be customized according to actual needs, and its specific content is not limited here. For example, the update stop condition may be that the number of times an updated target vector has been obtained reaches a preset number; the update stop condition may also be that the correlation (attention weight) between the target vectors of two successive updates is greater than a preset correlation threshold (attention weight threshold), and so on.
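Below is a minimal numpy sketch of the iterative update of steps 308 and 310 using the fixed-round stop condition (one of the options named above); the softmax-normalized dot-product weights are an assumption:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_update(vectors: np.ndarray, rounds: int = 2) -> np.ndarray:
    """vectors: (M, d) target feature vectors, one per modality or instance."""
    for _ in range(rounds):                        # preset update count (assumption)
        updated = np.empty_like(vectors)
        for i, v in enumerate(vectors):
            rest = np.delete(vectors, i, axis=0)
            weights = softmax(rest @ v / np.sqrt(v.shape[0]))
            updated[i] = v + weights @ rest        # vector plus others' attention values
        vectors = updated
    return vectors
```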
Step 312: Perform fusion processing on the updated target vectors to obtain a fused feature vector.
The embodiments of the present application do not limit the specific fusion processing method. For example, the sum of the updated target vectors may be used directly as the fused feature vector; alternatively, a weight value may be set for each updated target vector, and a weighted sum of the updated target vectors is computed based on the set weight values to obtain the fused feature vector; a max pooling operation may also be performed on the updated target vectors to obtain the fused feature vector; and so on.
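The three fusion options named above can be sketched as follows; the choice among them, and any weight values, are left open by the embodiment:

```python
import numpy as np

def fuse_sum(updated: np.ndarray) -> np.ndarray:
    return updated.sum(axis=0)                 # plain sum of the updated vectors

def fuse_weighted(updated: np.ndarray, w: np.ndarray) -> np.ndarray:
    return (w[:, None] * updated).sum(axis=0)  # weighted sum with per-vector weights

def fuse_maxpool(updated: np.ndarray) -> np.ndarray:
    return updated.max(axis=0)                 # elementwise max pooling
```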
Step 314: Perform classification prediction based on the fused feature vector to obtain the traffic light sensing result for the target location.
Specifically, the traffic light sensing result finally obtained in the embodiments of the present application may include the traffic light information of the target location in the straight-ahead, left-turn, and right-turn directions, specifically, for example, the traffic light colors in these three directions.
Any existing classification prediction method may be used to obtain the final sensing result based on the fused feature vector, for example, by means of a classification prediction model used for classification prediction. When a classification prediction model is used, the model may be a classifier structure with three branches, each branch outputting a binary classification result to predict the traffic light color in one of the three directions: straight ahead, left turn, or right turn. In addition, the embodiments of the present application do not limit the specific structure of the classification prediction model; for example, a multi-layer perceptron model with a relatively simple structure may be used.
In the embodiments of the present application, the feature extraction for each type of target data in step 304 may be performed based on a feature extraction model; in step 306, the feature fusion, based on the self-attention mechanism, of the initial feature vectors in the feature sequence corresponding to each type of target data may be performed based on a self-attention model (for example, a transformer model based on the self-attention mechanism); steps 308 to 310 may be performed based on a cross-attention model (for example, an attention-mechanism-based transformer model); and step 314 may be performed based on a classification prediction model. Therefore, in the traffic light sensing method provided by the embodiments of the present application, once the target data is acquired, the final sensing result can be output based on a series of machine learning models. In other words, the embodiments of the present application provide an end-to-end traffic light sensing solution that requires no complex post-processing operations; the solution is therefore simpler and more widely applicable.
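To make the end-to-end flow concrete, the sketch below chains simplified stand-ins for these stages on random features; all dimensions are illustrative, the map branch is reduced to max pooling for brevity, and the real pipeline would use the trained models named above:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # shared feature dimension (assumption)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def self_fuse(seq):
    """Temporal self-attention over one modality, latest frame as base vector."""
    base, rest = seq[-1], seq[:-1]
    return base + softmax(rest @ base / np.sqrt(D)) @ rest

image_seq = rng.normal(size=(3, D))  # stand-in: 3 image-frame initial vectors
radar_seq = rng.normal(size=(3, D))  # stand-in: 3 radar-frame vectors of one target
map_seq = rng.normal(size=(2, D))    # stand-in: 2 vectors of one map instance

targets = np.stack([self_fuse(image_seq), self_fuse(radar_seq), map_seq.max(axis=0)])

for _ in range(2):                   # cross-attention update rounds (assumption)
    new_targets = []
    for i, v in enumerate(targets):
        rest = np.delete(targets, i, axis=0)
        new_targets.append(v + softmax(rest @ v / np.sqrt(D)) @ rest)
    targets = np.stack(new_targets)

fused = targets.max(axis=0)          # fusion by max pooling (one of the options)
# `fused` would then be passed to the classification prediction model.
```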
Referring to Figure 4, Figure 4 is a schematic diagram of a scene corresponding to Embodiment 2 of the present application. The embodiments of the present application are described below with a specific scene example, with reference to the schematic diagram shown in Figure 4:

Three types of target data of the target location are acquired: image data, radar point cloud data, and map data. The image data consists of three consecutive frames: a first frame, a second frame, and a third frame of image data; the radar data likewise consists of three consecutive frames: a first frame, a second frame, and a third frame of radar data. Feature extraction is performed on each of the three frames of image data to obtain the feature sequence corresponding to the image data (in the feature sequence corresponding to the image data in Figure 4, each open circle represents one initial feature vector corresponding to one frame of image data). Feature extraction is performed on each of the three frames of radar data to obtain a feature sequence composed of the initial feature vectors corresponding to the individual radar frames (assuming the radar data contains three targets in total, a pedestrian, vehicle 1, and vehicle 2, then in the feature sequence corresponding to the radar data in Figure 4, the three solid circles in each column represent the initial feature vectors of one frame of radar data, with one solid circle representing the initial feature vector of one target in that frame, and the three solid circles in each row represent the initial feature vectors of the same target across different radar frames). Feature extraction (vectorized representation) is performed on the map data to obtain a feature sequence composed of the multiple initial feature vectors corresponding to the map data (assuming the map data contains four map instance objects in total, lane line 1, lane line 2, lane line 3, and a crosswalk, then in the feature sequence corresponding to the map data in Figure 4, each arrowed line, whether solid or dashed, represents one initial feature vector, with lane line 1 corresponding to two initial feature vectors, lane line 2 to two, lane line 3 to two, and the crosswalk to four). Based on the self-attention mechanism, feature fusion is performed on the initial feature vectors in the feature sequence of each type of target data to obtain the target feature vector corresponding to each type of target data. Specifically, feature fusion is performed on the initial feature vectors in the feature sequence corresponding to the image data to obtain target feature vector 1 corresponding to the image data; feature fusion is performed on the initial feature vectors in the feature sequence corresponding to the radar data (the initial feature vectors in each row are fused separately) to obtain target feature vector 2, target feature vector 3, and target feature vector 4 corresponding to the radar data; feature fusion is performed on the initial feature vectors in the feature sequence corresponding to the map data (the initial feature vectors corresponding to each map instance object are fused separately) to obtain target feature vector 5, target feature vector 6, target feature vector 7, and target feature vector 8 corresponding to the map data. Finally, based on the cross-attention mechanism, target feature vectors 1 to 8 are fused to obtain a fused feature vector, and classification prediction is then performed based on the fused feature vector to obtain the traffic light sensing result: the traffic light colors corresponding to each of the straight-ahead, left-turn, and right-turn directions.
According to the traffic light sensing method, apparatus, device, and storage medium provided by the embodiments of the present application, multiple different types of target data of the target location are acquired, the target feature vectors corresponding to the various types of target data are obtained, feature fusion based on the cross-attention mechanism is performed on the target feature vectors, and traffic light sensing is carried out based on the fused feature vector. That is, in the embodiments of the present application, cross-modal data fusion and comprehensive analytical reasoning are performed based on multiple different modalities of data about the environment around the target location to obtain the final sensing result. Therefore, compared with a sensing approach that relies only on the single modality of image data, the embodiments of the present application provide higher sensing stability and accuracy.
In addition, before the target feature vectors corresponding to the different types of target data are fused based on the cross-attention mechanism, feature fusion is first performed, based on the self-attention mechanism, on the initial feature vectors of multiple consecutive image frames or radar frames, and on the initial feature vectors of the different map instance objects in the map data, so as to obtain the target feature vectors corresponding to the different types of target data. This self-attention-based feature fusion associates and fuses the image or radar sequence with the historical states of the traffic participants in the surrounding environment. Compared with directly obtaining a target feature vector by feature extraction from only a single frame of image or radar data, the target feature vectors thus contain richer and more important information. Therefore, the subsequent logical reasoning based on these target feature vectors yields a traffic light sensing result with higher accuracy and stability.
The traffic light sensing method of this embodiment may be executed by any appropriate electronic device with data processing capability, including but not limited to a server, a PC, and the like.
Embodiment 3
Referring to Figure 5, Figure 5 is a structural block diagram of a traffic light sensing apparatus according to Embodiment 3 of the present application. The traffic light sensing apparatus provided by the embodiments of the present application includes:
a target data acquisition module 502, configured to acquire multiple types of target data of a target location, the multiple types of target data including at least two of the following: image data, radar data, and map data;

a target feature vector obtaining module 504, configured to perform feature extraction on each type of target data to obtain the target feature vectors corresponding to each type of target data;

a fusion module 506, configured to fuse the target feature vectors based on a cross-attention mechanism to obtain a fused feature vector; and

a result obtaining module 508, configured to perform classification prediction based on the fused feature vector to obtain the traffic light sensing result of the target location.
Optionally, in some embodiments, the fusion module 506 is specifically configured to:

for each type of target feature vector, calculate the attention values of the other types of target feature vectors based on the correlation between this type of target feature vector and the other types of target feature vectors;

update this type of target feature vector based on the attention values of the other types of target feature vectors to obtain an updated target vector; and

perform fusion processing on the updated target vectors to obtain the fused feature vector.
Optionally, in some embodiments, before performing fusion processing on the updated target vectors to obtain the fused feature vector, the fusion module 506 is further configured to:

determine whether a preset update stop condition is reached;

if not, take the updated target vector as a new target feature vector and return to the step of calculating, for each type of target feature vector, the attention values of the other types of target feature vectors based on the correlation between this type of target feature vector and the other types of target feature vectors, until the update stop condition is met.
Optionally, in some embodiments, when performing the step of fusing the updated target vectors to obtain the fused feature vector, the fusion module 506 is specifically configured to:

perform a max pooling operation on the updated target vectors to obtain the fused feature vector.
Optionally, in some embodiments, the target feature vector obtaining module 504 is specifically configured to:

perform feature extraction on each type of target data to obtain a feature sequence corresponding to that type of target data, the feature sequence containing multiple initial feature vectors;

based on a self-attention mechanism, perform feature fusion on the initial feature vectors in the feature sequence corresponding to each type of target data to obtain the target feature vector corresponding to each type of target data;

wherein, for image data, each initial feature vector characterizes the feature information contained in one frame of the multiple frames of continuous image data; for radar data, each initial feature vector characterizes the feature information contained in one frame of the multiple frames of continuous radar data; and for map data, the multiple initial feature vectors characterize the feature information of at least one map instance object.
Optionally, in some embodiments, if the target data is image data or radar data, when performing the step of feature fusion, based on the self-attention mechanism, on the initial feature vectors in the feature sequence corresponding to the target data to obtain the target feature vector corresponding to the target data, the target feature vector obtaining module 504 is specifically configured to:

select one initial feature vector from the initial feature vectors in the feature sequence corresponding to the target data as a base initial vector;

calculate the attention values of the remaining initial feature vectors based on the correlation between the base initial vector and the remaining initial feature vectors; and

update the base initial vector based on the attention values of the remaining initial feature vectors to obtain the target feature vector corresponding to the target data.
Optionally, in some embodiments, when performing the step of selecting one initial feature vector from the initial feature vectors in the feature sequence corresponding to the target data as the base initial vector, the target feature vector obtaining module 504 is specifically configured to:

select, from the initial feature vectors in the feature sequence corresponding to the target data, the initial feature vector corresponding to the frame of image data or radar data with the latest timestamp as the base initial vector.
Optionally, in some embodiments, if the target data is map data, when performing the step of feature fusion, based on the self-attention mechanism, on the initial feature vectors in the feature sequence corresponding to the target data to obtain the target feature vector corresponding to the target data, the target feature vector obtaining module 504 is specifically configured to:

perform feature fusion, based on the self-attention mechanism, on the multiple initial feature vectors characterizing the feature information of each map instance object to obtain multiple self-updated feature vectors of each map instance object; and

perform a max pooling operation on the multiple self-updated feature vectors of each map instance object to obtain the target feature vector of the target data.
Optionally, in some embodiments, when obtaining the target feature vector corresponding to image data, the target feature vector obtaining module 504 is specifically configured to:

perform feature extraction on the image data based on a pre-trained feature extraction model to obtain the target feature vector corresponding to the image data;

when obtaining the target feature vector corresponding to radar data, the target feature vector obtaining module 504 is specifically configured to:

detect the radar data by means of a pre-trained three-dimensional object detection model to obtain an object detection result, and obtain the target feature vector corresponding to the radar data based on the object detection result;

when obtaining the target feature vector corresponding to map data, the target feature vector obtaining module 504 is specifically configured to:

vectorize the map data to obtain the target feature vector corresponding to the map data.
The traffic light sensing apparatus of the embodiments of the present application is used to implement the corresponding traffic light sensing method in the foregoing method Embodiment 1 or Embodiment 2, and has the beneficial effects of the corresponding method embodiment, which are not repeated here. In addition, for the functional implementation of each module in the traffic light sensing apparatus of the embodiments of the present application, reference may be made to the description of the corresponding parts in the foregoing method Embodiment 1 or Embodiment 2, which is likewise not repeated here.
Embodiment 4
Referring to Figure 6, a schematic structural diagram of an electronic device according to Embodiment 4 of the present application is shown. The specific embodiments of the present application do not limit the specific implementation of the electronic device.
As shown in Figure 6, the electronic device may include: a processor 602, a communications interface 604, a memory 606, and a communication bus 608.
Wherein:
the processor 602, the communications interface 604, and the memory 606 communicate with one another via the communication bus 608;

the communications interface 604 is used to communicate with other electronic devices or servers.
The processor 602 is configured to execute the program 610, and may specifically execute the relevant steps in the foregoing traffic light sensing method embodiments.
Specifically, the program 610 may include program code, and the program code includes computer operating instructions.
The processor 602 may be a CPU, an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The one or more processors included in a smart device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 606 is used to store the program 610. The memory 606 may include a high-speed RAM memory, and may also include a non-volatile memory, for example at least one disk memory.
The program 610 may specifically be used to cause the processor 602 to perform the following operations: acquiring multiple types of target data of a target location, the multiple types of target data including at least two of the following: image data, radar data, and map data; performing feature extraction on each type of target data to obtain the target feature vectors corresponding to each type of target data; fusing the target feature vectors based on a cross-attention mechanism to obtain a fused feature vector; and performing classification prediction based on the fused feature vector to obtain the traffic light sensing result of the target location.
For the specific implementation of each step in the program 610, reference may be made to the corresponding steps and the corresponding descriptions of the units in the foregoing traffic light sensing method embodiments, which are not repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices and modules described above may refer to the corresponding process descriptions in the foregoing method embodiments, which are likewise not repeated here.
With the electronic device of this embodiment, multiple different types of target data of the target location are acquired, the target feature vectors corresponding to the various types of target data are obtained, feature fusion based on the cross-attention mechanism is performed on the target feature vectors, and traffic light sensing is carried out based on the fused feature vector. That is, in the embodiments of the present application, cross-modal data fusion and comprehensive analytical reasoning are performed based on multiple different modalities of data about the environment around the target location to obtain the final sensing result. Therefore, compared with a sensing approach that relies only on the single modality of image data, the embodiments of the present application provide higher sensing stability and accuracy.
The embodiments of the present application further provide a computer program product, including computer instructions, the computer instructions instructing a computing device to perform the operations corresponding to any one of the traffic light sensing methods in the foregoing method embodiments.
It should be noted that, according to implementation needs, each component/step described in the embodiments of the present application may be split into more components/steps, and two or more components/steps or partial operations of components/steps may also be combined into new components/steps to achieve the purpose of the embodiments of the present application.
The above methods according to the embodiments of the present application may be implemented in hardware or firmware, or may be implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or as computer code that is originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded via a network, and then stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It can be understood that a computer, processor, microprocessor controller, or programmable hardware includes storage components (for example, RAM, ROM, flash memory, and the like) that can store or receive software or computer code, and when the software or computer code is accessed and executed by the computer, processor, or hardware, the traffic light sensing methods described herein are implemented. In addition, when a general-purpose computer accesses code for implementing the traffic light sensing methods shown herein, the execution of the code converts the general-purpose computer into a dedicated computer for executing the traffic light sensing methods shown herein.
Those of ordinary skill in the art will appreciate that the units and method steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled professionals may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the embodiments of the present application.
The above embodiments are only used to illustrate the embodiments of the present application and do not limit them. Those of ordinary skill in the relevant technical fields can also make various changes and variations without departing from the spirit and scope of the embodiments of the present application; therefore, all equivalent technical solutions also fall within the scope of the embodiments of the present application, and the patent protection scope of the embodiments of the present application shall be defined by the claims.

Claims (13)

  1. A traffic light sensing method, comprising:
    acquiring multiple types of target data of a target location, the multiple types of target data including at least two of the following: image data, radar data, and map data;
    performing feature extraction on each type of target data to obtain target feature vectors corresponding to each type of target data;
    fusing the target feature vectors based on a cross-attention mechanism to obtain a fused feature vector; and
    performing classification prediction based on the fused feature vector to obtain a traffic light sensing result of the target location.
  2. The method according to claim 1, wherein fusing the target feature vectors based on the cross-attention mechanism to obtain the fused feature vector comprises:
    for each type of target feature vector, calculating attention values of the other types of target feature vectors based on the correlation between this type of target feature vector and the other types of target feature vectors;
    updating this type of target feature vector based on the attention values of the other types of target feature vectors to obtain an updated target vector; and
    performing fusion processing on the updated target vectors to obtain the fused feature vector.
  3. The method according to claim 2, wherein before performing fusion processing on the updated target vectors to obtain the fused feature vector, the method further comprises:
    determining whether a preset update stop condition is reached;
    if not, taking the updated target vector as a new target feature vector and returning to the step of calculating, for each type of target feature vector, the attention values of the other types of target feature vectors based on the correlation between this type of target feature vector and the other types of target feature vectors, until the update stop condition is met.
  4. The method according to claim 2 or 3, wherein performing fusion processing on the updated target vectors to obtain the fused feature vector comprises:
    performing a max pooling operation on the updated target vectors to obtain the fused feature vector.
  5. The method according to claim 1, wherein performing feature extraction on each type of target data separately to obtain the target feature vectors corresponding to the various types of target data comprises:
    performing feature extraction on each type of target data to obtain a feature sequence corresponding to that type of target data, the feature sequence containing multiple initial feature vectors;
    performing, based on a self-attention mechanism, feature fusion on the initial feature vectors in the feature sequence corresponding to each type of target data to obtain the target feature vector corresponding to that type of target data;
    wherein, for image data, each initial feature vector represents the feature information contained in one frame of image data among multiple frames of consecutive image data; for radar data, each initial feature vector represents the feature information contained in one frame of radar data among multiple frames of consecutive radar data; and for map data, the multiple initial feature vectors represent feature information of at least one map instance object.
  6. The method according to claim 5, wherein, if the target data is image data or radar data, the process of performing feature fusion, based on the self-attention mechanism, on the initial feature vectors in the feature sequence corresponding to the target data to obtain the target feature vector corresponding to the target data comprises:
    selecting one initial feature vector from the initial feature vectors in the feature sequence corresponding to the target data as a reference initial vector;
    calculating attention values of the remaining initial feature vectors based on the degree of correlation between the reference initial vector and the remaining initial feature vectors;
    updating the reference initial vector based on the attention values of the remaining initial feature vectors to obtain the target feature vector corresponding to the target data.
  7. The method according to claim 6, wherein selecting one initial feature vector from the initial feature vectors in the feature sequence corresponding to the target data as the reference initial vector comprises:
    selecting, from the initial feature vectors in the feature sequence corresponding to the target data, the initial feature vector corresponding to the frame of image data or radar data with the latest timestamp as the reference initial vector.
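    A minimal sketch of claims 6 and 7 together for a frame sequence (image or radar): the vector of the most recent frame serves as the reference, and the older frames contribute through attention values computed from their correlation with it. Shapes, names, and the dot-product correlation are assumptions; at least two frames are assumed:

```python
# Temporal fusion sketch for claims 6-7 (dot-product correlation assumed).
import torch

def fuse_frame_sequence(frame_vecs, timestamps):
    """frame_vecs: (T, dim) tensor, one initial feature vector per frame."""
    ref_idx = int(torch.tensor(timestamps).argmax())   # latest frame (claim 7)
    ref = frame_vecs[ref_idx]                          # reference initial vector
    rest = torch.cat([frame_vecs[:ref_idx],
                      frame_vecs[ref_idx + 1:]])       # remaining vectors, (T-1, dim)
    attn = torch.softmax(rest @ ref, dim=0)            # attention values (claim 6)
    return ref + attn @ rest                           # updated reference vector
```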
  8. The method according to claim 5, wherein, if the target data is map data, the process of performing feature fusion, based on the self-attention mechanism, on the initial feature vectors in the feature sequence corresponding to the target data to obtain the target feature vector corresponding to the target data comprises:
    performing feature fusion, based on the self-attention mechanism, on the multiple initial feature vectors representing the feature information of each map instance object, to obtain multiple self-updated feature vectors for each map instance object;
    performing a max pooling operation on the multiple self-updated feature vectors of each map instance object to obtain the target feature vector of the target data.
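    A minimal sketch of claim 8, assuming plain dot-product self-attention within each map instance object; how the per-instance pooled vectors are combined is left open by the claim, so a further max pool across instances is assumed here:

```python
# Map-data fusion sketch for claim 8 (self-attention form is assumed).
import torch

def self_update(x):
    """x: (N, dim) initial feature vectors of one map instance object."""
    attn = torch.softmax(x @ x.T, dim=-1)   # (N, N) self-attention weights
    return attn @ x                         # (N, dim) self-updated vectors

def map_target_vector(instances):
    """instances: list of (N_i, dim) tensors, one per map instance object."""
    pooled = [self_update(inst).max(dim=0).values for inst in instances]
    # Cross-instance combination is an assumption, not specified by the claim.
    return torch.stack(pooled).max(dim=0).values
```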
  9. The method according to claim 1, wherein the target feature vector corresponding to the image data is obtained by:
    performing feature extraction on the image data based on a pre-trained feature extraction model to obtain the target feature vector corresponding to the image data;
    the target feature vector corresponding to the radar data is obtained by:
    detecting the radar data through a pre-trained three-dimensional object detection model to obtain an object detection result, and obtaining the target feature vector corresponding to the radar data based on the object detection result;
    the target feature vector corresponding to the map data is obtained by:
    vectorizing the map data to obtain the target feature vector corresponding to the map data.
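    A hedged sketch of claim 9's three extraction paths. The claim names no specific models; the tiny CNN, the mean-pooled stand-in for a 3D detector (e.g. a PointPillars-style model), and the segment-wise map vectorization (VectorNet-style) below are all placeholder assumptions:

```python
# Per-modality extraction stand-ins for claim 9 (all models are assumed).
import torch
import torch.nn as nn

# Image path: a pre-trained feature extraction model; a tiny CNN stands in.
# Expects a (B, 3, H, W) batch and returns (B, 256) features.
image_model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                            nn.Linear(16, 256))

def radar_vector(points):
    """points: (N, 4) point cloud. A real system would pool features from a
    pre-trained 3D detector's detection result; a global mean stands in."""
    return points.mean(dim=0).repeat(64)        # (256,) placeholder vector

def map_vector(polyline):
    """polyline: (P, 2) lane/stop-line vertices. Vectorized representation:
    one (start, end) vector per segment, padded to a fixed size."""
    segs = torch.cat([polyline[:-1], polyline[1:]], dim=1)   # (P-1, 4)
    out = torch.zeros(256)
    flat = segs.flatten()[:256]
    out[:flat.numel()] = flat
    return out
```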
  10. A traffic signal lamp sensing apparatus, comprising:
    a target data acquisition module, configured to acquire multiple types of target data for a target position, the multiple types of target data comprising at least two of the following: image data, radar data, and map data;
    a target feature vector obtaining module, configured to perform feature extraction on each type of target data separately to obtain target feature vectors corresponding to the various types of target data;
    a fusion module, configured to fuse the various target feature vectors based on a cross-attention mechanism to obtain a fused feature vector;
    a result obtaining module, configured to perform classification prediction based on the fused feature vector to obtain a traffic signal lamp sensing result for the target position.
  11. An electronic device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;
    the memory is configured to store at least one executable instruction that causes the processor to perform operations corresponding to the traffic signal lamp sensing method according to any one of claims 1-9.
  12. A computer storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the traffic signal lamp sensing method according to any one of claims 1-9.
  13. A computer program product, comprising computer instructions that instruct a computing device to perform operations corresponding to the traffic signal lamp sensing method according to any one of claims 1-9.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210599282.6A CN114694123B (en) 2022-05-30 2022-05-30 Traffic signal lamp sensing method, device, equipment and storage medium
CN202210599282.6 2022-05-30

Publications (1)

Publication Number
WO2023231991A1 (en)

Family ID: 82144742

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/096961 WO2023231991A1 (en) 2022-05-30 2023-05-29 Traffic signal lamp sensing method and apparatus, and device and storage medium

Country Status (2)

Country Link
CN (1) CN114694123B (en)
WO (1) WO2023231991A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102652486B1 (en) * 2021-09-24 2024-03-29 (주)오토노머스에이투지 Method for predicting traffic light information by using lidar and server using the same
CN114694123B (en) * 2022-05-30 2022-09-27 阿里巴巴达摩院(杭州)科技有限公司 Traffic signal lamp sensing method, device, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109583415A (en) * 2018-12-11 2019-04-05 兰州大学 A kind of traffic lights detection and recognition methods merged based on laser radar with video camera
CN111563551A (en) * 2020-04-30 2020-08-21 支付宝(杭州)信息技术有限公司 Multi-mode information fusion method and device and electronic equipment
KR20200102907A (en) * 2019-11-12 2020-09-01 써모아이 주식회사 Method and apparatus for object recognition based on visible light and infrared fusion image
CN114254696A (en) * 2021-11-30 2022-03-29 上海西虹桥导航技术有限公司 Visible light, infrared and radar fusion target detection method based on deep learning
US20220101087A1 (en) * 2020-09-30 2022-03-31 Qualcomm Incorporated Multi-modal representation based event localization
CN114419412A (en) * 2022-03-31 2022-04-29 江西财经大学 Multi-modal feature fusion method and system for point cloud registration
CN114694123A (en) * 2022-05-30 2022-07-01 阿里巴巴达摩院(杭州)科技有限公司 Traffic signal lamp sensing method, device, equipment and storage medium

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102015214743A1 (en) * 2015-08-03 2017-02-09 Audi Ag Method and device in a motor vehicle for improved data fusion in an environment detection
CN107316488B (en) * 2017-08-23 2021-01-12 苏州豪米波技术有限公司 Signal lamp identification method, device and system
DE102019215440B4 (en) * 2019-10-09 2021-04-29 Zf Friedrichshafen Ag Recognition of traffic signs
CN111507210B (en) * 2020-03-31 2023-11-21 华为技术有限公司 Traffic signal lamp identification method, system, computing equipment and intelligent vehicle
CN111652050B (en) * 2020-04-20 2024-04-02 宁波吉利汽车研究开发有限公司 Traffic sign positioning method, device, equipment and medium
CN111582189B (en) * 2020-05-11 2023-06-23 腾讯科技(深圳)有限公司 Traffic signal lamp identification method and device, vehicle-mounted control terminal and motor vehicle
CN111950467B (en) * 2020-08-14 2021-06-25 清华大学 Fusion network lane line detection method based on attention mechanism and terminal equipment
CN112580460A (en) * 2020-12-11 2021-03-30 西人马帝言(北京)科技有限公司 Traffic signal lamp identification method, device, equipment and storage medium
CN112507947A (en) * 2020-12-18 2021-03-16 宜通世纪物联网研究院(广州)有限公司 Gesture recognition method, device, equipment and medium based on multi-mode fusion
CN112488083B (en) * 2020-12-24 2024-04-05 杭州电子科技大学 Identification method, device and medium of traffic signal lamp based on key point extraction of hetmap
CN112861748B (en) * 2021-02-22 2022-07-12 奥特酷智能科技(南京)有限公司 Traffic light detection system and method in automatic driving
CN113065590B (en) * 2021-03-26 2021-10-08 清华大学 Vision and laser radar multi-mode data fusion method based on attention mechanism
CN113343849A (en) * 2021-06-07 2021-09-03 西安恒盛安信智能技术有限公司 Fusion sensing equipment based on radar and video
CN113421305B (en) * 2021-06-29 2023-06-02 上海高德威智能交通系统有限公司 Target detection method, device, system, electronic equipment and storage medium
CN113269156B (en) * 2021-07-02 2023-04-18 昆明理工大学 Signal lamp detection and identification method and system based on multi-scale feature fusion
CN114398937B (en) * 2021-12-01 2022-12-27 北京航空航天大学 Image-laser radar data fusion method based on mixed attention mechanism
CN113879339A (en) * 2021-12-07 2022-01-04 阿里巴巴达摩院(杭州)科技有限公司 Decision planning method for automatic driving, electronic device and computer storage medium
CN114549542A (en) * 2021-12-24 2022-05-27 阿里巴巴达摩院(杭州)科技有限公司 Visual semantic segmentation method, device and equipment


Also Published As

Publication number Publication date
CN114694123B (en) 2022-09-27
CN114694123A (en) 2022-07-01


Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 23815170
    Country of ref document: EP
    Kind code of ref document: A1