WO2023231991A1 - Traffic light detection method and apparatus, device, and storage medium - Google Patents

Traffic light detection method and apparatus, device, and storage medium

Info

Publication number
WO2023231991A1
WO2023231991A1 (PCT/CN2023/096961)
Authority
WO
WIPO (PCT)
Prior art keywords
target
data
feature
feature vector
initial
Prior art date
Application number
PCT/CN2023/096961
Other languages
English (en)
Chinese (zh)
Inventor
王磊
刘挺
卿泉
Original Assignee
阿里巴巴达摩院(杭州)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴达摩院(杭州)科技有限公司
Publication of WO2023231991A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/254 Fusion techniques of classification results, e.g. of results related to same input data

Definitions

  • Embodiments of the present application relate to the field of computer technology, and in particular to a traffic light sensing method, apparatus, device, and storage medium.
  • Traffic light perception refers to accurately identifying the color and control direction of traffic lights at intersections. It is a very important task in fields such as autonomous driving.
  • A common solution for traffic light perception is to acquire image data containing the traffic lights and detect that image data with a target detection model to obtain a corresponding perception result.
  • However, this scheme depends heavily on the image content and is therefore unstable. For example, when the traffic lights are occluded by surrounding objects such as large vehicles, or are invisible in the image because of rainy weather, the scheme cannot produce a perception result.
  • Embodiments of the present application provide a traffic light sensing method, apparatus, device, and storage medium to at least partially solve the above problems.
  • According to a first aspect of the embodiments of the present application, a traffic light sensing method is provided, including:
  • acquiring multiple target data of a target location, the multiple target data including at least two of the following: image data, radar data, and map data;
  • performing feature extraction on each type of target data respectively to obtain target feature vectors corresponding to the various target data; performing fusion processing on the various target feature vectors based on a cross-attention mechanism to obtain a fused feature vector; and
  • performing classification prediction based on the fused feature vector to obtain a traffic light perception result of the target location.
  • According to a second aspect, a traffic light sensing apparatus is provided, including:
  • a target data acquisition module, used to acquire multiple target data of the target location,
  • where the multiple target data include at least two of the following: image data, radar data, and map data;
  • a target feature vector obtaining module, used to perform feature extraction on each type of target data respectively and obtain target feature vectors corresponding to the various target data;
  • a fusion module, used to perform fusion processing on the various target feature vectors based on the cross-attention mechanism to obtain a fused feature vector; and
  • a result obtaining module, used to perform classification prediction based on the fused feature vector to obtain the traffic light perception result of the target location.
  • According to a third aspect, an electronic device is provided, including: a processor, a memory, a communication interface, and a communication bus,
  • where the processor, the memory, and the communication interface communicate with one another through the communication bus; the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the traffic light sensing method described in the first aspect.
  • According to a fourth aspect, a computer storage medium is provided, on which a computer program is stored;
  • when the program is executed by a processor, the traffic light sensing method described in the first aspect is implemented.
  • With the traffic light sensing method, apparatus, device, and storage medium provided by the embodiments of the present application, multiple different target data of the target location are acquired, target feature vectors corresponding to the various target data are obtained, feature fusion based on the cross-attention mechanism is then performed on the target feature vectors, and traffic light perception is based on the resulting fused feature vector. That is to say, in the embodiments of this application, cross-modal data fusion and comprehensive analysis and reasoning are performed based on multiple different modal data of the environment around the target location to obtain the final perception result. Therefore, compared with perception methods that rely on only a single modality such as image data, the perception of the embodiments of the present application is more stable and more accurate.
  • Figure 1 is a flow chart of the steps of a traffic light sensing method according to Embodiment 1 of the present application;
  • Figure 2 is a schematic diagram of an example scenario in the embodiment shown in Figure 1;
  • Figure 3 is a flow chart of the steps of a traffic light sensing method according to Embodiment 2 of the present application;
  • Figure 4 is a schematic diagram of an example scenario in the embodiment shown in Figure 3;
  • Figure 5 is a structural block diagram of a traffic light sensing apparatus according to Embodiment 3 of the present application;
  • Figure 6 is a schematic structural diagram of an electronic device according to Embodiment 4 of the present application.
  • Figure 1 is a flow chart of a traffic light sensing method according to Embodiment 1 of the present application. Specifically, the traffic light sensing method provided by this embodiment includes the following steps:
  • Step 102 Acquire multiple target data at the target location.
  • the multiple target data include at least two of the following: image data, radar data, and map data.
  • the target location may be a target intersection where traffic light sensing is to be performed or a specific location around the target intersection.
  • the image data can be images of the target position collected by cameras, etc.; radar data can be point cloud data of the target position collected by lidar, or three-dimensional data of the target position collected by millimeter wave radar, etc.;
  • the map data may be data containing information such as the location, shape, and size of instance objects such as lane lines, crosswalks, and green belts at the target location.
  • The multiple target data may specifically include any two of the image data, the radar data, and the map data, or may include all three of the above data.
  • Step 104 Perform feature extraction on various target data respectively to obtain target feature vectors corresponding to various target data.
  • For the image data, feature extraction can be performed based on a pre-trained feature extraction model to obtain the target feature vector corresponding to the image data. For the radar data, the radar data can be detected by a pre-trained three-dimensional target detection model to obtain target detection results, and the target feature vector corresponding to the radar data is obtained based on the target detection results. For the map data, after the map data is obtained, the map data can be vectorized to obtain the target feature vector corresponding to the map data.
  • For example, for a map instance object such as a specific lane line at the target location, the position information of multiple sampling points on the lane line can be obtained, and each pair of adjacent sampling points is then used as a start point and an end point respectively to generate a vector.
  • Each such vector is a feature vector of the lane-line instance object and represents the part of the lane line between the two adjacent sampling points.
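As an illustrative sketch only (the patent does not prescribe an implementation; the function name and the 2-D coordinates are assumptions), the lane-line vectorization described above might look like this:

```python
import numpy as np

def vectorize_lane_line(sample_points: np.ndarray) -> np.ndarray:
    # sample_points: (N, 2) array of (x, y) positions sampled along the lane line.
    # Each pair of adjacent sampling points becomes one start->end segment vector,
    # giving an (N-1, 4) array of [x_start, y_start, x_end, y_end] rows.
    starts = sample_points[:-1]
    ends = sample_points[1:]
    return np.concatenate([starts, ends], axis=1)

# A lane line sampled at three points yields two segment feature vectors.
points = np.array([[0.0, 0.0], [50.0, 1.0], [100.0, 2.5]])
segments = vectorize_lane_line(points)  # shape (2, 4)
```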
  • Step 106 Based on the cross attention mechanism, various target feature vectors are fused to obtain a fused feature vector.
  • Specifically, the target feature vector corresponding to each type of target data can be adjusted with reference to the similarity between the target feature vectors corresponding to the other target data and the target feature vector corresponding to this target data, so that the adjusted target feature vector emphasizes information related to the other target feature vectors and downplays information that is less relevant to them. All the adjusted target feature vectors are then fused to obtain the fused feature vector.
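A minimal sketch of this similarity-driven adjustment, assuming scaled dot-product attention, a residual-style update, and summation as the final fusion step (none of which the patent fixes):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adjust_with_cross_attention(query_vec: np.ndarray, other_vecs: np.ndarray) -> np.ndarray:
    # Similarity between this target feature vector and the others
    # -> attention weights -> attention values -> adjusted vector.
    scores = other_vecs @ query_vec / np.sqrt(query_vec.shape[0])
    weights = softmax(scores)                       # attention weights
    attention_values = weights[:, None] * other_vecs
    return query_vec + attention_values.sum(axis=0)

rng = np.random.default_rng(0)
vecs = rng.normal(size=(3, 8))                      # e.g. image / radar / map vectors
adjusted = np.stack([
    adjust_with_cross_attention(vecs[i], np.delete(vecs, i, axis=0))
    for i in range(len(vecs))
])
fused_feature_vector = adjusted.sum(axis=0)
```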
  • Step 108 Perform classification prediction based on the fused feature vector to obtain the traffic light perception result at the target location.
  • The traffic light perception result finally obtained by the embodiment of the present application may include, for the target location, the traffic light colors in the straight, left-turn, and right-turn directions.
  • any existing classification prediction method can be used to obtain the final perception result based on the fusion feature vector.
  • For example, the perception result is obtained through a classification prediction model used for classification prediction.
  • The classification prediction model can be a classifier structure with three branches, where each branch outputs a classification result predicting the color of the traffic light in one of the directions: straight, left turn, or right turn.
  • The specific structure of the classification prediction model is not limited; for example, a multi-layer perceptron model with a relatively simple structure can be used, and so on.
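A hedged sketch of such a three-branch head (the color set, dimensions, and random stand-in weights are assumptions; a real model would be trained):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, C = 64, 32, 3                    # fused dim, hidden dim, color classes (assumed)
COLORS = ("red", "green", "yellow")    # assumed label set

# One small MLP branch per direction, per the three-branch classifier above.
branches = {
    direction: (rng.normal(size=(D, H)) * 0.1, rng.normal(size=(H, C)) * 0.1)
    for direction in ("straight", "left_turn", "right_turn")
}

def predict_lights(fused_vec: np.ndarray) -> dict:
    result = {}
    for direction, (w1, w2) in branches.items():
        hidden = np.maximum(fused_vec @ w1, 0.0)   # ReLU hidden layer
        logits = hidden @ w2
        result[direction] = COLORS[int(np.argmax(logits))]
    return result

print(predict_lights(rng.normal(size=D)))
```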
  • Figure 2 is a schematic diagram of a scene corresponding to Embodiment 1 of the present application.
  • With reference to Figure 2, a specific scenario example is used below to describe the embodiment of the present application:
  • the types of targets that can be detected can be preset according to actual needs. In the embodiment of this application, there is no limit to the number of types and specific content of the preset targets that can be detected.
  • the preset targets that can be detected can include 3 types, namely: pedestrians, vehicles, and cyclists, etc.
  • Each detected target corresponds to a target feature vector (used to characterize the type, location, shape, and other characteristics of the target);
  • target feature vector 2, target feature vector 3, and target feature vector 4 in Figure 2 are the target feature vectors corresponding to the radar data.
  • The map data of the target location can be vectorized to obtain the corresponding target feature vectors, where each target feature vector is used to represent the feature information of one map instance object. Assuming the map data in Figure 2 contains 4 instance objects, namely lane line 1, lane line 2, lane line 3, and a crosswalk:
  • target feature vector 5 represents the characteristic information of lane line 1
  • Target feature vector 6 represents the feature information of lane line 2
  • target feature vector 7 represents the feature information of lane line 3
  • target feature vector 8 represents the feature information of crosswalk.
  • Target feature vectors 5-8 are the target feature vectors corresponding to the map data. After the target feature vectors corresponding to the three types of target data are obtained, these target feature vectors can be fused based on the cross-attention mechanism to obtain the fused feature vector, and classification prediction is then performed based on the fused feature vector to obtain the traffic light perception result: the traffic light information corresponding to the three directions of going straight, turning left, and turning right, specifically, for example, the colors of the traffic lights for those three directions.
  • With the traffic light sensing method provided by the embodiment of the present application, multiple different target data of the target location are acquired, target feature vectors corresponding to the various target data are obtained, feature fusion based on the cross-attention mechanism is then performed on the target feature vectors, and traffic light perception is based on the resulting fused feature vector. That is to say, in the embodiments of this application, cross-modal data fusion and comprehensive analysis and reasoning are performed based on multiple different modal data of the environment around the target location to obtain the final perception result. Therefore, compared with perception methods that rely on only a single modality such as image data, the perception of the embodiments of the present application is more stable and more accurate.
  • the traffic light sensing method in this embodiment can be executed by any appropriate electronic device with data processing capabilities, including but not limited to: servers, PCs, etc.
  • Figure 3 is a flow chart of a traffic light sensing method according to Embodiment 2 of the present application. Specifically, the traffic light sensing method provided by this embodiment includes the following steps:
  • Step 302 Acquire multiple target data at the target location.
  • the multiple target data include at least two of the following: image data, radar data, and map data.
  • For the target data of image data, multiple frames of continuous image data can be acquired; likewise, for the target data of radar data, multiple frames of continuous radar data can be acquired. For example, continuous image data or radar data of a preset number of frames of the target location before the current moment can be obtained.
  • the image data can be images of the target position collected by cameras, etc.
  • radar data can be point cloud data of the target position collected by lidar, or three-dimensional data of the target position collected by millimeter wave radar, etc.
  • Map data can include data on the location, shape, size, and other information of instance objects at the target location, such as lane lines, crosswalks, and green belts.
  • The multiple target data may specifically include any two of the image data, the radar data, and the map data, or may include all three of the above data.
  • Step 304 Perform feature extraction for each type of target data to obtain a feature sequence corresponding to the target data.
  • the feature sequence contains multiple initial feature vectors.
  • For image data, each initial feature vector represents the feature information contained in one frame of the multiple frames of continuous image data; for radar data, each initial feature vector represents the feature information contained in one frame of the multiple frames of continuous radar data; for map data, the multiple initial feature vectors represent the feature information of at least one map instance object.
  • For image data, the number of initial feature vectors contained in the feature sequence is the same as the number of frames of image data.
  • One initial feature vector corresponds to one frame of image data and is used to characterize the feature information contained in that frame; for example, when there are 3 frames of image data, there are also 3 corresponding initial feature vectors.
  • Each initial feature vector is a feature vector obtained by feature extraction on one frame of image data. Similarly, for radar data,
  • the number of initial feature vectors contained in the feature sequence is the same as the number of frames of radar data.
  • One initial feature vector corresponds to one frame of radar data and is used to characterize the feature information contained in that frame; for example, when the radar data totals 3 frames, there are also 3 corresponding initial feature vectors, and each initial feature vector is obtained by feature extraction on one frame of radar data.
  • a feature sequence containing a variety of initial feature vectors can be obtained.
  • For map data, the multiple initial feature vectors are used to represent the feature information of the map instance objects in the map (such as lane lines, crosswalks, and green belts). For example, for a lane line with a length of 200 meters, the first 100 meters of the lane line can be represented by a first initial feature vector, and the subsequent 100 meters can be represented by a second initial feature vector,
  • where the first initial feature vector is obtained by vectorizing the coordinate positions of the start point and end point of the first 100 meters of the lane line, and the second initial feature vector is obtained by vectorizing the coordinate positions of the start point and end point of the subsequent 100 meters.
  • Step 306 Based on the self-attention mechanism, feature fusion is performed on each initial feature vector in the feature sequence corresponding to each type of target data to obtain a target feature vector corresponding to each type of target data.
  • When the target data is image data or radar data,
  • the process of performing feature fusion on the initial feature vectors in the feature sequence corresponding to the target data to obtain the corresponding target feature vector includes: selecting one initial feature vector from the feature sequence as a base initial vector, and calculating the attention values of the remaining initial feature vectors based on the correlation degree between the base initial vector and the remaining initial feature vectors.
  • The correlation degree between the base initial vector and the remaining initial feature vectors represents how strongly the base initial vector and the remaining initial feature vectors are related,
  • and can be characterized by an attention weight:
  • the higher the correlation between the base initial vector and a remaining initial feature vector, the greater the attention weight of that remaining initial feature vector; conversely, the lower the correlation, the smaller its attention weight.
  • The above attention weight can be calculated using an existing attention mechanism (attention method).
  • Calculating the attention values of the remaining initial feature vectors may include:
  • using the attention mechanism to calculate the attention weight of each remaining initial feature vector, and then taking the product of that attention weight and the remaining initial feature vector as the attention value of that remaining initial feature vector.
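For concreteness, one common instantiation of such an attention computation is scaled dot-product attention; this particular formula is an illustrative assumption, since the text only requires an existing attention mechanism. With $x_b$ the base initial vector, $x_j$ a remaining initial feature vector, and $d$ the feature dimension:

```latex
w_{bj} = \operatorname{softmax}_{j}\!\left( \frac{x_b \cdot x_j}{\sqrt{d}} \right), \qquad
a_j = w_{bj}\, x_j, \qquad
\tilde{x}_b = x_b + \sum_{j \neq b} a_j
```

Here $w_{bj}$ is the attention weight of $x_j$, $a_j$ is its attention value (weight times vector, as above), and $\tilde{x}_b$ is the base vector updated with the attention values.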
  • Generally, the later the timestamp of the data, the more important the feature information it contains. Therefore, in order to make the final target feature vector better characterize the feature information in the target data,
  • when the target data is image data or radar data,
  • the initial feature vector corresponding to the last frame of image data or radar data can be used as the base initial vector, and the base initial vector is updated based on the attention values of the remaining initial feature vectors to obtain the target feature vector corresponding to the target data.
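A minimal sketch of this temporal fusion, assuming scaled dot-product attention weights and an additive update (the helper name and dimensions are placeholders):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_frame_sequence(frame_feats: np.ndarray) -> np.ndarray:
    # frame_feats: (T, d) initial feature vectors of T consecutive frames, in
    # timestamp order. The last frame serves as the base initial vector and is
    # updated with the attention values of the earlier frames.
    base, rest = frame_feats[-1], frame_feats[:-1]
    weights = softmax(rest @ base / np.sqrt(base.shape[0]))
    attention_values = weights[:, None] * rest
    return base + attention_values.sum(axis=0)      # target feature vector

frames = np.random.default_rng(1).normal(size=(3, 16))  # e.g. 3 image frames
target_vector = fuse_frame_sequence(frames)
```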
  • When the target data is map data,
  • the process of performing feature fusion on the initial feature vectors in the feature sequence corresponding to the target data to obtain the corresponding target feature vector includes:
  • for the initial feature vectors of each map instance object, performing feature fusion based on the self-attention mechanism to obtain multiple self-updated feature vectors of each map instance object; and performing
  • a maximum pooling operation on the multiple self-updated feature vectors of each map instance object to obtain the target feature vector of the target data.
  • The multiple self-updated feature vectors of each map instance object can be obtained in the following way.
  • Assume the map data contains only one map instance object, a lane line, and that
  • the initial feature vectors corresponding to the lane line are initial feature vector 1 and initial feature vector 2.
  • The process of obtaining the target feature vector of the map data can then include:
  • for initial feature vector 1, calculating the attention value of initial feature vector 2 based on the correlation degree (attention weight) between initial feature vector 1 and initial feature vector 2, and updating initial feature vector 1 based on that attention value to obtain self-updated feature vector 1; similarly, for initial feature vector 2, calculating the attention value of initial feature vector 1 based on the correlation degree (attention weight) between initial feature vector 2 and initial feature vector 1,
  • and updating initial feature vector 2 based on that attention value to obtain self-updated feature vector 2; and then performing a maximum pooling operation on self-updated feature vector 1 and self-updated feature vector 2 (taking the largest element at each position across the updated feature vectors as the element value at the corresponding position of the target feature vector) to obtain the target feature vector of the target data (that is, of the lane line).
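The two-vector lane-line example above, written out as a sketch (scaled dot-product weights are assumed; the max pooling follows the element-wise rule just described):

```python
import numpy as np

def self_update(vecs: np.ndarray) -> np.ndarray:
    # Each initial feature vector of a map instance object attends to the other
    # vectors of the same object and is updated with their attention values.
    scores = vecs @ vecs.T / np.sqrt(vecs.shape[1])
    np.fill_diagonal(scores, -np.inf)               # attend to the *other* vectors only
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return vecs + weights @ vecs                    # self-updated feature vectors

lane_line = np.random.default_rng(2).normal(size=(2, 8))  # initial vectors 1 and 2
updated = self_update(lane_line)
target_vector = updated.max(axis=0)                 # element-wise max pooling
```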
  • Step 308 For each type of target feature vector, calculate the attention values of other types of target feature vectors based on the correlation between this type of target feature vector and other types of target feature vectors.
  • The correlation degree between this type of target feature vector and the other types of target feature vectors represents how strongly this type of target feature vector and the other types of target feature vectors are related.
  • This degree of association can be characterized by an attention weight:
  • the higher the degree of correlation, the greater the attention weight of the other types of target feature vectors;
  • conversely, the lower the degree of correlation between this type of target feature vector and the other types of target feature vectors, the smaller the attention weight of the other types of target feature vectors.
  • The above attention weight can also be calculated using an existing attention mechanism (attention method).
  • Specifically, for each other type of target feature vector, the correlation degree (attention weight) between that target feature vector and this type of target feature vector can be calculated, and then the product of the correlation degree (attention weight) and
  • the other type of target feature vector is used as the attention value of that other type of target feature vector.
  • For example, when calculating the attention value of target feature vector 2 with respect to target feature vector 1, the process includes: first calculating the correlation degree (attention weight) between target feature vector 1 and target feature vector 2, and then using the product of that correlation degree (attention weight) and target feature vector 2 as the attention value of target feature vector 2.
  • Step 310 Based on the attention values of other types of target feature vectors, update the target feature vectors of this type to obtain updated target vectors. Afterwards, it is determined whether the preset update stop condition is reached. If not, the updated target vector is used as the new target feature vector, and the process returns to step 308; if so, step 312 is executed.
  • Specifically, the sum of this type of target feature vector and the attention values of the other types of target feature vectors can be used as the updated target vector corresponding to this type of target feature vector.
  • the update stop condition can be customized according to actual needs, and the specific content of the update stop condition is not limited here.
  • the update stop condition can be that the number of times the target vector is updated reaches a preset number
  • The update stop condition can also be that the correlation degree (attention weight) between the target vectors obtained from two successive updates is greater than a preset correlation threshold (attention weight threshold), and so on.
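A sketch of the update loop of steps 308-310 under both stop conditions (the round count, threshold value, and cosine similarity measure are assumptions for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def iterate_cross_updates(target_vecs, max_rounds=3, sim_threshold=0.999):
    # Each round updates every target feature vector from the others (steps 308-310);
    # iteration stops after a preset number of rounds, or once an update round
    # barely changes the vectors.
    vecs = [v.copy() for v in target_vecs]
    for _ in range(max_rounds):
        new_vecs = []
        for i, v in enumerate(vecs):
            others = np.stack([u for j, u in enumerate(vecs) if j != i])
            w = softmax(others @ v / np.sqrt(v.shape[0]))
            new_vecs.append(v + w @ others)         # vector plus attention values
        if all(cosine(a, b) > sim_threshold for a, b in zip(vecs, new_vecs)):
            return new_vecs
        vecs = new_vecs
    return vecs
```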
  • Step 312 Perform fusion processing on various updated target vectors to obtain a fused feature vector.
  • Specifically, the sum of the various updated target vectors can be used directly as the fused feature vector; alternatively, a weight value can be set separately for each updated target vector, and a weighted sum of the various updated target vectors is then computed based on the set weight values to obtain the fused feature vector; a maximum pooling operation can also be performed on the various updated target vectors to obtain the fused feature vector, and so on.
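The three fusion options just listed, as one compact sketch (function and parameter names are placeholders):

```python
import numpy as np

def fuse(updated_vecs: np.ndarray, mode: str = "sum", weights=None) -> np.ndarray:
    # Plain sum, weighted sum with per-vector weights, or element-wise max pooling.
    if mode == "sum":
        return updated_vecs.sum(axis=0)
    if mode == "weighted":
        return (np.asarray(weights)[:, None] * updated_vecs).sum(axis=0)
    if mode == "maxpool":
        return updated_vecs.max(axis=0)
    raise ValueError(f"unknown fusion mode: {mode}")

vecs = np.random.default_rng(3).normal(size=(8, 16))   # e.g. target vectors 1-8
fused = fuse(vecs, mode="weighted", weights=np.full(8, 1 / 8))
```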
  • Step 314 Classify and predict based on the fused feature vector to obtain the traffic light perception result at the target location.
  • The traffic light perception result finally obtained by the embodiment of the present application may include, for the target location, the traffic light information of the straight, left-turn, and right-turn directions, specifically, for example, the traffic light colors of those directions.
  • any existing classification prediction method can be used to obtain the final perception result based on the fusion feature vector.
  • For example, the perception result is obtained through a classification prediction model used for classification prediction.
  • The classification prediction model can be a classifier structure with three branches, where each branch outputs a classification result predicting the color of the traffic light in one of the directions: straight, left turn, or right turn.
  • The specific structure of the classification prediction model is not limited; for example, a multi-layer perceptron model with a relatively simple structure can be used, and so on.
  • It should be noted that the feature extraction for each type of target data in step 304 can be performed based on a feature extraction model; in step 306, the feature fusion of the initial feature vectors in the feature sequence corresponding to each type of target data based on the self-attention mechanism
  • can be performed based on a self-attention model (for example, a transformer model based on a self-attention mechanism);
  • steps 308 to 310 can be performed based on a cross-attention model (for example, a transformer model based on a cross-attention mechanism);
  • step 314 can be performed based on the classification prediction model.
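How these models compose end to end might be sketched as follows; the function names are hypothetical placeholders for the trained models named above, not an API from the patent:

```python
from typing import Callable, Sequence
import numpy as np

def perceive_traffic_light(
    modal_inputs: Sequence[np.ndarray],
    extract: Callable,      # step 304: per-modality feature extraction
    self_attend: Callable,  # step 306: self-attention fusion of each feature sequence
    cross_fuse: Callable,   # steps 308-312: cross-attention fusion across modalities
    classify: Callable,     # step 314: classification prediction model
):
    sequences = [extract(x) for x in modal_inputs]
    target_vectors = [self_attend(seq) for seq in sequences]
    fused_vector = cross_fuse(target_vectors)
    return classify(fused_vector)   # e.g. colors for straight / left / right
```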
  • Therefore, after the target data are obtained, the traffic light sensing method provided by the embodiment of the present application can output the final sensing result based on a series of machine learning models.
  • In other words, the embodiment of the present application provides an end-to-end
  • traffic light sensing solution that does not require complex post-processing operations; the solution is therefore simpler and has a wider scope of application.
  • Figure 4 is a schematic diagram of a scene corresponding to Embodiment 2 of the present application. With reference to this schematic diagram, a specific scenario example is used below to illustrate the embodiment of the present application:
  • Assume the image data is 3 consecutive frames (the first, second, and third frames of image data) and the radar data is also 3 consecutive frames (the first, second, and third frames of radar data). Feature extraction is performed on each of the three frames of image data to obtain the feature sequence corresponding to the image data (in the feature sequence corresponding to the image data in Figure 4, each open circle represents the initial feature vector corresponding to one frame of image data). Feature extraction is performed separately on each of the three frames of radar data to obtain a feature sequence composed of the initial feature vectors corresponding to each frame of radar data
  • (assuming the radar data contains 3 targets in total, a pedestrian, vehicle 1, and vehicle 2, then in the feature sequence corresponding to the radar data in Figure 4, the 3 solid circles in each column represent the initial feature vectors of one frame of radar data, one solid circle represents the initial feature vector of one target in that frame of radar data, and the 3 solid circles in each row represent the initial feature vectors of the same target in different frames of radar data). For the map data, feature extraction (vectorized representation) is performed to obtain a feature sequence consisting of multiple initial feature vectors corresponding to the map data (assuming the map data contains 4 map instance objects in total, namely lane line 1, lane line 2, lane line 3, and a crosswalk, then in the map data portion of Figure 4,
  • each straight line with an arrow represents one initial feature vector:
  • lane line 1 corresponds to 2 initial feature vectors,
  • lane line 2 corresponds to 2 initial feature vectors,
  • lane line 3 corresponds to 2 initial feature vectors,
  • and the crosswalk corresponds to 4 initial feature vectors).
  • Next, the target feature vectors are obtained. Specifically: feature fusion is performed on the initial feature vectors in the feature sequence corresponding to the image data to obtain target feature vector 1 corresponding to the image data; feature fusion is performed on the initial feature vectors in the feature sequence corresponding to the radar data (feature fusion is performed separately on the initial feature vectors in each row) to obtain target feature vector 2, target feature vector 3, and target feature vector 4 corresponding to the radar data; and feature fusion is performed on the initial feature vectors in the feature sequence corresponding to the map data (feature fusion is performed separately on the initial feature vectors corresponding to the same map instance object) to obtain target feature vector 5, target feature vector 6, target feature vector 7, and target feature vector 8 corresponding to the map data. Finally, the above target feature vectors 1-8 are fused based on the cross-attention mechanism to obtain the fused feature vector, and classification prediction is then performed based on the fused feature vector to obtain the traffic light perception result: the traffic light colors corresponding to the three directions of going straight, turning left, and turning right.
  • With the traffic light sensing method, apparatus, device, and storage medium provided by the embodiments of the present application, multiple different target data of the target location are acquired, target feature vectors corresponding to the various target data are obtained, feature fusion based on the cross-attention mechanism is then performed on the target feature vectors, and traffic light perception is based on the resulting fused feature vector. In other words, in the embodiments of this application, cross-modal data fusion and comprehensive analysis and reasoning are performed based on multiple different modal data of the environment around the target location to obtain the final perception result. Therefore, compared with perception methods that rely on only a single modality such as image data, the perception of the embodiments of the present application is more stable and more accurate.
  • In addition, before the cross-modal fusion, feature fusion based on the self-attention mechanism is first performed on the initial feature vectors of multiple consecutive image frames or radar frames, and on the initial feature vectors of the different map instance objects in the map data, to obtain the target feature vectors corresponding to the different target data.
  • This self-attention-based feature fusion correlates and fuses the image or radar sequences with the historical states of the traffic participants in the surrounding environment,
  • so the target feature vectors contain richer and more important information. The subsequent logical reasoning based on these target feature vectors therefore makes the final traffic light perception result more accurate and stable.
  • the traffic light sensing method in this embodiment can be executed by any appropriate electronic device with data processing capabilities, including but not limited to: servers, PCs, etc.
  • Figure 5 is a structural block diagram of a traffic light sensing apparatus according to Embodiment 3 of the present application.
  • The traffic light sensing apparatus provided by the embodiment of the present application includes:
  • the target data acquisition module 502 is used to acquire multiple target data of the target location.
  • the multiple target data include at least two of the following: image data, radar data, and map data;
  • the target feature vector obtaining module 504 is used to extract features from various target data respectively and obtain target feature vectors corresponding to various target data;
  • the fusion module 506 is used to fuse the various target feature vectors based on the cross-attention mechanism to obtain a fused feature vector;
  • the result obtaining module 508 is used to perform classification prediction based on the fused feature vector to obtain the traffic light perception result of the target location.
  • the fusion module 506 is specifically used to:
  • the fusion module 506, before performing the fusion processing on the various updated target vectors to obtain the fused feature vector, is also used to:
  • the fusion module 506, when performing the step of fusion processing on the various updated target vectors to obtain the fused feature vector, is specifically used to:
  • the target feature vector obtaining module 504 is specifically used for:
  • the feature sequence contains a variety of initial feature vectors
  • feature fusion is performed on each initial feature vector in the feature sequence corresponding to each target data to obtain the target feature vector corresponding to each target data;
  • For image data, each initial feature vector represents the feature information contained in one frame of the multiple frames of continuous image data; for radar data, each initial feature vector represents the feature information contained in one frame of the multiple frames of continuous radar data; for map data, the multiple initial feature vectors represent the feature information of at least one map instance object.
  • when the target feature vector obtaining module 504 performs feature fusion on the initial feature vectors in the feature sequence corresponding to the target data based on the self-attention mechanism to obtain the target feature vector corresponding to the target data, it is specifically used to:
  • update the base initial vector to obtain the target feature vector corresponding to the target data.
  • when the target feature vector obtaining module 504 performs the step of selecting an initial feature vector from the initial feature vectors in the feature sequence corresponding to the target data as the base initial vector, it is specifically used to:
  • use the initial feature vector corresponding to the last frame of image data or radar data as the base initial vector.
  • when the target feature vector obtaining module 504 performs feature fusion on the initial feature vectors in the feature sequence corresponding to the target data based on the self-attention mechanism to obtain the target feature vector corresponding to the target data, it is specifically used to:
  • perform feature fusion based on the self-attention mechanism to obtain multiple self-updated feature vectors of each map instance object;
  • when obtaining the target feature vector corresponding to the image data, the target feature vector obtaining module 504 is specifically used to:
  • perform feature extraction on the image data to obtain the target feature vector corresponding to the image data;
  • the target feature vector obtaining module 504 is specifically used to:
  • the radar data is detected through the pre-trained three-dimensional target detection model to obtain the target detection results; based on the target detection results, the target feature vector corresponding to the radar data is obtained;
  • the target feature vector obtaining module 504 is specifically used to:
  • The traffic light sensing apparatus of the embodiment of the present application is used to implement the corresponding traffic light sensing method of the first or second method embodiment, and it has the beneficial effects of the corresponding method embodiment, which will not be described again here.
  • For the functional implementation of each module in the traffic light sensing apparatus of the embodiment of the present application, reference can be made to the description of the corresponding parts of the first or second method embodiment, which will likewise not be described again here.
  • Referring to Figure 6, a schematic structural diagram of an electronic device according to Embodiment 4 of the present application is shown.
  • the specific embodiments of the present application do not limit the specific implementation of the electronic device.
  • the electronic device may include: a processor (processor) 602, a communications interface (Communications Interface) 604, a memory (memory) 606, and a communication bus 608.
  • the processor 602, the communication interface 604, and the memory 606 complete communication with each other through the communication bus 608.
  • Communication interface 604 is used to communicate with other electronic devices or servers.
  • The processor 602 is used to execute the program 610, and specifically can execute the relevant steps in the above embodiments of the traffic light sensing method.
  • program 610 may include program code including computer operating instructions.
  • the processor 602 may be a CPU, or an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application.
  • the one or more processors included in the smart device can be the same type of processor, such as one or more CPUs; or they can be different types of processors, such as one or more CPUs and one or more ASICs.
  • Memory 606 is used to store program 610.
  • the memory 606 may include high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
  • The program 610 can be specifically used to cause the processor 602 to perform the following operations: acquire multiple target data of the target location, the multiple target data including at least two of the following: image data, radar data, and map data; perform feature extraction on the various target data respectively to obtain target feature vectors corresponding to the various target data; fuse the various target feature vectors based on the cross-attention mechanism to obtain a fused feature vector; and perform classification prediction based on the fused feature vector to obtain the traffic light perception result of the target location.
  • For the specific implementation of each step in the program 610, reference can be made to the corresponding steps and the corresponding descriptions of the units in the above embodiments of the traffic light sensing method, which will not be described again here.
  • Those skilled in the art can clearly understand that for the convenience and simplicity of description, the specific working processes of the above-described devices and modules can be referred to the corresponding process descriptions in the foregoing method embodiments, and will not be described again here.
  • Through the electronic device of this embodiment, multiple different target data of the target location are acquired, target feature vectors corresponding to the various target data are obtained, feature fusion based on the cross-attention mechanism is then performed on the target feature vectors, and traffic light perception is based on the resulting fused feature vector. That is to say, in the embodiments of this application, cross-modal data fusion and comprehensive analysis and reasoning are performed based on multiple different modal data of the environment around the target location to obtain the final perception result. Therefore, compared with perception methods that rely on only a single modality such as image data, the perception of the embodiments of the present application is more stable and more accurate.
  • Embodiments of the present application also provide a computer program product, including computer instructions, which instruct the computing device to perform operations corresponding to any of the traffic light sensing methods in the multiple method embodiments mentioned above.
  • It should be noted that each component/step described in the embodiments of this application can be split into more components/steps, and two or more components/steps or partial operations of components/steps can be combined into new components/steps to achieve the purpose of the embodiments of this application.
  • The above methods according to the embodiments of the present application can be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or as computer code that is originally stored on a remote recording medium or a non-transitory machine-readable medium, downloaded over a network, and stored on a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware such as an ASIC or FPGA.
  • It can be understood that a computer, processor, microprocessor controller, or programmable hardware includes storage components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the traffic light sensing methods described herein are implemented. Furthermore, when a general-purpose computer accesses code for implementing the traffic light sensing methods shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for executing those methods.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Traffic Control Systems (AREA)

Abstract

Embodiments of the present application relate to a traffic light detection method and apparatus, a device, and a storage medium. The traffic light detection method comprises: acquiring a plurality of types of target data of a target position, the plurality of types of target data comprising at least two of the following: image data, radar data, and map data; respectively performing feature extraction on the various types of target data to obtain target feature vectors corresponding to the various types of target data; performing fusion processing on the various target feature vectors on the basis of a cross-attention mechanism to obtain a fused feature vector; and performing classification prediction on the basis of the fused feature vector to obtain a traffic light detection result of the target position. According to the embodiments of the present application, cross-modal data fusion and comprehensive analysis and reasoning are performed on the basis of a plurality of types of different modal data of the environment around a target position so as to obtain a final detection result, so that the detection stability and accuracy are relatively high.
PCT/CN2023/096961 2022-05-30 2023-05-29 Traffic light detection method and apparatus, device, and storage medium WO2023231991A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210599282.6A CN114694123B (zh) 2022-05-30 2022-05-30 Traffic light sensing method, apparatus, device and storage medium
CN202210599282.6 2022-05-30

Publications (1)

Publication Number Publication Date
WO2023231991A1 (fr)

Family

ID=82144742

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/096961 WO2023231991A1 (fr) 2022-05-30 2023-05-29 Procédé et appareil de détection de lampe de signalisation routière, dispositif, et support de stockage

Country Status (2)

Country Link
CN (1) CN114694123B (fr)
WO (1) WO2023231991A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118366312A (zh) * 2024-06-19 2024-07-19 徐州市交通控股集团智能科技有限公司 Traffic detection system and method

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102652486B1 (ko) * 2021-09-24 2024-03-29 (주)오토노머스에이투지 Method for predicting traffic light information using lidar, and server using the same
CN114694123B (zh) 2022-05-30 2022-09-27 阿里巴巴达摩院(杭州)科技有限公司 Traffic light sensing method, apparatus, device and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109583415A (zh) * 2018-12-11 2019-04-05 兰州大学 Traffic light detection and recognition method based on fusion of lidar and camera
CN111563551A (zh) * 2020-04-30 2020-08-21 支付宝(杭州)信息技术有限公司 Multi-modal information fusion method and apparatus, and electronic device
KR20200102907A (ko) * 2019-11-12 2020-09-01 써모아이 주식회사 Object detection method and apparatus based on fused visible-light and infrared images
CN114254696A (zh) * 2021-11-30 2022-03-29 上海西虹桥导航技术有限公司 Visible light, infrared and radar fusion target detection method based on deep learning
US20220101087A1 (en) * 2020-09-30 2022-03-31 Qualcomm Incorporated Multi-modal representation based event localization
CN114419412A (zh) * 2022-03-31 2022-04-29 江西财经大学 Multi-modal feature fusion method and system for point cloud registration
CN114694123A (zh) * 2022-05-30 2022-07-01 阿里巴巴达摩院(杭州)科技有限公司 Traffic light sensing method, apparatus, device and storage medium

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102015214743A1 (de) * 2015-08-03 2017-02-09 Audi Ag Method and device in a motor vehicle for improved data fusion in environment detection
CN107316488B (zh) * 2017-08-23 2021-01-12 苏州豪米波技术有限公司 Signal light recognition method, apparatus and system
DE102019215440B4 (de) * 2019-10-09 2021-04-29 Zf Friedrichshafen Ag Recognition of traffic signs
CN111507210B (zh) * 2020-03-31 2023-11-21 华为技术有限公司 Traffic light recognition method and system, computing device, and intelligent vehicle
CN111652050B (zh) * 2020-04-20 2024-04-02 宁波吉利汽车研究开发有限公司 Traffic sign positioning method, apparatus, device, and medium
CN111582189B (zh) * 2020-05-11 2023-06-23 腾讯科技(深圳)有限公司 Traffic light recognition method and apparatus, vehicle-mounted control terminal, and motor vehicle
CN111950467B (zh) * 2020-08-14 2021-06-25 清华大学 Attention-mechanism-based fusion network lane line detection method and terminal device
CN112580460A (zh) * 2020-12-11 2021-03-30 西人马帝言(北京)科技有限公司 Traffic light recognition method, apparatus, device, and storage medium
CN112507947B (zh) * 2020-12-18 2024-10-18 广东宜通联云智能信息有限公司 Gesture recognition method, apparatus, device, and medium based on multi-modal fusion
CN112488083B (zh) * 2020-12-24 2024-04-05 杭州电子科技大学 Traffic light recognition method, apparatus, and medium based on key points extracted from heatmaps
CN112861748B (zh) * 2021-02-22 2022-07-12 奥特酷智能科技(南京)有限公司 Traffic light detection system and method for autonomous driving
CN113065590B (zh) * 2021-03-26 2021-10-08 清华大学 Vision and lidar multi-modal data fusion method based on attention mechanism
CN113343849A (zh) * 2021-06-07 2021-09-03 西安恒盛安信智能技术有限公司 Fusion perception device based on radar and video
CN113421305B (zh) * 2021-06-29 2023-06-02 上海高德威智能交通系统有限公司 Target detection method, apparatus, system, electronic device, and storage medium
CN113269156B (zh) * 2021-07-02 2023-04-18 昆明理工大学 Signal light detection and recognition method and system based on multi-scale feature fusion
CN114398937B (zh) * 2021-12-01 2022-12-27 北京航空航天大学 Image-lidar data fusion method based on hybrid attention mechanism
CN113879339A (zh) * 2021-12-07 2022-01-04 阿里巴巴达摩院(杭州)科技有限公司 Decision planning method for autonomous driving, electronic device, and computer storage medium
CN114549542A (zh) * 2021-12-24 2022-05-27 阿里巴巴达摩院(杭州)科技有限公司 Visual semantic segmentation method, apparatus, and device



Also Published As

Publication number Publication date
CN114694123B (zh) 2022-09-27
CN114694123A (zh) 2022-07-01

Similar Documents

Publication Publication Date Title
WO2023231991A1 (fr) Traffic light detection method and apparatus, device, and storage medium
JP6709283B2 (ja) 低解像度リモートセンシング画像を用いた移動車両の検出及び分析
CN109063768B (zh) 车辆重识别方法、装置及系统
CN110148196B (zh) 一种图像处理方法、装置以及相关设备
US11373067B2 (en) Parametric top-view representation of scenes
US10074020B2 (en) Vehicular lane line data processing method, apparatus, storage medium, and device
Ojha et al. Vehicle detection through instance segmentation using mask R-CNN for intelligent vehicle system
WO2018084942A1 (fr) Apprentissage en profondeur par corrélation croisée destiné au suivi d'objet
Šegvić et al. A computer vision assisted geoinformation inventory for traffic infrastructure
CN111222395A (zh) 目标检测方法、装置与电子设备
JP2016062610A (ja) 特徴モデル生成方法及び特徴モデル生成装置
CN112949366B (zh) 障碍物识别方法和装置
CN111008576B (zh) 行人检测及其模型训练、更新方法、设备及可读存储介质
CN110853085B (zh) 基于语义slam的建图方法和装置及电子设备
CN115797736B (zh) 目标检测模型的训练和目标检测方法、装置、设备和介质
CN111767831B (zh) 用于处理图像的方法、装置、设备及存储介质
CN110909656B (zh) 一种雷达与摄像机融合的行人检测方法和系统
CN113971795A (zh) Violation inspection system based on self-driving vehicle visual sensing and method thereof
CN111597986A (zh) 用于生成信息的方法、装置、设备和存储介质
Liu et al. Real-time traffic light recognition based on smartphone platforms
Qing et al. A novel particle filter implementation for a multiple-vehicle detection and tracking system using tail light segmentation
CN112529917A (zh) 一种三维目标分割方法、装置、设备和存储介质
CN116343143A (zh) 目标检测方法、存储介质、路侧设备及自动驾驶系统
Prakash et al. Multiple Objects Identification for Autonomous Car using YOLO and CNN
CN114429631B (zh) 三维对象检测方法、装置、设备以及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23815170

Country of ref document: EP

Kind code of ref document: A1