WO2023216654A1 - Multi-view semantic segmentation method, device, electronic device and storage medium

Info

Publication number
WO2023216654A1
Authority
WO
WIPO (PCT)
Prior art keywords
semantic segmentation
features
fused
segmentation features
image data
Prior art date
Application number
PCT/CN2023/074402
Other languages
English (en)
French (fr)
Inventor
王梦圆
朱红梅
张骞
Original Assignee
北京地平线机器人技术研发有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京地平线机器人技术研发有限公司
Publication of WO2023216654A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features

Definitions

  • the present disclosure relates to computer vision technology, and in particular, to a multi-view semantic segmentation method, device, electronic device and storage medium.
  • Embodiments of the present disclosure provide a multi-view semantic segmentation method, device, electronic device and storage medium.
  • a multi-view semantic segmentation method, including: determining first image data respectively corresponding to at least two perspectives of a first type, to obtain at least two first image data; determining first semantic segmentation features under a second type of perspective respectively corresponding to the at least two first image data, to obtain at least two first semantic segmentation features; fusing the at least two first semantic segmentation features to obtain fused semantic segmentation features; and obtaining a fused semantic segmentation result based on the fused semantic segmentation features.
  • a multi-view semantic segmentation device, including: a first determination module, configured to determine first image data respectively corresponding to at least two perspectives of a first type, to obtain at least two first image data; a first processing module, configured to determine first semantic segmentation features under a second type of perspective respectively corresponding to the at least two first image data, to obtain at least two first semantic segmentation features; a first fusion module, configured to fuse the at least two first semantic segmentation features to obtain fused semantic segmentation features; and a second processing module, configured to obtain a fused semantic segmentation result based on the fused semantic segmentation features.
  • a computer-readable storage medium stores a computer program, the computer program is used to execute the multi-view semantic segmentation method described in any of the above embodiments of the present disclosure.
  • an electronic device, including: a processor; and a memory for storing instructions executable by the processor; the processor is configured to read the executable instructions from the memory and execute them to implement the multi-view semantic segmentation method described in any of the above embodiments of the present disclosure.
  • using mid-fusion, the semantic segmentation features under the second type of perspective (the bird's-eye perspective) are determined based on the image data corresponding to first-type perspectives such as the camera perspective and the radar perspective.
  • these semantic segmentation features are fused in the feature stage to obtain the fused semantic segmentation features under the bird's-eye perspective.
  • the fused semantic segmentation result is determined based on the fused semantic segmentation features, so that end-to-end multi-view semantic segmentation is achieved through mid-fusion using only cameras, radars and similar sensors.
  • the multi-view semantic segmentation result requires no post-processing, which effectively reduces processing time and thus the delay of assisted driving, and solves the problem that the existing technology must transmit results to a post-processing module for post-processing, causing large delays.
  • Figure 1 is an exemplary application scenario of the multi-view semantic segmentation method provided by the present disclosure
  • Figure 2 is a schematic flowchart of a multi-view semantic segmentation method provided by an exemplary embodiment of the present disclosure
  • Figure 3 is a schematic diagram of the fusion of first semantic segmentation features provided by an exemplary embodiment of the present disclosure
  • Figure 4 is a schematic flowchart of a multi-view semantic segmentation method provided by an exemplary embodiment of the present disclosure
  • Figure 5 is a schematic flowchart of step 202 provided by an exemplary embodiment of the present disclosure.
  • Figure 6 is a schematic diagram of the training process of the first semantic segmentation network model provided by an exemplary embodiment of the present disclosure
  • Figure 7 is a schematic diagram of the training process of the second semantic segmentation network model provided by an exemplary embodiment of the present disclosure
  • Figure 8 is a schematic diagram of the principle of fusion of two first semantic segmentation features provided by an exemplary embodiment of the present disclosure
  • Figure 9 is a schematic flowchart of step 203 provided by an exemplary embodiment of the present disclosure.
  • Figure 10 is a schematic structural diagram of a multi-view semantic segmentation device provided by an exemplary embodiment of the present disclosure
  • Figure 11 is a schematic structural diagram of the first processing module 502 provided by an exemplary embodiment of the present disclosure.
  • Figure 12 is a schematic structural diagram of a multi-view semantic segmentation device provided by another exemplary embodiment of the present disclosure.
  • Figure 13 is a schematic structural diagram of the first fusion module 503 provided by an exemplary embodiment of the present disclosure.
  • Figure 14 is a schematic structural diagram of an application embodiment of the electronic device of the present disclosure.
  • “plurality” may refer to two or more, and “at least one” may refer to one, two, or more.
  • Embodiments of the present disclosure may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which may operate with numerous other general or special purpose computing system environments or configurations.
  • Examples of well-known terminal devices, computing systems, environments and/or configurations suitable for use with terminal devices, computer systems, servers and other electronic devices include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems.
  • the inventor found that in computer vision fields such as autonomous driving, in order to assist planning and control, image data of the surroundings is usually collected from multiple perspectives by cameras mounted on the movable device. The image data from each perspective is then semantically segmented by a neural network model, and the semantic segmentation results for each perspective are transmitted to a post-processing module for post-processing, such as filtering and fusion, to obtain the semantic information of the environment surrounding the movable device.
  • the post-processing process of the existing post-fusion method takes a long time, which brings a large delay to assisted driving.
  • Figure 1 is an exemplary application scenario of the multi-view semantic segmentation method provided by the present disclosure.
  • the movable device is a vehicle equipped with cameras at four perspectives: front, rear, left, and right.
  • here the camera perspective is taken as an example of the first type of perspective, and the bird's-eye perspective as the second type of perspective.
  • the cameras at the four perspectives collect image data from the front, rear, left and right of the vehicle and transmit it to a multi-view semantic segmentation device that executes the multi-view semantic segmentation method of the present disclosure.
  • based on the image data of each camera perspective, the semantic segmentation features under the bird's-eye perspective corresponding to that camera perspective can be determined; the semantic segmentation features under the bird's-eye perspective for all camera perspectives are then fused to obtain the fused semantic segmentation features, from which the fused semantic segmentation result under the bird's-eye perspective is determined.
  • semantic segmentation results can include segmentation results belonging to ground areas, segmentation results belonging to lane lines, etc., and are not specifically limited.
  • Embodiments of the present disclosure achieve end-to-end multi-view semantic segmentation through fusion in the feature stage, without post-processing. This effectively reduces processing time and thus the delay of assisted driving, improves accuracy, and solves the problem that the existing technology must transmit results to a post-processing module for post-processing, causing large delays.
  • Figure 2 is a schematic flowchart of a multi-view semantic segmentation method provided by an exemplary embodiment of the present disclosure. This embodiment can be applied to an electronic device, for example an in-vehicle computing platform. As shown in Figure 2, the method includes the following steps:
  • Step 201 Determine first image data corresponding to at least two first-type viewing angles, and obtain at least two first image data.
  • the first type of perspective may be a camera perspective, a radar perspective, or the perspective of another sensor that collects environmental information around the movable device.
  • each camera corresponds to a perspective.
  • for the camera perspective, the first image data corresponding to the above-mentioned at least two first-type perspectives can be obtained from at least two cameras, each first image data corresponding to one first-type perspective; for the radar perspective, the collected three-dimensional point cloud data can be converted into two-dimensional image data to obtain the at least two first image data. The details can be set according to actual needs.
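  • The disclosure does not say how a radar point cloud is turned into two-dimensional image data. As a rough illustration only, the sketch below rasterizes points into a top-down grid that stores the maximum height per cell. The function name, the ranges, the cell size, and the max-height encoding are all assumptions made for this example, not details from the source.

```python
def point_cloud_to_bev(points, x_range=(-10.0, 10.0), y_range=(-10.0, 10.0), cell=0.5):
    """Rasterize (x, y, z) points into a 2D grid holding the max height per cell.

    Hypothetical helper: cells start at 0.0, so non-positive heights are ignored.
    """
    cols = int((x_range[1] - x_range[0]) / cell)
    rows = int((y_range[1] - y_range[0]) / cell)
    grid = [[0.0] * cols for _ in range(rows)]
    for x, y, z in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue  # point lies outside the rasterized range
        col = int((x - x_range[0]) / cell)
        row = int((y - y_range[0]) / cell)
        grid[row][col] = max(grid[row][col], z)
    return grid


# Made-up sample points: two fall in the same cell, one is out of range.
cloud = [(0.2, 0.3, 1.5), (0.4, 0.1, 0.7), (-9.9, -9.9, 0.2), (50.0, 0.0, 3.0)]
bev_image = point_cloud_to_bev(cloud)
```

  • Other encodings (intensity, point density, several height slices) would fit the same description equally well.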
  • cameras with 4 or 6 viewing angles are required to cover the collection of images of the surrounding environment of the vehicle, and the first image data corresponding to the number of viewing angles (4 or 6) can be obtained at each moment.
  • Step 202 Determine the first semantic segmentation features under the second type of perspective corresponding to at least two first image data, and obtain at least two first semantic segmentation features.
  • the bird's-eye perspective is the perspective of a bird flying in the sky.
  • an image from the bird's-eye perspective is called a bird's-eye view (BEV) image.
  • from the bird's-eye perspective, a global image of a certain range around the movable device can be obtained.
  • each first image data yields a corresponding first semantic segmentation feature under the second type of perspective, so at least two first image data yield at least two first semantic segmentation features under the second type of perspective.
  • the specific number of viewing angles of the first type of viewing angle can be set according to actual needs, and is not limited in this disclosure. For example, from the front, rear, left, and right perspectives of an autonomous vehicle, four first semantic segmentation features from the second type of perspective can be obtained. The details will not be described again.
  • the first semantic segmentation feature can be obtained through feature extraction under the first type of perspective followed by perspective conversion. For example, feature extraction is first performed on the first image data under the first type of perspective to obtain the semantic segmentation features under the first type of perspective; then, based on the coordinate transformation relationship between the first type of perspective and the second type of perspective, these semantic segmentation features are converted to the second type of perspective, for example by inverse perspective mapping (IPM). This is not specifically limited.
  • Step 203 Fuse at least two first semantic segmentation features to obtain fused semantic segmentation features.
  • the feature map of a first semantic segmentation feature under the bird's-eye perspective is a global feature map covering a certain range around the movable device; that is, the first semantic segmentation feature includes pixels over the global range. For each first-type perspective, only the pixel area corresponding to that perspective under the second type of perspective has valid feature values in the corresponding first semantic segmentation feature, and the feature values of the other pixel areas are 0. After the at least two first semantic segmentation features are fused, every pixel area of the fused semantic segmentation features has valid feature values.
  • Figure 3 is a schematic diagram of the fusion of first semantic segmentation features provided by an exemplary embodiment of the present disclosure.
  • the first semantic segmentation feature under the bird's-eye perspective corresponding to each image essentially covers the areas corresponding to the front, rear, left, and right views.
  • for the front view, only the pixels of the gray front-view area in the corresponding first semantic segmentation feature have feature values, which are extracted and converted from the front-view camera image.
  • in the other areas the feature value is 0 or another representation, which can be set according to actual needs.
  • the fused semantic segmentation features combine the first semantic segmentation features of all perspectives to form global semantic segmentation features within a certain range around the vehicle.
  • this simple example only illustrates the relationship between the first semantic segmentation features and the fused semantic segmentation features and is not limiting.
  • the shape and size of the corresponding areas in the first semantic segmentation features of different perspectives may be the same or different.
  • the fusion method and the representation of the fused semantic segmentation features may also take other forms, such as merging the first semantic segmentation features through concatenation (concat), which is not limited by this disclosure.
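  • As a minimal sketch of the concatenation alternative, assuming each first semantic segmentation feature is stored as a same-sized H x W x C nested list: the per-pixel channel vectors of all views are joined along the channel axis. The function name and the toy values are illustrative, not from the source.

```python
def concat_fuse(feature_maps):
    """Concatenate same-sized H x W x C feature maps along the channel axis."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    for fmap in feature_maps:
        if len(fmap) != height or any(len(row) != width for row in fmap):
            raise ValueError("all feature maps must share the same height and width")
    return [
        [[v for fmap in feature_maps for v in fmap[r][c]] for c in range(width)]
        for r in range(height)
    ]


# Toy 2 x 2 x 1 features: each view is non-zero only in its own area.
front = [[[1.0], [0.0]], [[2.0], [0.0]]]
rear = [[[0.0], [3.0]], [[0.0], [4.0]]]
fused = concat_fuse([front, rear])  # 2 x 2 x 2
```

  • Unlike addition, concatenation keeps the views separable, at the cost of a channel count that grows with the number of views.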
  • Step 204 Obtain a fused semantic segmentation result based on the fused semantic segmentation features.
  • the fused semantic segmentation results may include segmentation types and corresponding segmentation areas set according to actual needs. For example, the ground area, lane line area, etc. can be set according to actual needs.
  • any implementable method can be used to obtain the fused semantic segmentation results.
  • any implementable trained semantic segmentation network model can be used to perform semantic segmentation on the fused semantic segmentation features to obtain the fused semantic segmentation results. The details can be set according to actual needs.
  • the fused semantic segmentation results of the present disclosure can be used in scenarios such as positioning, navigation, and planning control.
  • the multi-view semantic segmentation method provided in this embodiment uses mid-fusion: the semantic segmentation features under the second type of perspective (the bird's-eye perspective) are determined based on the image data corresponding to first-type perspectives such as the camera perspective and the radar perspective, fused in the feature stage to obtain the fused semantic segmentation features under the bird's-eye perspective, and then used to determine the fused semantic segmentation result. End-to-end multi-view semantic segmentation is thus achieved through mid-fusion using only cameras, radars and similar sensors, without post-processing. This effectively reduces processing time and therefore the delay of assisted driving, and solves the problem that the existing technology must transmit results to a post-processing module for post-processing, resulting in large delays.
  • Figure 4 is a schematic flowchart of a multi-view semantic segmentation method provided by an exemplary embodiment of the present disclosure.
  • step 202 may specifically include the following steps:
  • Step 2021 Perform feature extraction on at least two first image data respectively, determine second semantic segmentation features under the first type of perspective corresponding to the at least two first image data, and obtain at least two second semantic segmentation features.
  • the feature extraction of the first image data can be performed in any implementable manner.
  • feature extraction can be performed based on a trained feature extraction network model, or based on the feature extraction network part of the trained first semantic segmentation network model, which can be set according to actual needs.
  • Each first image data obtains a corresponding second semantic segmentation feature.
  • Step 2022 Convert at least two second semantic segmentation features to the coordinate system corresponding to the second type of perspective, respectively, to obtain at least two first semantic segmentation features.
  • the coordinate system corresponding to the second type of perspective can be the self-coordinate system of the movable device (such as the vehicle coordinate system), the world coordinate system, or the map coordinate system, which can be set according to actual needs.
  • taking the camera perspective as the first type of perspective and the vehicle coordinate system as the coordinate system of the second type of perspective as an example, the coordinate system corresponding to the camera perspective is the image coordinate system.
  • the conversion relationship between the image coordinate system and the vehicle coordinate system can be determined based on the intrinsic and extrinsic parameters of the camera and the coordinates of preset points in the image coordinate system obtained in advance.
  • specifically, the homography transformation matrix corresponding to the camera perspective can be determined from the camera's intrinsic and extrinsic parameters and the pre-obtained preset point coordinates in the image coordinate system, and the second semantic segmentation feature can be converted to the second type of perspective based on the homography transformation matrix.
  • the conversion can also be performed through other conversion methods, which are not limited in this embodiment.
  • Each second semantic segmentation feature can obtain a corresponding first semantic segmentation feature.
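  • The conversion in Step 2022 could be sketched as follows, assuming a 3x3 homography is already available. The sketch uses inverse mapping with nearest-neighbour sampling: each bird's-eye pixel looks up its source pixel in the camera-view feature map, and pixels that fall outside the source stay at 0. The function name, the direction of the matrix (bird's-eye pixel to image pixel), and the sampling scheme are assumptions for illustration.

```python
def warp_to_bev(feature, h_bev_to_img, out_h, out_w):
    """Fill an out_h x out_w map by sampling `feature` through a 3x3 homography.

    `h_bev_to_img` maps a bird's-eye pixel (u, v, 1) to homogeneous image
    coordinates (x, y, w). Pixels with no valid source keep the value 0.0.
    """
    src_h, src_w = len(feature), len(feature[0])
    out = [[0.0] * out_w for _ in range(out_h)]
    for v in range(out_h):
        for u in range(out_w):
            x, y, w = (m[0] * u + m[1] * v + m[2] for m in h_bev_to_img)
            if w == 0:
                continue  # point at infinity, nothing to sample
            col, row = round(x / w), round(y / w)
            if 0 <= row < src_h and 0 <= col < src_w:
                out[v][u] = feature[row][col]
    return out


# Toy example: a pure shift, so bird's-eye pixel (u, v) reads image pixel (u - 1, v).
camera_feature = [[5.0, 6.0], [7.0, 8.0]]
shift = [[1, 0, -1], [0, 1, 0], [0, 0, 1]]
bev_feature = warp_to_bev(camera_feature, shift, out_h=2, out_w=3)
```

  • The zeros left in the unmapped column correspond to the "other pixel areas" whose feature values are 0 in each per-view feature map.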
  • This disclosure combines feature extraction under the first type of perspective with perspective conversion to obtain the first semantic segmentation features under the second type of perspective corresponding to the first image data under the first type of perspective. Perspective conversion is thus performed in the feature stage, which facilitates the subsequent feature fusion under the second type of perspective and thereby achieves mid-fusion.
  • Figure 5 is a schematic flowchart of step 202 provided by an exemplary embodiment of the present disclosure.
  • step 2021 (performing feature extraction on at least two first image data respectively, determining the second semantic segmentation features under the first type of perspective corresponding to the at least two first image data, and obtaining at least two second semantic segmentation features) includes:
  • Step 20211 Perform feature extraction on at least two first image data based on the first semantic segmentation network model obtained through pre-training to obtain at least two second semantic segmentation features.
  • the first semantic segmentation network model can adopt any implementable network structure, such as semantic segmentation network models based on FCN (Fully Convolutional Networks), UNet, or DeepLab, and their variants.
  • the training of the first semantic segmentation network model uses segmentation type label data for supervision.
  • the feature map output before the last normalization layer (such as the softmax layer) of the first semantic segmentation network model can be used as the extracted second semantic segmentation feature.
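  • To make the distinction concrete, the toy sketch below treats the network's pre-normalization output as a grid of per-class scores. The scores themselves are what would be passed on as the second semantic segmentation feature; the softmax output is only what a per-view segmentation head would report. The function names and the two-class example are assumptions, and no real network is involved.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(pixel_logits):
    """Return (features, probabilities) for an H x W grid of per-class scores."""
    features = pixel_logits  # output before the normalization layer
    probs = [[softmax(px) for px in row] for row in pixel_logits]
    return features, probs


# Hypothetical 1 x 2 output with two classes per pixel.
features, probs = forward([[[2.0, 0.0], [0.0, 0.0]]])
```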
  • FIG. 6 is a schematic diagram of the training process of the first semantic segmentation network model provided by an exemplary embodiment of the present disclosure.
  • the first semantic segmentation network model is obtained by:
  • Step 301 Determine first training data.
  • the first training data includes training image data under the first type of perspective and corresponding first label data.
  • the training image data under the first type of perspective may include image data from multiple perspectives.
  • the first label data includes the first preset semantic segmentation type label to which each pixel of each training image data belongs.
  • the first preset semantic segmentation type may be set according to actual requirements; for example, it may include at least one of ground type, curb type, lane line type, vehicle type and other possible types.
  • the first preset semantic segmentation type label can be represented in any implementable manner, for example by the numbers 0, 1, 2, 3, and so on. When there is only one segmentation type, the segmentation type of each pixel can be represented by 0 or 1, where 0 means the pixel does not belong to this type and 1 means it does. This is not specifically limited.
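  • The two label representations mentioned above can be illustrated as follows. The class numbering and the 2 x 3 label map are made up for the example; the source only says that numbers such as 0, 1, 2, 3 may be used.

```python
# Hypothetical numbering of segmentation types.
GROUND, CURB, LANE_LINE, VEHICLE = 0, 1, 2, 3

# Multi-class label map: one integer type label per pixel.
label_map = [
    [GROUND, LANE_LINE, GROUND],
    [CURB, LANE_LINE, VEHICLE],
]


def binary_mask(labels, target):
    """Single-type representation: 1 if the pixel belongs to `target`, else 0."""
    return [[1 if v == target else 0 for v in row] for row in labels]


lane_mask = binary_mask(label_map, LANE_LINE)
```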
  • the setting of the first preset semantic segmentation type can be set according to the fusion semantic segmentation requirements under the second type of perspective.
  • for example, the first preset semantic segmentation types can be set to types at the same height as the ground, such as lane line, sidewalk, curb, stop line, and arrow sign types, as well as the ground type other than road markings; the details can be set according to actual needs.
  • Step 302 Train the pre-established first semantic segmentation network based on the first training image data and the first label data to obtain a first semantic segmentation network model.
  • any implementable loss function can be used, such as cross-entropy loss function, focal loss function (focal loss), etc.
  • the first label data can be obtained in any implementable manner. Specifically, the first training image data is used as the input of the first semantic segmentation network to obtain the corresponding first output data; the current loss is determined based on the first output data, the corresponding first label data and the first loss function; the network parameters are adjusted based on the current loss and the next iteration begins; and so on, until the current loss converges and the first semantic segmentation network model is obtained.
  • the specific training principles will not be described again.
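  • The two loss functions named above, and the "iterate until the loss converges" loop, can be sketched on a deliberately tiny stand-in: a single pixel whose class scores are the only trainable parameters. This is not the network training of the disclosure; the learning rate, tolerance, and function names are assumptions chosen so the example runs with the standard library alone.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cross-entropy loss for one pixel with an integer class label."""
    return -math.log(probs[label])


def focal_loss(probs, label, gamma=2.0):
    """Focal loss: cross-entropy down-weighted by (1 - p_t) ** gamma."""
    p_t = probs[label]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def train_pixel(label, num_classes=3, lr=0.5, tol=1e-6, max_iters=10000):
    """Gradient descent on one pixel's scores until the loss stops changing."""
    logits = [0.0] * num_classes
    prev = float("inf")
    loss = prev
    for _ in range(max_iters):
        probs = softmax(logits)
        loss = cross_entropy(probs, label)
        if abs(prev - loss) < tol:
            break  # treat a negligible change as convergence
        prev = loss
        # Gradient of softmax cross-entropy w.r.t. the scores: probs - one_hot.
        grad = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
        logits = [z - lr * g for z, g in zip(logits, grad)]
    return logits, loss


trained_logits, final_loss = train_pixel(label=2)
```

  • In a real setup the loss would be averaged over all pixels and images, and the gradient would flow into the network weights rather than into the scores directly.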
  • step 2022 converts at least two second semantic segmentation features into the coordinate system corresponding to the second type of perspective to obtain at least two first semantic segmentation features, including:
  • Step 20221 Based on the preset point coordinates in the image coordinate systems corresponding to at least two first-type perspectives and the pre-obtained camera parameters, determine the homography transformation matrices corresponding to the at least two first-type perspectives, to obtain at least two homography transformation matrices.
  • the preset point coordinates can include the coordinates of 4 points. Taking one camera as an example, these are the coordinates of 4 points on the ground in the camera's image coordinate system, denoted I_img.
  • Step 20222 Based on at least two homography transformation matrices, convert at least two second semantic segmentation features into the coordinate system corresponding to the second type of perspective, respectively, to obtain at least two first semantic segmentation features.
  • based on each homography transformation matrix, the second semantic segmentation feature of the corresponding perspective can be converted to the coordinate system corresponding to the second type of perspective to obtain the corresponding first semantic segmentation feature.
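  • This extract does not show how the homography is computed from the four preset points. One standard way, given here only as a sketch, is to pair the four image points with their target coordinates and solve the resulting 8 x 8 linear system with the last matrix entry fixed to 1. How the target coordinates follow from the camera parameters is not shown; the point values, function names, and the normalization are assumptions for illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_points(src, dst):
    """Return 3x3 H with H (x, y, 1) ~ (u, v, 1) for four (x, y) -> (u, v) pairs."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h, point):
    """Map one (x, y) point through H and de-homogenize."""
    x, y = point
    u, v, w = (r[0] * x + r[1] * y + r[2] for r in h)
    return u / w, v / w


# Made-up example: a trapezoid of ground points in the image (standing in
# for I_img) mapped to a rectangle in the target coordinate system.
image_points = [(100, 300), (300, 300), (0, 500), (400, 500)]
target_points = [(0, 0), (10, 0), (0, 20), (10, 20)]
H = homography_from_points(image_points, target_points)
```

  • Fixing the last entry to 1 fails when the image origin maps to infinity; a production implementation would more likely use a library routine with a more robust normalization.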
  • step 204 obtains a fused semantic segmentation result based on the fused semantic segmentation features, including:
  • Step 2041 Obtain the fused semantic segmentation result based on the fused semantic segmentation features and the second semantic segmentation network model obtained by pre-training.
  • the second semantic segmentation network model can use any implementable semantic segmentation network model, such as models based on FCN (Fully Convolutional Networks), UNet, or DeepLab, and their variants.
  • the input of the second semantic segmentation network model is the fused semantic segmentation feature, that is, the fused semantic segmentation feature under the bird's-eye perspective.
  • FIG. 7 is a schematic diagram of the training process of the second semantic segmentation network model provided by an exemplary embodiment of the present disclosure.
  • the second semantic segmentation network model is obtained by:
  • Step 401 Determine second training data.
  • the second training data includes training semantic segmentation feature data from a second type of perspective and corresponding second label data.
  • the training semantic segmentation feature data under the second type of perspective is the training fusion semantic segmentation feature data after multi-view fusion.
  • the second label data includes a second preset semantic segmentation type to which each pixel in the training semantic segmentation feature data belongs.
  • the second preset semantic segmentation type is similar to the first preset semantic segmentation type and will not be described again here.
  • Step 402 Based on the training semantic segmentation feature data and the second label data, train the pre-established second semantic segmentation network to obtain a second semantic segmentation network model.
  • the training semantic segmentation feature data is used as the input of the second semantic segmentation network, the second label data is used as supervision, and the network parameters are adjusted through the loss until the loss converges, yielding the second semantic segmentation network model.
  • the specific training process will not be described again.
  • the loss function during the training process can use any implementable loss function, such as cross-entropy loss function, focal loss function (focal loss), etc.
  • the second label data can be automatically generated based on a high-definition map or radar projection.
  • for example, when the coordinate system corresponding to the second type of perspective is the vehicle coordinate system, the global information around the vehicle position can be determined in the high-definition map.
  • the semantic segmentation type of each position in the high-definition map is known. Based on the conversion relationship between the vehicle coordinate system and the high-definition map coordinate system, the segmentation type to which each pixel of the training semantic segmentation feature data belongs can be obtained from the high-definition map, thereby automatically obtaining the second label data.
  • radar projection is similar to the high-definition map approach: it determines the three-dimensional information around the vehicle, and based on the conversion relationship between the vehicle coordinate system and the radar coordinate system, the segmentation type of each pixel in the training semantic segmentation feature data can be determined, giving the second label data.
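  • The map-based labelling described above can be sketched as a chain of two coordinate conversions followed by a lookup. The grid layout (vehicle at the grid center, row 0 at the front, column 0 on the left), the planar pose `(x, y, yaw)`, and the `toy_map_lookup` function are all hypothetical; a real high-definition map would expose its own query interface.

```python
import math


def bev_pixel_to_vehicle(row, col, size, cell):
    """Vehicle-frame (x forward, y left) of a pixel center; vehicle at grid center."""
    x = (size / 2 - row - 0.5) * cell
    y = (size / 2 - col - 0.5) * cell
    return x, y


def vehicle_to_map(point, pose):
    """Rigid 2D transform from the vehicle frame to the map frame."""
    x, y = point
    px, py, yaw = pose
    return (px + x * math.cos(yaw) - y * math.sin(yaw),
            py + x * math.sin(yaw) + y * math.cos(yaw))


def generate_labels(size, cell, pose, map_lookup):
    """Label every bird's-eye pixel with the type the map reports at its location."""
    return [
        [map_lookup(*vehicle_to_map(bev_pixel_to_vehicle(r, c, size, cell), pose))
         for c in range(size)]
        for r in range(size)
    ]


def toy_map_lookup(map_x, map_y):
    """Stand-in for a map query: a lane-line strip (label 1) at 5 <= y < 6."""
    return 1 if 5.0 <= map_y < 6.0 else 0


labels = generate_labels(size=4, cell=1.0, pose=(10.0, 5.0, 0.0), map_lookup=toy_map_lookup)
```

  • With the vehicle heading along the map x axis, the strip appears as one column of the label grid; turning the vehicle by 90 degrees makes it appear as a row instead.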
  • step 203 fuses at least two first semantic segmentation features to obtain the fused semantic segmentation features, including:
  • Step 2031a Add the feature values of the same pixel positions in at least two first semantic segmentation features to obtain the fused semantic segmentation features.
  • Each first semantic segmentation feature is a feature map of the same size, for example 512*512*1.
  • The first semantic segmentation feature corresponding to each first-type perspective contains feature values in the area covered by that perspective, and the feature value in all other areas is 0. Therefore, the feature values at the same pixel position in the first semantic segmentation features of the different perspectives can be added together, which merges the feature values of the different first-type perspective areas into one feature map and forms global semantic segmentation features containing multi-view information.
  • Each view area includes multiple pixels.
  • In the first semantic segmentation feature of each perspective, each pixel in the corresponding view area has a feature value, and the feature values of pixels in other areas are 0.
  • In the fused semantic segmentation features, every pixel has a feature value: the feature values of all view areas are combined to form a global semantic segmentation feature map under the second-type perspective. The details will not be described again.
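The element-wise addition of Step 2031a can be sketched as follows. This is a minimal illustration using nested lists for single-channel feature maps; the function name is an assumption, and a real implementation would more likely add tensors directly.

```python
def fuse_by_addition(feature_maps):
    """Add same-sized single-channel feature maps pixel by pixel.

    Each map is assumed to be zero outside the area its own view covers,
    so the sum places every view's values into one global map.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    for fm in feature_maps:
        if len(fm) != height or any(len(row) != width for row in fm):
            raise ValueError("all feature maps must have the same size")
    fused = [[0.0] * width for _ in range(height)]
    for fm in feature_maps:
        for r in range(height):
            for c in range(width):
                fused[r][c] += fm[r][c]
    return fused
```

For example, a front-view map that is non-zero only in the top row and a rear-view map that is non-zero only in the bottom row combine into a map that is non-zero everywhere.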
  • FIG. 8 is a schematic diagram of the principle of fusion of two first semantic segmentation features provided by an exemplary embodiment of the present disclosure.
  • In overlapping areas, where two first semantic segmentation features both have non-zero values at the same pixel position, the feature values can also be added directly. This is because the first semantic segmentation features and the fused semantic segmentation features are intermediate feature maps in the end-to-end process from the image data of the first-type perspective to the semantic segmentation result of the second-type perspective.
  • When semantic segmentation is performed with the second semantic segmentation network model, the fused semantic segmentation features still pass through multiple network layers, and the same fusion method is used during model training, so the model learns to compensate for the errors the addition may introduce, which preserves model accuracy.
  • Therefore, directly adding the feature values does not affect the semantic segmentation result under the second-type perspective.
  • FIG. 9 is a schematic flowchart of step 203 provided by an exemplary embodiment of the present disclosure.
  • step 203 includes:
  • Step 2031b: In response to there being at most one non-zero feature value among the feature values of the at least two first semantic segmentation features at the same pixel position, add the feature values at that pixel position and use the sum as the fused feature value of the pixel position.
  • When a pixel position has at most one non-zero feature value, the pixel lies in a non-overlapping area. The feature values can therefore be added directly to give the fused feature value of the pixel position; see the additive fusion described above. The details will not be repeated.
  • Step 2032b in response to at least two non-zero feature values among the feature values of the same pixel position of the at least two first semantic segmentation features, average the feature values of the pixel position according to the number of non-zero feature values, as the pixel The fused feature value of the location.
  • When at least two first semantic segmentation features have non-zero feature values at the same pixel position, those features overlap at that pixel position, and the mean of the feature values can be used as the fused feature value.
  • Averaging by the number of non-zero feature values means that when there are N non-zero feature values at the pixel position, the N values are summed and the sum is divided by N; the resulting mean is the fused feature value of the pixel position.
  • Step 2033b Obtain fused semantic segmentation features based on the fused feature values of each pixel position.
  • the fused feature value of each pixel position can be obtained, thereby obtaining the fused semantic segmentation features.
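Steps 2031b to 2033b can be sketched as one per-pixel rule: sum when at most one view contributes, otherwise average over the contributing views. This is a minimal illustration with nested lists for single-channel maps of equal size; the function name is an assumption.

```python
def fuse_with_overlap_mean(feature_maps):
    """Fuse same-sized feature maps, averaging where views overlap."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    fused = []
    for r in range(height):
        fused_row = []
        for c in range(width):
            values = [fm[r][c] for fm in feature_maps]
            nonzero = [v for v in values if v != 0]
            if len(nonzero) <= 1:
                # Step 2031b: non-overlapping pixel, plain sum.
                fused_row.append(sum(values))
            else:
                # Step 2032b: overlapping pixel, mean over the N non-zero values.
                fused_row.append(sum(nonzero) / len(nonzero))
        fused.append(fused_row)
    # Step 2033b: the per-pixel results together form the fused feature map.
    return fused
```

Compared with plain addition, this keeps the magnitude of overlapping pixels in the same range as non-overlapping ones.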
  • The embodiments of the present disclosure realize end-to-end multi-view semantic segmentation through mid-fusion at the feature stage. No post-processing is needed, which effectively reduces processing time and therefore driving-assistance latency, and addresses the large delay that the prior art incurs by transmitting results to a post-processing module for post-processing.
  • In addition, with the late-fusion approach of the prior art, the semantic segmentation results from different perspectives may differ where two views overlap, which lowers the accuracy of the semantic segmentation result.
  • The mid-fusion method of the present disclosure addresses this problem: global semantic segmentation is performed directly on the fused features, which avoids conflicting segmentation results in areas where perspectives overlap and effectively improves the accuracy of the semantic segmentation result.
  • Any multi-view semantic segmentation method provided by the embodiments of the present disclosure can be executed by any appropriate device with data processing capabilities, including but not limited to: terminal devices and servers.
  • any multi-view semantic segmentation method provided by the embodiments of the present disclosure can be executed by a processor.
  • the processor executes any multi-view semantic segmentation method mentioned in the embodiments of the present disclosure by calling corresponding instructions stored in the memory. No further details will be given below.
  • Figure 10 is a schematic structural diagram of a multi-view semantic segmentation device provided by an exemplary embodiment of the present disclosure.
  • the device of this embodiment can be used to implement corresponding method embodiments of the present disclosure.
  • the device shown in Figure 10 includes: a first determination module 501, a first processing module 502, a first fusion module 503, and a second processing module 504.
  • The first determination module 501 is used to determine the first image data corresponding to each of at least two first-type perspectives, obtaining at least two pieces of first image data.
  • The first processing module 502 is used to determine the first semantic segmentation features, under the second-type perspective, corresponding to each of the at least two pieces of first image data obtained by the first determination module 501, obtaining at least two first semantic segmentation features.
  • The first fusion module 503 is used to fuse the at least two first semantic segmentation features obtained by the first processing module 502, obtaining fused semantic segmentation features.
  • The second processing module 504 is used to obtain a fused semantic segmentation result based on the fused semantic segmentation features obtained by the first fusion module 503.
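The data flow through modules 501 to 504 can be sketched as a simple composition of four stages. The class name and the callable-based design are assumptions made for illustration; the patent does not prescribe how the modules are implemented.

```python
class MultiViewSegmentationDevice:
    """Chains four stages that mirror modules 501-504 (hypothetical design)."""

    def __init__(self, determine_images, to_bev_features, fuse, segment):
        self.determine_images = determine_images  # module 501
        self.to_bev_features = to_bev_features    # module 502
        self.fuse = fuse                          # module 503
        self.segment = segment                    # module 504

    def run(self, raw_inputs):
        images = self.determine_images(raw_inputs)
        if len(images) < 2:
            raise ValueError("at least two first-type perspectives are required")
        # One BEV feature map per input view.
        bev_features = [self.to_bev_features(image) for image in images]
        fused = self.fuse(bev_features)
        return self.segment(fused)
```

Passing the stages in as callables keeps the sketch independent of any particular network or fusion rule.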
  • Figure 11 is a schematic structural diagram of the first processing module 502 provided by an exemplary embodiment of the present disclosure.
  • the first processing module 502 includes: a feature extraction unit 5021 and a perspective conversion unit 5022.
  • The feature extraction unit 5021 is configured to perform feature extraction on each of the at least two pieces of first image data and determine the second semantic segmentation features, under the first-type perspective, corresponding to each of them, obtaining at least two second semantic segmentation features. The perspective conversion unit 5022 is used to convert the at least two second semantic segmentation features into the coordinate system corresponding to the second-type perspective, obtaining at least two first semantic segmentation features.
  • The feature extraction unit 5021 is specifically configured to: perform feature extraction on the at least two pieces of first image data based on the first semantic segmentation network model obtained through pre-training, obtaining at least two second semantic segmentation features.
  • The perspective conversion unit 5022 is specifically configured to: determine the homography transformation matrix corresponding to each of the at least two first-type perspectives, based on preset point coordinates in the image coordinate system of each perspective and on pre-obtained camera parameters, obtaining at least two homography transformation matrices; and, based on the at least two homography transformation matrices, convert the at least two second semantic segmentation features into the coordinate system corresponding to the second-type perspective, obtaining at least two first semantic segmentation features.
  • Figure 12 is a schematic structural diagram of a multi-view semantic segmentation device provided by another exemplary embodiment of the present disclosure.
  • the second processing module 504 includes: a first processing unit 5041, configured to obtain a fused semantic segmentation result based on the fused semantic segmentation features and the second semantic segmentation network model obtained by pre-training.
  • the first fusion module 503 includes: a fusion unit 5031a, configured to add feature values of the same pixel position in at least two first semantic segmentation features to obtain a fused semantic segmentation feature.
  • FIG. 13 is a schematic structural diagram of the first fusion module 503 provided by an exemplary embodiment of the present disclosure.
  • the first fusion module 503 includes:
  • The second processing unit 5031b is configured to, in response to there being at most one non-zero feature value among the feature values of the at least two first semantic segmentation features at the same pixel position, add the feature values at that pixel position and use the sum as the fused feature value of the pixel position.
  • The third processing unit 5032b is configured to, in response to there being at least two non-zero feature values among the feature values of the at least two first semantic segmentation features at the same pixel position, average the feature values at that pixel position over the number of non-zero feature values and use the mean as the fused feature value of the pixel position.
  • the fourth processing unit 5033b is used to obtain fused semantic segmentation features based on the fused feature values of each pixel position.
  • An embodiment of the present disclosure also provides an electronic device, including: a memory for storing a computer program;
  • a processor configured to execute a computer program stored in the memory, and when the computer program is executed, implement the multi-view semantic segmentation method described in any of the above embodiments of the present disclosure.
  • Figure 14 is a schematic structural diagram of an application embodiment of the electronic device of the present disclosure.
  • the electronic device 10 includes one or more processors 11 and memories 12 .
  • the processor 11 may be a central processing unit (CPU) or other form of processing unit with data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
  • Memory 12 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • the volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache).
  • the non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc.
  • One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 11 may execute the program instructions to implement the methods of the various embodiments of the present disclosure described above and/or other desired functions.
  • Various contents such as input signals, signal components, noise components, etc. may also be stored in the computer-readable storage medium.
  • the electronic device 10 may further include an input device 13 and an output device 14, and these components are interconnected through a bus system and/or other forms of connection mechanisms (not shown).
  • the input device 13 may be the above-mentioned microphone or microphone array, used to capture the input signal of the sound source.
  • the input device 13 may also include, for example, a keyboard, a mouse, and the like.
  • the output device 14 can output various information to the outside, including determined distance information, direction information, etc.
  • the output device 14 may include, for example, a display, a speaker, a printer, a communication network and remote output devices connected thereto, and the like.
  • the electronic device 10 may also include any other appropriate components depending on the specific application.
  • Embodiments of the present disclosure may also be a computer program product that includes computer program instructions which, when executed by a processor, cause the processor to perform the steps of the methods according to the various embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
  • the methods and apparatus of the present disclosure may be implemented in many ways.
  • the methods and devices of the present disclosure may be implemented through software, hardware, firmware, or any combination of software, hardware, and firmware.
  • the above order for the steps of the methods is for illustration only, and the steps of the methods of the present disclosure are not limited to the order specifically described above unless otherwise specifically stated.
  • the present disclosure may also be implemented as programs recorded in recording media, and these programs include machine-readable instructions for implementing methods according to the present disclosure.
  • the present disclosure also covers recording media storing programs for executing methods according to the present disclosure.
  • each component or each step can be decomposed and/or recombined. These decompositions and/or recombinations should be considered equivalent versions of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

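Embodiments of the present disclosure disclose a multi-view semantic segmentation method, apparatus, electronic device, and storage medium. The method includes: determining first image data corresponding to each of at least two first-type perspectives, obtaining at least two pieces of first image data; determining first semantic segmentation features, under a second-type perspective, corresponding to each of the at least two pieces of first image data, obtaining at least two first semantic segmentation features; fusing the at least two first semantic segmentation features to obtain fused semantic segmentation features; and obtaining a fused semantic segmentation result based on the fused semantic segmentation features. Using only cameras, radar, and the like, the embodiments can produce an end-to-end multi-view semantic segmentation result through mid-fusion, with no post-processing. This effectively reduces processing time and therefore driving-assistance latency, and addresses the large delay that the prior art incurs by transmitting results to a post-processing module for post-processing.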

Description

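Multi-view semantic segmentation method, apparatus, electronic device, and storage medium
The present disclosure claims priority to Chinese patent application No. 202210512773.2, filed on May 11, 2022 and entitled "Multi-view semantic segmentation method, apparatus, electronic device, and storage medium", the entire contents of which are incorporated into the present disclosure by reference.
Technical Field
The present disclosure relates to computer vision technology, and in particular to a multi-view semantic segmentation method, apparatus, electronic device, and storage medium.
Background
In computer vision fields such as autonomous driving, obtaining information about the environment around a movable device (such as an autonomous vehicle, a semi-autonomous vehicle, or an autonomous robot) is a key operation for assisting planning and control. In the related art, cameras with multiple perspectives mounted on the movable device typically capture image data of the surroundings from multiple perspectives. A neural network model then performs semantic segmentation on the image data of each perspective separately to obtain a semantic segmentation result for each perspective, and the results are transmitted to a post-processing module for post-processing, such as filtering and fusion, to obtain semantic information about the environment around the movable device. However, the post-processing in this late-fusion approach takes a long time and therefore introduces considerable latency into assisted driving.
Summary
The present disclosure is proposed to solve technical problems such as the long post-processing time described above. Embodiments of the present disclosure provide a multi-view semantic segmentation method, apparatus, electronic device, and storage medium.
According to one aspect of the embodiments of the present disclosure, a multi-view semantic segmentation method is provided, including: determining first image data corresponding to each of at least two first-type perspectives, obtaining at least two pieces of first image data; determining first semantic segmentation features, under a second-type perspective, corresponding to each of the at least two pieces of first image data, obtaining at least two first semantic segmentation features; fusing the at least two first semantic segmentation features to obtain fused semantic segmentation features; and obtaining a fused semantic segmentation result based on the fused semantic segmentation features.
According to another aspect of the embodiments of the present disclosure, a multi-view semantic segmentation apparatus is provided, including: a first determination module, used to determine first image data corresponding to each of at least two first-type perspectives, obtaining at least two pieces of first image data; a first processing module, used to determine first semantic segmentation features, under a second-type perspective, corresponding to each of the at least two pieces of first image data, obtaining at least two first semantic segmentation features; a first fusion module, used to fuse the at least two first semantic segmentation features to obtain fused semantic segmentation features; and a second processing module, used to obtain a fused semantic segmentation result based on the fused semantic segmentation features.
According to a further aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided. The storage medium stores a computer program, and the computer program is used to execute the multi-view semantic segmentation method described in any of the above embodiments of the present disclosure.
According to yet another aspect of the embodiments of the present disclosure, an electronic device is provided. The electronic device includes: a processor; and a memory for storing instructions executable by the processor. The processor is used to read the executable instructions from the memory and execute them to implement the multi-view semantic segmentation method described in any of the above embodiments of the present disclosure.
With the multi-view semantic segmentation method, apparatus, electronic device, and storage medium provided by the above embodiments, a mid-fusion approach is used: based on image data from first-type perspectives such as the camera perspective or radar perspective, semantic segmentation features under the second-type perspective (the bird's-eye view) are determined and fused at the feature stage to obtain fused bird's-eye-view semantic segmentation features, and the fused semantic segmentation result is determined from those fused features. Using only cameras, radar, and the like, mid-fusion thus produces an end-to-end multi-view semantic segmentation result with no post-processing, which effectively reduces processing time and therefore driving-assistance latency, and addresses the large delay that the prior art incurs by transmitting results to a post-processing module for post-processing.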
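The technical solutions of the present disclosure are described in further detail below with reference to the drawings and embodiments.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the more detailed description of the embodiments given with reference to the drawings. The drawings provide a further understanding of the embodiments, constitute a part of the specification, and together with the embodiments serve to explain the present disclosure; they do not limit the present disclosure. In the drawings, the same reference numerals generally denote the same components or steps.
Figure 1 is an exemplary application scenario of the multi-view semantic segmentation method provided by the present disclosure;
Figure 2 is a schematic flowchart of a multi-view semantic segmentation method provided by an exemplary embodiment of the present disclosure;
Figure 3 is a schematic diagram of the fusion of first semantic segmentation features provided by an exemplary embodiment of the present disclosure;
Figure 4 is a schematic flowchart of a multi-view semantic segmentation method provided by an exemplary embodiment of the present disclosure;
Figure 5 is a schematic flowchart of step 202 provided by an exemplary embodiment of the present disclosure;
Figure 6 is a schematic diagram of the training process of the first semantic segmentation network model provided by an exemplary embodiment of the present disclosure;
Figure 7 is a schematic diagram of the training process of the second semantic segmentation network model provided by an exemplary embodiment of the present disclosure;
Figure 8 is a schematic diagram of the principle of fusing two first semantic segmentation features provided by an exemplary embodiment of the present disclosure;
Figure 9 is a schematic flowchart of step 203 provided by an exemplary embodiment of the present disclosure;
Figure 10 is a schematic structural diagram of a multi-view semantic segmentation apparatus provided by an exemplary embodiment of the present disclosure;
Figure 11 is a schematic structural diagram of the first processing module 502 provided by an exemplary embodiment of the present disclosure;
Figure 12 is a schematic structural diagram of a multi-view semantic segmentation apparatus provided by another exemplary embodiment of the present disclosure;
Figure 13 is a schematic structural diagram of the first fusion module 503 provided by an exemplary embodiment of the present disclosure;
Figure 14 is a schematic structural diagram of an application embodiment of the electronic device of the present disclosure.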
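Detailed Description
Exemplary embodiments of the present disclosure are described in detail below with reference to the drawings. The described embodiments are only some of the embodiments of the present disclosure, not all of them, and it should be understood that the present disclosure is not limited by the exemplary embodiments described here.
It should be noted that, unless otherwise specifically stated, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.
Those skilled in the art will understand that terms such as "first" and "second" in the embodiments of the present disclosure are used only to distinguish different steps, devices, or modules; they carry no particular technical meaning and imply no necessary logical order.
It should also be understood that in the embodiments of the present disclosure, "a plurality" may mean two or more, and "at least one" may mean one, two, or more.
The embodiments of the present disclosure can be applied to electronic devices such as terminal devices, computer systems, and servers, which can operate with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with such electronic devices include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems.
Overview of the Present Disclosure
In the course of realizing the present disclosure, the inventors found that in computer vision fields such as autonomous driving, in order to assist planning and control, cameras with multiple perspectives mounted on a movable device typically capture image data of the surroundings from multiple perspectives. A neural network model then performs semantic segmentation on the image data of each perspective separately to obtain a semantic segmentation result for each perspective, and the results are transmitted to a post-processing module for post-processing, such as filtering and fusion, to obtain semantic information about the environment around the movable device. However, the post-processing in this late-fusion approach takes a long time and therefore introduces considerable latency into assisted driving.
Exemplary Overview
Figure 1 is an exemplary application scenario of the multi-view semantic segmentation method provided by the present disclosure.
In this scenario the movable device is a vehicle fitted with cameras for four perspectives: front, rear, left, and right. The camera perspective is taken as the example of the first-type perspective, and the second-type perspective is the bird's-eye view. The four cameras capture image data from the front, rear, left, and right of the vehicle and transmit it to the multi-view semantic segmentation apparatus that executes the method of the present disclosure. With this method, the bird's-eye-view semantic segmentation features corresponding to each camera perspective can be determined from that camera's image data; the bird's-eye-view features of all camera perspectives are then fused to obtain fused semantic segmentation features, from which the fused semantic segmentation result under the bird's-eye view is determined. The segmentation types can be set according to actual needs; for example, the result may include segmentation of the ground area, segmentation of lane lines, and so on, without limitation. The embodiments of the present disclosure realize end-to-end multi-view semantic segmentation through mid-fusion at the feature stage, with no post-processing, which effectively reduces processing time and therefore driving-assistance latency, improves accuracy, and addresses the large delay that the prior art incurs by transmitting results to a post-processing module for post-processing.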
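Exemplary Method
Figure 2 is a schematic flowchart of a multi-view semantic segmentation method provided by an exemplary embodiment of the present disclosure. This embodiment can be applied to an electronic device, for example an in-vehicle computing platform. As shown in Figure 2, it includes the following steps:
Step 201: determine first image data corresponding to each of at least two first-type perspectives, obtaining at least two pieces of first image data.
The first-type perspective may be the perspective of a sensor that collects information about the environment around the movable device, such as a camera perspective or a radar perspective. For camera perspectives, each camera corresponds to one perspective, and the first image data for the at least two first-type perspectives can be obtained from at least two cameras, with each piece of first image data corresponding to one first-type perspective. For radar perspectives, the collected three-dimensional point cloud data can be converted into two-dimensional image data to obtain at least two pieces of first image data. The details can be set according to actual needs.
For example, autonomous driving requires cameras with 4 or 6 perspectives to cover image capture of the vehicle's surroundings, so at each moment a corresponding number (4 or 6) of pieces of first image data is obtained.
Step 202: determine first semantic segmentation features, under the second-type perspective, corresponding to each of the at least two pieces of first image data, obtaining at least two first semantic segmentation features.
The second-type perspective is the bird's-eye view, that is, the view of a bird flying in the sky. An image under the bird's-eye view is called a bird's-eye view (BEV) image, and it provides a global image of a certain range around the movable device.
After the at least two pieces of first image data are obtained, each of them yields one corresponding first semantic segmentation feature under the second-type perspective, so at least two pieces of first image data yield at least two such features. The number of first-type perspectives can be set according to actual needs and is not limited by the present disclosure. For example, the front, rear, left, and right perspectives of an autonomous vehicle yield four first semantic segmentation features under the second-type perspective. The details will not be repeated.
In an optional example, the first semantic segmentation features can be obtained by feature extraction under the first-type perspective followed by perspective conversion. For example, features are first extracted from the first image data under the first-type perspective to obtain semantic segmentation features under that perspective, and these features are then converted to the second-type perspective based on the coordinate conversion relationship between the two perspectives, for example by inverse perspective mapping (IPM), without limitation.
Step 203: fuse the at least two first semantic segmentation features to obtain fused semantic segmentation features.
The feature map of a first semantic segmentation feature under the bird's-eye view is a global feature map covering a certain range around the movable device; that is, the first semantic segmentation feature includes pixels over the global range. For each first-type perspective, only the pixel area that corresponds to that perspective within the second-type perspective has valid feature values in its first semantic segmentation feature, and the feature values of the other pixel areas are 0. After the at least two first semantic segmentation features are fused, every pixel area of the resulting fused semantic segmentation features has valid feature values.
For example, Figure 3 is a schematic diagram of the fusion of first semantic segmentation features provided by an exemplary embodiment of the present disclosure. For a vehicle fitted with front, rear, left, and right cameras, the bird's-eye-view first semantic segmentation feature corresponding to each camera image is in effect divided into areas corresponding to the front, rear, left, and right perspectives. Taking the front view as an example, the pixel feature values of the gray front-view area in its first semantic segmentation feature are extracted from the front camera image and converted; for the other areas, the front camera image contains no relevant information, so the feature value is 0 or another representation, which can be set according to actual needs. After the first semantic segmentation features of all perspectives are fused, the resulting fused semantic segmentation features combine them into global semantic segmentation features for a certain range around the vehicle. This is only a simple example illustrating the relationship between the first semantic segmentation features and the fused semantic segmentation features and does not limit them.
In practice, different perspectives may overlap, the areas corresponding to different perspectives in the first semantic segmentation features may or may not have the same shape and size, and other fusion methods and other representations of the fused semantic segmentation features are possible; for example, the first semantic segmentation features may be fused by concatenation (concat). The present disclosure does not limit this.
Step 204: obtain a fused semantic segmentation result based on the fused semantic segmentation features.
The fused semantic segmentation result may include the segmentation types set according to actual needs and the corresponding segmented areas, such as the ground area and the lane line area. The result can be obtained from the fused semantic segmentation features in any implementable way; for example, any trained semantic segmentation network model can be applied to the fused features to obtain the fused semantic segmentation result, as set according to actual needs. The fused semantic segmentation result of the present disclosure can be used in scenarios such as positioning, navigation, and planning and control.
The multi-view semantic segmentation method provided by this embodiment uses mid-fusion: based on image data from first-type perspectives such as the camera perspective or radar perspective, semantic segmentation features under the second-type perspective (the bird's-eye view) are determined and fused at the feature stage to obtain fused bird's-eye-view semantic segmentation features, and the fused semantic segmentation result is determined from those fused features. Using only cameras, radar, and the like, mid-fusion thus produces an end-to-end multi-view semantic segmentation result with no post-processing, which effectively reduces processing time and therefore driving-assistance latency, and addresses the large delay that the prior art incurs by transmitting results to a post-processing module for post-processing.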
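Figure 4 is a schematic flowchart of a multi-view semantic segmentation method provided by an exemplary embodiment of the present disclosure.
In an optional example, step 202 may specifically include the following steps:
Step 2021: perform feature extraction on each of the at least two pieces of first image data and determine the second semantic segmentation features, under the first-type perspective, corresponding to each of them, obtaining at least two second semantic segmentation features.
Feature extraction on the first image data can be performed in any implementable way, for example with a trained feature extraction network model, or with the feature extraction part of a trained first semantic segmentation network model, as set according to actual needs. Each piece of first image data yields one corresponding second semantic segmentation feature.
Step 2022: convert the at least two second semantic segmentation features into the coordinate system corresponding to the second-type perspective, obtaining at least two first semantic segmentation features.
The coordinate system corresponding to the second-type perspective may be the movable device's own coordinate system (such as the vehicle coordinate system), the world coordinate system, or a map coordinate system; it can be set according to actual needs and is not limited by the present disclosure. Taking the camera perspective as the first-type perspective and the vehicle coordinate system for the second-type perspective, the coordinate system of the camera perspective is the image coordinate system, and the conversion relationship between the image coordinate system and the vehicle coordinate system can be determined from the camera's intrinsic and extrinsic parameters and from pre-obtained preset point coordinates in the image coordinate system. For example, the homography transformation matrix for the camera perspective can be determined from these quantities, and the second semantic segmentation features can be converted to the second-type perspective using that matrix. Other conversion methods may also be used; this embodiment does not limit them. Each second semantic segmentation feature yields one corresponding first semantic segmentation feature.
By combining feature extraction under the first-type perspective with perspective conversion, the present disclosure obtains the first semantic segmentation features, under the second-type perspective, corresponding to the first image data under the first-type perspective. This realizes perspective conversion at the feature stage and facilitates the subsequent feature fusion under the second-type perspective, thereby achieving mid-fusion.
Figure 5 is a schematic flowchart of step 202 provided by an exemplary embodiment of the present disclosure.
In an optional example, performing feature extraction on each of the at least two pieces of first image data in step 2021 to determine the corresponding second semantic segmentation features under the first-type perspective and obtain at least two second semantic segmentation features includes:
Step 20211: perform feature extraction on the at least two pieces of first image data based on the first semantic segmentation network model obtained through pre-training, obtaining at least two second semantic segmentation features.
The first semantic segmentation network model can use any implementable network structure, such as semantic segmentation network models based on FCN (Fully Convolutional Networks) and its variants, models based on UNet and its variants, models based on DeepLab and its variants, and so on. Training of the first semantic segmentation network model is supervised with segmentation type label data. When the model is used for feature extraction, the feature map output just before its final normalization layer (such as a softmax layer) can be taken as the extracted second semantic segmentation feature.
In an optional example, Figure 6 is a schematic diagram of the training process of the first semantic segmentation network model provided by an exemplary embodiment of the present disclosure. In this example, the first semantic segmentation network model is obtained as follows:
Step 301: determine first training data, which includes training image data under the first-type perspective and corresponding first label data;
The training image data under the first-type perspective may include image data from multiple perspectives. The first label data includes, for each pixel of each piece of training image data, a label for the first preset semantic segmentation type to which it belongs. The first preset semantic segmentation types can be set according to actual needs and may include, for example, at least one of a ground type, a curb type, a lane line type, a vehicle type, and other possible types. The labels can be represented in any implementable way, for example with numbers such as 0, 1, 2, 3, or in other ways. When there is only one segmentation type, the type of each pixel can be represented by 0 or 1, where 0 means the pixel does not belong to the type and 1 means it does, without limitation. The first preset semantic segmentation types can be chosen according to the requirements of fused semantic segmentation under the second-type perspective.
In an optional example, because inverse perspective mapping is used and it stretches objects that rise above the ground, the first preset semantic segmentation types can be restricted to types at ground level in order to keep the segmentation result accurate, for example lane lines, sidewalks, curbs, stop lines, arrow markings, and other types at the same height as the ground, as well as the ground type excluding road markings. The details can be set according to actual needs.
Step 302: based on the first training image data and the first label data, train the pre-established first semantic segmentation network to obtain the first semantic segmentation network model.
During training, any implementable loss function can be used, such as the cross-entropy loss function or the focal loss function. The first label data can be obtained in any implementable way. Specifically, the first training image data is used as the input of the first semantic segmentation network to obtain corresponding first output data; the current loss is determined from the first output data, the corresponding first label data, and the first loss function; the network parameters are adjusted based on the current loss; and the next iteration begins. This repeats until the current loss converges, which yields the first semantic segmentation network model. The training principle will not be described further.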
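In an optional example, converting the at least two second semantic segmentation features into the coordinate system corresponding to the second-type perspective in step 2022 to obtain at least two first semantic segmentation features includes:
Step 20221: based on preset point coordinates in the image coordinate system corresponding to each of the at least two first-type perspectives and on pre-obtained camera parameters, determine the homography transformation matrix corresponding to each of the at least two first-type perspectives, obtaining at least two homography transformation matrices.
The preset point coordinates may include the coordinates of four points. Taking one camera as an example, these are the coordinates of four points on the ground in that camera's image coordinate system, denoted $I_{img}$. The camera parameters may include intrinsic and extrinsic parameters. One homography transformation matrix is determined for each first-type perspective. Specifically, after the vehicle has been calibrated at the factory, the intrinsic parameters $k$ of the cameras deployed on the vehicle are fixed, and the extrinsic parameters $p$ of each camera can be determined through a series of calibrations. When the coordinate system corresponding to the second-type perspective is the vehicle coordinate system, the extrinsic parameters $p$ from the camera to the origin of the vehicle coordinate system (usually the center of the rear axle) can be determined. Based on the camera intrinsics $k$ and extrinsics $p$, the bird's-eye-view coordinates $I_{BEV}$ corresponding to the four point coordinates above can be obtained as follows:
$I_{BEV} = k\,p\,I_{img}$
Based on the four point pairs between the image coordinate system and the bird's-eye-view vehicle coordinate system, the homography transformation matrix for the camera can be obtained, for example through the corresponding IPM transformation function getPerspectiveTransform, expressed as follows:
$H = \mathrm{getPerspectiveTransform}(I_{img}, I_{BEV})$
The IPM transformation principle will not be described further.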
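As an illustration of what a function such as getPerspectiveTransform computes, the sketch below solves for a 3x3 homography from four point pairs with plain Gaussian elimination. It is a stand-in written for this example, not the function the disclosure uses; the helper names are hypothetical, and the normalization `h33 = 1` is an assumption that fails if the image origin maps to infinity.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for k in range(col, n + 1):
                    m[r][k] -= factor * m[col][k]
    return [m[i][n] / m[i][i] for i in range(n)]

def homography_from_points(src, dst):
    """Return H (3x3, h33 = 1) mapping four src points onto four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]

def apply_homography(h, point):
    """Map one (x, y) point through H using homogeneous coordinates."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return u, v
```

Here `src` plays the role of the four ground points in the image and `dst` the role of their bird's-eye-view coordinates; a trapezoid on the image plane typically maps to a rectangle on the ground plane.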
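Step 20222: based on the at least two homography transformation matrices, convert the at least two second semantic segmentation features into the coordinate system corresponding to the second-type perspective, obtaining at least two first semantic segmentation features.
Once the homography transformation matrix for each first-type perspective has been determined, the second semantic segmentation feature of each perspective can be converted into the coordinate system corresponding to the second-type perspective using that perspective's matrix, which yields the corresponding first semantic segmentation feature.
For example, using the homography transformation matrix $H$ above, the second semantic segmentation feature $F_{img}$ is converted to the bird's-eye view to obtain the first semantic segmentation feature $F_{BEV}$, expressed as follows:
$F_{BEV} = H F_{img}$
The conversion principle will not be described further.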
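Applying a homography to a whole feature map is usually done by inverse mapping: each output pixel is traced back through the inverse matrix and sampled from the source map. The sketch below is one way to do this under stated assumptions: nearest-neighbor sampling, a single-channel nested-list feature map, pixel coordinates ordered as (column, row), and zero fill for pixels the view does not cover. The function names are hypothetical.

```python
def invert_3x3(m):
    """Invert a 3x3 matrix via its adjugate."""
    a, b, c = m[0]
    d, e, f = m[1]
    g, h, i = m[2]
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("matrix is not invertible")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]

def warp_to_bev(feature, homography, out_height, out_width):
    """Warp an image-view feature map into a BEV grid of the given size."""
    h_inv = invert_3x3(homography)
    src_h, src_w = len(feature), len(feature[0])
    bev = [[0.0] * out_width for _ in range(out_height)]
    for v in range(out_height):
        for u in range(out_width):
            w = h_inv[2][0] * u + h_inv[2][1] * v + h_inv[2][2]
            if abs(w) < 1e-12:
                continue  # point at infinity, leave as 0
            x = (h_inv[0][0] * u + h_inv[0][1] * v + h_inv[0][2]) / w
            y = (h_inv[1][0] * u + h_inv[1][1] * v + h_inv[1][2]) / w
            col, row = round(x), round(y)
            if 0 <= row < src_h and 0 <= col < src_w:
                bev[v][u] = feature[row][col]
    return bev
```

The zero fill outside the covered area matches the description that each first semantic segmentation feature is 0 wherever its own view contributes nothing.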
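In an optional example, obtaining the fused semantic segmentation result based on the fused semantic segmentation features in step 204 includes:
Step 2041: obtain the fused semantic segmentation result based on the fused semantic segmentation features and a second semantic segmentation network model obtained through pre-training.
The second semantic segmentation network model can be any implementable semantic segmentation network model, such as models based on FCN (Fully Convolutional Networks) and its variants, models based on UNet and its variants, models based on DeepLab and its variants, and so on. The input of the second semantic segmentation network model is the fused semantic segmentation features, and during training its input is likewise the fused semantic segmentation features under the bird's-eye view.
In an optional example, Figure 7 is a schematic diagram of the training process of the second semantic segmentation network model provided by an exemplary embodiment of the present disclosure. In this example, the second semantic segmentation network model is obtained as follows:
Step 401: determine second training data, which includes training semantic segmentation feature data under the second-type perspective and corresponding second label data.
The training semantic segmentation feature data under the second-type perspective is training fused semantic segmentation feature data obtained after multi-view fusion. The second label data includes the second preset semantic segmentation type to which each pixel in the training semantic segmentation feature data belongs. The second preset semantic segmentation types are similar to the first preset semantic segmentation types and will not be described again here.
Step 402: based on the training semantic segmentation feature data and the second label data, train the pre-established second semantic segmentation network to obtain the second semantic segmentation network model.
The training semantic segmentation feature data is used as the input of the second semantic segmentation network, the second label data is used as supervision, and the network parameters are adjusted according to the loss until the loss converges, which yields the second semantic segmentation network model. The training process will not be described further. The loss function used during training can be any implementable loss function, such as the cross-entropy loss function or the focal loss function.
In an optional example, the second label data can be generated automatically from a high-definition map or from radar projection.
Specifically, once the vehicle position is known and the coordinate system corresponding to the second-type perspective is the vehicle coordinate system, the global information for the area around the vehicle position can be looked up in the high-definition map. The semantic segmentation type of every position in the high-definition map is known, so based on the conversion relationship between the vehicle coordinate system and the high-definition map coordinate system, the segmentation type of each pixel of the training semantic segmentation feature data can be read from the map, which yields the second label data automatically. Radar projection works similarly: it provides three-dimensional information around the vehicle, so based on the conversion relationship between the vehicle coordinate system and the radar coordinate system, the segmentation type of each pixel in the training semantic segmentation feature data can be determined, which yields the second label data.
Generating the second label data automatically from a high-definition map or radar projection automates annotation, which effectively reduces manual labeling cost and improves model training efficiency.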
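In an optional example, fusing the at least two first semantic segmentation features in step 203 to obtain the fused semantic segmentation features includes:
Step 2031a: add the feature values at the same pixel position in the at least two first semantic segmentation features to obtain the fused semantic segmentation features.
Each first semantic segmentation feature is a feature map of the same size, for example 512*512*1. The first semantic segmentation feature corresponding to each first-type perspective contains feature values in the area covered by that perspective, and the feature value in all other areas is 0. Therefore, the feature values at the same pixel position in the first semantic segmentation features of the different perspectives can be added together, which merges the feature values of the different first-type perspective areas into one feature map and forms global semantic segmentation features containing multi-view information. Referring to Figure 3 above, each view area includes multiple pixels. In the first semantic segmentation feature of each perspective, each pixel in the corresponding view area has a feature value, and the feature values of pixels in other areas are 0. In the fused semantic segmentation features, every pixel has a feature value: the feature values of all view areas are combined to form a global semantic segmentation feature map under the second-type perspective. The details will not be repeated.
For example, Figure 8 is a schematic diagram of the principle of fusing two first semantic segmentation features provided by an exemplary embodiment of the present disclosure.
It should be noted that in practice the first-type perspectives may overlap. After conversion to the second-type perspective, this appears as overlapping pixels between the first semantic segmentation features; that is, at the same pixel position, the feature values in two first semantic segmentation features are both non-zero. In this example, the feature values in overlapping areas can also be added directly. The first semantic segmentation features and the fused semantic segmentation features are intermediate feature maps in the end-to-end process from the image data of the first-type perspective to the semantic segmentation result of the second-type perspective. When semantic segmentation is performed with the second semantic segmentation network model, the fused features still pass through multiple network layers, and the same fusion method is used during model training, so the model learns to compensate for the errors the addition may introduce, which preserves model accuracy. Therefore, directly adding the feature values does not affect the semantic segmentation result under the second-type perspective.
In an optional example, Figure 9 is a schematic flowchart of step 203 provided by an exemplary embodiment of the present disclosure. In this example, step 203 includes:
Step 2031b: in response to there being at most one non-zero feature value among the feature values of the at least two first semantic segmentation features at the same pixel position, add the feature values at that pixel position and use the sum as the fused feature value of the pixel position.
When a pixel position has at most one non-zero feature value, the pixel lies in a non-overlapping area. The feature values can therefore be added directly to give the fused feature value of the pixel position; see the additive fusion described above. The details will not be repeated.
Step 2032b: in response to there being at least two non-zero feature values among the feature values of the at least two first semantic segmentation features at the same pixel position, average the feature values at that pixel position over the number of non-zero feature values and use the mean as the fused feature value of the pixel position.
When at least two first semantic segmentation features have non-zero feature values at the same pixel position, those features overlap at that pixel position, and the mean of the feature values can be used as the fused feature value. Averaging by the number of non-zero feature values means that when there are N non-zero feature values at the pixel position, the N values are summed and the sum is divided by N; the resulting mean is the fused feature value of the pixel position.
Step 2033b: obtain the fused semantic segmentation features based on the fused feature value of each pixel position.
The processing in the above steps gives the fused feature value of every pixel position, and thus the fused semantic segmentation features.
The embodiments of the present disclosure realize end-to-end multi-view semantic segmentation through mid-fusion at the feature stage, with no post-processing, which effectively reduces processing time and therefore driving-assistance latency, and addresses the large delay that the prior art incurs by transmitting results to a post-processing module for post-processing. In addition, with the late-fusion approach of the prior art, the semantic segmentation results from different perspectives may differ where two views overlap, which lowers the accuracy of the result. The mid-fusion method of the present disclosure addresses this problem: global semantic segmentation is performed directly on the fused features, which avoids conflicting segmentation results in areas where perspectives overlap and effectively improves the accuracy of the semantic segmentation result.
The above embodiments and optional examples of the present disclosure can be implemented individually or, where they do not conflict, combined in any way.
Any multi-view semantic segmentation method provided by the embodiments of the present disclosure can be executed by any appropriate device with data processing capabilities, including but not limited to terminal devices and servers. Alternatively, any such method can be executed by a processor; for example, the processor executes the method by calling corresponding instructions stored in a memory. This will not be repeated below.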
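Exemplary Apparatus
Figure 10 is a schematic structural diagram of a multi-view semantic segmentation apparatus provided by an exemplary embodiment of the present disclosure. The apparatus of this embodiment can be used to implement the corresponding method embodiments of the present disclosure. The apparatus shown in Figure 10 includes: a first determination module 501, a first processing module 502, a first fusion module 503, and a second processing module 504.
The first determination module 501 is used to determine the first image data corresponding to each of at least two first-type perspectives, obtaining at least two pieces of first image data. The first processing module 502 is used to determine the first semantic segmentation features, under the second-type perspective, corresponding to each of the at least two pieces of first image data obtained by the first determination module 501, obtaining at least two first semantic segmentation features. The first fusion module 503 is used to fuse the at least two first semantic segmentation features obtained by the first processing module 502, obtaining fused semantic segmentation features. The second processing module 504 is used to obtain a fused semantic segmentation result based on the fused semantic segmentation features obtained by the first fusion module 503.
Figure 11 is a schematic structural diagram of the first processing module 502 provided by an exemplary embodiment of the present disclosure.
In an optional example, the first processing module 502 includes: a feature extraction unit 5021 and a perspective conversion unit 5022.
The feature extraction unit 5021 is used to perform feature extraction on each of the at least two pieces of first image data and determine the second semantic segmentation features, under the first-type perspective, corresponding to each of them, obtaining at least two second semantic segmentation features. The perspective conversion unit 5022 is used to convert the at least two second semantic segmentation features into the coordinate system corresponding to the second-type perspective, obtaining at least two first semantic segmentation features.
In an optional example, the feature extraction unit 5021 is specifically used to: perform feature extraction on the at least two pieces of first image data based on the first semantic segmentation network model obtained through pre-training, obtaining at least two second semantic segmentation features.
In an optional example, the perspective conversion unit 5022 is specifically used to: determine the homography transformation matrix corresponding to each of the at least two first-type perspectives, based on preset point coordinates in the image coordinate system of each perspective and on pre-obtained camera parameters, obtaining at least two homography transformation matrices; and, based on the at least two homography transformation matrices, convert the at least two second semantic segmentation features into the coordinate system corresponding to the second-type perspective, obtaining at least two first semantic segmentation features.
Figure 12 is a schematic structural diagram of a multi-view semantic segmentation apparatus provided by another exemplary embodiment of the present disclosure.
In an optional example, the second processing module 504 includes: a first processing unit 5041, used to obtain the fused semantic segmentation result based on the fused semantic segmentation features and the second semantic segmentation network model obtained through pre-training.
In an optional example, the first fusion module 503 includes: a fusion unit 5031a, used to add the feature values at the same pixel position in the at least two first semantic segmentation features to obtain the fused semantic segmentation features.
In an optional example, Figure 13 is a schematic structural diagram of the first fusion module 503 provided by an exemplary embodiment of the present disclosure. In this example, the first fusion module 503 includes:
a second processing unit 5031b, used to, in response to there being at most one non-zero feature value among the feature values of the at least two first semantic segmentation features at the same pixel position, add the feature values at that pixel position and use the sum as the fused feature value of the pixel position;
a third processing unit 5032b, used to, in response to there being at least two non-zero feature values among the feature values of the at least two first semantic segmentation features at the same pixel position, average the feature values at that pixel position over the number of non-zero feature values and use the mean as the fused feature value of the pixel position;
a fourth processing unit 5033b, used to obtain the fused semantic segmentation features based on the fused feature value of each pixel position.
For the specific operation of each module in the multi-view semantic segmentation apparatus provided by the present disclosure, refer to the method embodiments above; the details are not repeated here.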
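Exemplary Electronic Device
An embodiment of the present disclosure also provides an electronic device, including: a memory for storing a computer program;
a processor, used to execute the computer program stored in the memory, where the computer program, when executed, implements the multi-view semantic segmentation method described in any of the above embodiments of the present disclosure.
Figure 14 is a schematic structural diagram of an application embodiment of the electronic device of the present disclosure. In this embodiment, the electronic device 10 includes one or more processors 11 and a memory 12.
The processor 11 may be a central processing unit (CPU) or another form of processing unit with data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
The memory 12 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, or flash memory. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 11 may execute the program instructions to implement the methods of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as input signals, signal components, and noise components may also be stored in the computer-readable storage medium.
In one example, the electronic device 10 may further include an input device 13 and an output device 14, and these components are interconnected through a bus system and/or other forms of connection mechanism (not shown).
For example, the input device 13 may be the above-mentioned microphone or microphone array, used to capture the input signal of a sound source.
In addition, the input device 13 may also include, for example, a keyboard and a mouse.
The output device 14 can output various information to the outside, including determined distance information, direction information, and so on. The output device 14 may include, for example, a display, a speaker, a printer, and a communication network with the remote output devices connected to it.
For simplicity, Figure 14 shows only some of the components of the electronic device 10 that are relevant to the present disclosure, and omits components such as buses and input/output interfaces. In addition, the electronic device 10 may include any other appropriate components depending on the specific application.
Exemplary Computer Program Product and Computer-Readable Storage Medium
In addition to the above methods and devices, embodiments of the present disclosure may also be a computer program product that includes computer program instructions which, when executed by a processor, cause the processor to perform the steps of the methods according to the various embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
The basic principles of the present disclosure have been described above in conjunction with specific embodiments. It should be pointed out, however, that the advantages, benefits, and effects mentioned in the present disclosure are only examples and not limitations, and they should not be regarded as required of every embodiment. In addition, the specific details disclosed above are provided only for illustration and ease of understanding, not as limitations; they do not restrict the present disclosure to being implemented with those specific details.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be cross-referenced. Because the system embodiments basically correspond to the method embodiments, their description is relatively brief; for relevant details, refer to the description of the method embodiments.
The block diagrams of components, apparatuses, devices, and systems in the present disclosure are only illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown. As those skilled in the art will recognize, these components, apparatuses, devices, and systems may be connected, arranged, or configured in any manner.
The methods and apparatus of the present disclosure may be implemented in many ways, for example through software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the method steps is for illustration only, and the steps of the methods of the present disclosure are not limited to the order specifically described above unless otherwise stated. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, where the programs include machine-readable instructions for implementing the methods according to the present disclosure. The present disclosure therefore also covers recording media storing programs for executing the methods according to the present disclosure.
It should also be pointed out that in the apparatuses, devices, and methods of the present disclosure, each component or each step can be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalent solutions of the present disclosure.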

Claims (11)

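  1. A multi-view semantic segmentation method, comprising:
    determining first image data corresponding to each of at least two first-type perspectives, to obtain at least two pieces of first image data;
    determining first semantic segmentation features, under a second-type perspective, corresponding to each of the at least two pieces of first image data, to obtain at least two first semantic segmentation features;
    fusing the at least two first semantic segmentation features to obtain fused semantic segmentation features;
    obtaining a fused semantic segmentation result based on the fused semantic segmentation features.
  2. The method according to claim 1, wherein the determining first semantic segmentation features, under a second-type perspective, corresponding to each of the at least two pieces of first image data, to obtain at least two first semantic segmentation features, comprises:
    performing feature extraction on each of the at least two pieces of first image data, and determining second semantic segmentation features, under the first-type perspective, corresponding to each of the at least two pieces of first image data, to obtain at least two second semantic segmentation features;
    converting the at least two second semantic segmentation features into a coordinate system corresponding to the second-type perspective, to obtain the at least two first semantic segmentation features.
  3. The method according to claim 2, wherein the performing feature extraction on each of the at least two pieces of first image data, and determining second semantic segmentation features, under the first-type perspective, corresponding to each of the at least two pieces of first image data, to obtain at least two second semantic segmentation features, comprises:
    performing feature extraction on the at least two pieces of first image data based on a first semantic segmentation network model obtained through pre-training, to obtain the at least two second semantic segmentation features.
  4. The method according to claim 2, wherein the converting the at least two second semantic segmentation features into a coordinate system corresponding to the second-type perspective, to obtain the at least two first semantic segmentation features, comprises:
    determining, based on preset point coordinates in an image coordinate system corresponding to each of the at least two first-type perspectives and on pre-obtained camera parameters, a homography transformation matrix corresponding to each of the at least two first-type perspectives, to obtain at least two homography transformation matrices;
    converting, based on the at least two homography transformation matrices, the at least two second semantic segmentation features into the coordinate system corresponding to the second-type perspective, to obtain the at least two first semantic segmentation features.
  5. The method according to claim 1, wherein the obtaining a fused semantic segmentation result based on the fused semantic segmentation features comprises:
    obtaining the fused semantic segmentation result based on the fused semantic segmentation features and a second semantic segmentation network model obtained through pre-training.
  6. The method according to any one of claims 1-5, wherein the fusing the at least two first semantic segmentation features to obtain fused semantic segmentation features comprises:
    adding feature values at the same pixel position in the at least two first semantic segmentation features, to obtain the fused semantic segmentation features.
  7. The method according to any one of claims 1-5, wherein the fusing the at least two first semantic segmentation features to obtain fused semantic segmentation features comprises:
    in response to there being at most one non-zero feature value among the feature values of the at least two first semantic segmentation features at the same pixel position, adding the feature values at the pixel position as a fused feature value of the pixel position;
    in response to there being at least two non-zero feature values among the feature values of the at least two first semantic segmentation features at the same pixel position, averaging the feature values at the pixel position over the number of non-zero feature values, as the fused feature value of the pixel position;
    obtaining the fused semantic segmentation features based on the fused feature value of each pixel position.
  8. A multi-view semantic segmentation apparatus, comprising:
    a first determination module, configured to determine first image data corresponding to each of at least two first-type perspectives, to obtain at least two pieces of first image data;
    a first processing module, configured to determine first semantic segmentation features, under a second-type perspective, corresponding to each of the at least two pieces of first image data, to obtain at least two first semantic segmentation features;
    a first fusion module, configured to fuse the at least two first semantic segmentation features to obtain fused semantic segmentation features;
    a second processing module, configured to obtain a fused semantic segmentation result based on the fused semantic segmentation features.
  9. The apparatus according to claim 8, wherein the first processing module comprises:
    a feature extraction unit, configured to perform feature extraction on each of the at least two pieces of first image data, and determine second semantic segmentation features, under the first-type perspective, corresponding to each of the at least two pieces of first image data, to obtain at least two second semantic segmentation features;
    a perspective conversion unit, configured to convert the at least two second semantic segmentation features into a coordinate system corresponding to the second-type perspective, to obtain the at least two first semantic segmentation features.
  10. A computer-readable storage medium, the storage medium storing a computer program, the computer program being used to execute the multi-view semantic segmentation method according to any one of claims 1-7.
  11. An electronic device, the electronic device comprising:
    a processor;
    a memory for storing instructions executable by the processor;
    the processor being configured to read the executable instructions from the memory and execute the instructions to implement the multi-view semantic segmentation method according to any one of claims 1-7.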
PCT/CN2023/074402 2022-05-11 2023-02-03 多视角语义分割方法、装置、电子设备和存储介质 WO2023216654A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210512773.2A CN114821506A (zh) 2022-05-11 2022-05-11 多视角语义分割方法、装置、电子设备和存储介质
CN202210512773.2 2022-05-11

Publications (1)

Publication Number Publication Date
WO2023216654A1 true WO2023216654A1 (zh) 2023-11-16

Family

ID=82513294

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/074402 WO2023216654A1 (zh) 2022-05-11 2023-02-03 多视角语义分割方法、装置、电子设备和存储介质

Country Status (2)

Country Link
CN (1) CN114821506A (zh)
WO (1) WO2023216654A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821506A (zh) * 2022-05-11 2022-07-29 北京地平线机器人技术研发有限公司 多视角语义分割方法、装置、电子设备和存储介质
CN115578702B (zh) * 2022-09-26 2023-12-05 北京百度网讯科技有限公司 道路元素的提取方法、装置、电子设备、存储介质及车辆

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
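CN108345887A (zh) * 2018-01-29 2018-07-31 清华大学深圳研究生院 Training method for image semantic segmentation model and image semantic segmentation method
CN110348351A (zh) * 2019-07-01 2019-10-18 深圳前海达闼云端智能科技有限公司 Image semantic segmentation method, terminal, and readable storage medium
CN112733919A (zh) * 2020-12-31 2021-04-30 山东师范大学 Image semantic segmentation method and system based on dilated convolution and multi-scale multi-branch
US20210334556A1 (en) * 2018-09-12 2021-10-28 Toyota Motor Europe Electronic device, system and method for determining a semantic grid of an environment of a vehicle
CN114187311A (zh) * 2021-12-14 2022-03-15 京东鲲鹏(江苏)科技有限公司 Image semantic segmentation method, apparatus, device, and storage medium
CN114821506A (zh) * 2022-05-11 2022-07-29 北京地平线机器人技术研发有限公司 Multi-view semantic segmentation method and apparatus, electronic device, and storage medium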

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
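CN109447990B (zh) * 2018-10-22 2021-06-22 北京旷视科技有限公司 Image semantic segmentation method and apparatus, electronic device, and computer-readable medium
CN113362338B (zh) * 2021-05-24 2022-07-29 国能朔黄铁路发展有限责任公司 Rail segmentation method and apparatus, computer device, and rail segmentation processing system
CN113408454B (zh) * 2021-06-29 2024-02-06 上海高德威智能交通系统有限公司 Traffic target detection method and apparatus, electronic device, and detection system
CN113673444B (zh) * 2021-08-19 2022-03-11 清华大学 Corner-pooling-based intersection multi-view target detection method and system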

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
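CN108345887A (zh) * 2018-01-29 2018-07-31 清华大学深圳研究生院 Training method for image semantic segmentation model and image semantic segmentation method
US20210334556A1 (en) * 2018-09-12 2021-10-28 Toyota Motor Europe Electronic device, system and method for determining a semantic grid of an environment of a vehicle
CN110348351A (zh) * 2019-07-01 2019-10-18 深圳前海达闼云端智能科技有限公司 Image semantic segmentation method, terminal, and readable storage medium
CN112733919A (zh) * 2020-12-31 2021-04-30 山东师范大学 Image semantic segmentation method and system based on dilated convolution and multi-scale multi-branch
CN114187311A (zh) * 2021-12-14 2022-03-15 京东鲲鹏(江苏)科技有限公司 Image semantic segmentation method, apparatus, device, and storage medium
CN114821506A (zh) * 2022-05-11 2022-07-29 北京地平线机器人技术研发有限公司 Multi-view semantic segmentation method and apparatus, electronic device, and storage medium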

Also Published As

Publication number Publication date
CN114821506A (zh) 2022-07-29

Similar Documents

Publication Publication Date Title
WO2023216654A1 (zh) Multi-view semantic segmentation method and apparatus, electronic device, and storage medium
US10817752B2 (en) Virtually boosted training
CN111639663B (zh) Multi-sensor data fusion method
EP3822852B1 (en) Method, apparatus, computer storage medium and program for training a trajectory planning model
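WO2023221566A1 (zh) 3D target detection method and apparatus based on multi-view fusion
WO2022206414A1 (zh) Three-dimensional target detection method and apparatus
CN113111751B (zh) Three-dimensional target detection method with adaptive fusion of visible-light and point cloud data
WO2023185564A1 (zh) Visual enhancement method and system based on spatially aligned feature fusion of multiple connected vehicles
WO2020215254A1 (zh) Lane line map maintenance method, electronic device, and storage medium
CN113095154A (zh) Three-dimensional target detection system and method based on millimeter-wave radar and monocular camera
WO2023216460A1 (zh) Bird's-eye-view-based multi-view 3D target detection method, memory, and system
WO2023155580A1 (zh) Object recognition method and apparatus
CN115879060B (zh) Multimodal-based autonomous driving perception method, apparatus, device, and medium
CN114913290A (zh) Multi-view fusion scene reconstruction method, perception network training method, and apparatus
CN115049820A (zh) Method and apparatus for determining occluded region, and training method for segmentation model
CN115578709A (zh) Feature-level cooperative perception fusion method and system for vehicle-road cooperation
CN112241963A (zh) Lane line recognition method and system based on vehicle-mounted video, and electronic device
CN114648639B (zh) Target vehicle detection method, system, and apparatus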
Unger et al. Multi-camera bird’s eye view perception for autonomous driving
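CN114972945A (zh) Vehicle recognition method, system, device, and storage medium with multi-camera-position information fusion
CN114913329A (zh) Image processing method, and training method and apparatus for semantic segmentation network
CN113837270B (zh) Target recognition method, apparatus, device, and storage medium
CN116343158B (zh) Training method, apparatus, device, and storage medium for lane line detection model
CN111815667B (zh) Method for high-precision detection of moving targets under camera motion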
US20240101158A1 (en) Determining a location of a target vehicle relative to a lane

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23802430

Country of ref document: EP

Kind code of ref document: A1