WO2023019444A1 - Method and device for optimizing a semantic segmentation model

Method and device for optimizing a semantic segmentation model

Info

Publication number
WO2023019444A1
Authority
WO
WIPO (PCT)
Prior art keywords
semantic segmentation
image
map
feature maps
segmentation model
Prior art date
Application number
PCT/CN2021/113095
Other languages
English (en)
French (fr)
Inventor
高彬
郑晓旭
徐航
金欢
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2021/113095
Priority to CN202180100913.9A
Publication of WO2023019444A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/10: Segmentation; Edge detection
    • G06T 7/11: Region-based segmentation

Definitions

  • the present application relates to the technical field of image processing, and more specifically, to a method and device for optimizing a semantic segmentation model.
  • Semantic segmentation is pixel-level understanding of an image: objects in the image are classified at the pixel level, that is, pixels belonging to the same type of object are grouped into one category and marked with a specified label.
  • Semantic segmentation technology is widely used in scenarios such as unmanned driving, assisted driving, automatic driving, security, and monitoring.
  • the present application provides a method and device for optimizing a semantic segmentation model, which can improve the prediction accuracy of the image semantic segmentation model.
  • the present application provides a method for optimizing a semantic segmentation model, which can be used in an optimization device for a semantic segmentation model. The method can include: the optimization device obtains a target image, the target image being obtained based on a labeled image and an unlabeled image; the optimization device inputs the target image into a first semantic segmentation model to obtain a first output result; the optimization device inputs the unlabeled image into a second semantic segmentation model to obtain a second output result, the second semantic segmentation model having the same model structure as the first semantic segmentation model; and the optimization device optimizes the first semantic segmentation model based on the target image, the first output result and the second output result.
  • the above target image is an image obtained by mixing an annotated image and an unannotated image.
  • the above labeled image means that each pixel included in the image has a label value, and the label value of each pixel is used to indicate the object category to which each pixel belongs.
  • the aforementioned unlabeled image means that each pixel included in the image does not have a label value.
  • the above label values are usually ground-truth values annotated manually.
  • a dual-model structure composed of the first semantic segmentation model and the second semantic segmentation model is adopted, wherein the first semantic segmentation model can serve as a student model and the second semantic segmentation model can serve as a teacher model; the output results of the teacher model can be used to assist and guide the training and optimization of the student model, which can improve the optimization effect of the student model.
  • since the target image input to the first semantic segmentation model is a mixture of the labeled image and the unlabeled image, it can mine the association between the labeled image and the unlabeled image more deeply (that is, it can strengthen the association between them), thereby reducing the distribution difference between the unlabeled image and the labeled image. Therefore, optimizing the first semantic segmentation model through the target image can improve the domain adaptability of the first semantic segmentation model, thereby improving its prediction accuracy.
  • since the input of the second semantic segmentation model is an unlabeled image, training the second semantic segmentation model through unlabeled images can reduce its dependence on labeled images and reduce the cost of labeling images.
  • the labeled image and the unlabeled image are usually collected in similar application scenarios or similar environments, that is, the labeled image and the unlabeled image include at least some common object categories.
  • for example, the object categories included in the labeled image are cars, people, trees and buildings, while the object categories included in the unlabeled image are cars, trees and buildings.
  • the target image may include a partial area of the labeled image and a partial area of the unlabeled image.
  • the optimization device may obtain the target image in various ways, which are not limited in this application.
  • the optimizing device may receive the target image sent by other devices (such as an image generating device). That is, the target image can be generated by the image generating device.
  • the optimization device may generate the target image based on the labeled image and the unlabeled image.
  • the optimization device may generate the target image based on the labeled image and the unlabeled image in various ways, which is not limited in this application.
  • the optimization device crops the labeled image to obtain a first sub-image, crops the unlabeled image to obtain a second sub-image, and splices the first sub-image and the second sub-image to obtain the target image.
  • the optimization device may extract a first region of interest of the labeled image based on a first mask to obtain the first sub-image; extract a second region of interest of the unlabeled image based on a second mask to obtain the second sub-image; and splice the first sub-image and the second sub-image to obtain the target image, where the position of the first region of interest in the first mask corresponds to the position of the second non-interest region in the second mask, the second non-interest region being the area of the second mask other than the second region of interest. A sketch of this splicing is shown below.
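  • A minimal sketch of this complementary-mask splicing in Python/NumPy follows; the array shapes, the half-and-half mask, and the function name mix_images are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def mix_images(labeled_img, unlabeled_img, mask):
    """Splice a labeled and an unlabeled image into one target image.

    labeled_img, unlabeled_img: H x W x C arrays of the same shape.
    mask: H x W binary array; 1 marks the first region of interest taken
    from the labeled image, 0 marks the complementary region taken from
    the unlabeled image (the second region of interest).
    """
    m = mask[..., None]  # broadcast the mask over the channel dimension
    return labeled_img * m + unlabeled_img * (1 - m)

# Example: left half from the labeled image, right half from the unlabeled one.
H, W = 8, 8
mask = np.zeros((H, W), dtype=np.float32)
mask[:, : W // 2] = 1.0
target = mix_images(np.ones((H, W, 3)), np.zeros((H, W, 3)), mask)
```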
  • the first semantic segmentation model is a pre-trained model for identifying C types of object categories, where C is an integer greater than 0.
  • the target image includes at least part or all of the C types of object categories.
  • the first output result may include the first semantic segmentation map and P first feature maps, where the value of P is greater than the number of channels of the target image, and the resolution of the first feature maps is smaller than the resolution of the target image.
  • the first semantic segmentation model may use a convolutional neural network, and the convolutional neural network includes at least a processing layer 1, a processing layer 2, a processing layer 3, and a processing layer 4.
  • the above step 202 may include: the optimization device performs feature extraction on the target image through the processing layer 1 to obtain Q feature maps 1, the resolution of the feature maps 1 being H2 × W2, where H2 is smaller than H1, W2 is smaller than W1, and Q is greater than T; maps the Q feature maps 1 to P feature maps 2 (that is, the P first feature maps) through the processing layer 2, the resolution of the feature maps 2 being H2 × W2, where P is smaller than Q; maps the Q feature maps 1 to C feature maps 3 through the processing layer 3, the resolution of the feature maps 3 being H1 × W1, the C feature maps 3 corresponding one-to-one to the C object categories, each feature map 3 including H1 × W1 confidences, each confidence representing the probability that the pixel at the corresponding position in the target image belongs to the object category corresponding to that feature map 3; and obtains, through the processing layer 4, the first semantic segmentation map based on the C feature maps 3 and the first credible threshold of each of the C object categories, the resolution of the first semantic segmentation map being H1 × W1.
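  • The shape flow through processing layers 1 to 3 can be illustrated with the following PyTorch sketch; the concrete layer types, kernel sizes and the values T=3, H1=W1=64, Q=64, P=32, C=5 are assumptions chosen only to make the dimensions concrete:

```python
import torch
import torch.nn as nn

T, H1, W1, Q, P, C = 3, 64, 64, 64, 32, 5  # illustrative values only

layer1 = nn.Conv2d(T, Q, kernel_size=3, stride=2, padding=1)  # downsample to H2 x W2
layer2 = nn.Conv2d(Q, P, kernel_size=1)                       # reduce feature dimension
layer3 = nn.ConvTranspose2d(Q, C, kernel_size=2, stride=2)    # upsample back to H1 x W1

x = torch.randn(1, T, H1, W1)            # target image
f1 = layer1(x)                           # Q feature maps 1: (1, 64, 32, 32)
f2 = layer2(f1)                          # P first feature maps: (1, 32, 32, 32)
f3 = torch.softmax(layer3(f1), dim=1)    # C confidence maps (feature maps 3): (1, 5, 64, 64)
```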
  • Y feature maps with a resolution of H × W can be referred to as a feature space of H × W × Y, which includes Y channels (that is, the depth is Y), and each of the Y channels includes H × W pixels; alternatively, Y feature maps with a resolution of H × W can be called a feature matrix of H × W × Y, which includes Y feature vectors, each of the Y feature vectors including H × W elements, where Y is an integer greater than 0.
  • the above processing layer 1 is used to down-sample the target image to obtain the Q feature maps 1 (namely feature space 1). The resolution of the feature maps 1 is lower than the resolution of the target image, that is, the processing layer 1 can reduce the resolution of the image, which reduces the computational load of the model and improves classification efficiency; in addition, Q is greater than the number of channels of the target image, that is, the processing layer 1 can increase the dimension of the feature space, thereby extracting high-dimensional spatial features of the image.
  • the processing layer 1 may include at least one convolutional layer 1.
  • the above processing layer 2 is used to map the Q feature maps 1 to the P feature maps 2 (that is, feature space 2). The resolution of the feature maps 2 is the same as that of the feature maps 1, but P is smaller than Q, that is, the processing layer 2 can reduce the dimension of the feature space to remove redundant features in the image, thereby reducing the computational load of the model.
  • the processing layer 2 may include at least one convolutional layer 2.
  • the above processing layer 3 is used to up-sample the Q feature maps 1 to obtain the C feature maps 3 (namely feature space 3). The resolution of the feature maps 3 is the same as the resolution of the target image, that is, the processing layer 3 can restore the full resolution of the target image, so as to recover more detailed features of the target image.
  • the processing layer 3 may include at least one deconvolution layer and a maximum value function layer.
  • each processing layer may further include other operation layers capable of implementing respective functions, which is not limited in this embodiment of the present application.
  • the above-mentioned processing layer 1 can also include at least one pooling layer. On one hand, the pooling layer can reduce the width and height of the feature maps, reducing the computational complexity of the convolutional neural network by reducing the amount of feature-map data; on the other hand, it can perform feature compression to extract the main features of the image.
  • the optimization device may determine, through the processing layer 4, the maximum confidence among the pixels at the same position in the C feature maps 3; if the maximum confidence is greater than or equal to the first credible threshold of the object category corresponding to the feature map 3 to which the maximum confidence belongs, it is determined that the pixel at the corresponding position in the first semantic segmentation map belongs to that object category.
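  • A sketch of this thresholded arg-max decision follows; how pixels whose maximum confidence falls below the threshold are handled is not specified by the description, so marking them with an assumed ignore_index is an illustrative choice:

```python
import torch

def first_segmentation_map(conf, thresholds, ignore_index=255):
    """conf: (C, H, W) confidences from the C feature maps 3.
    thresholds: (C,) first credible threshold of each object category."""
    max_conf, category = conf.max(dim=0)  # max confidence over the C maps per pixel
    return torch.where(max_conf >= thresholds[category],
                       category,
                       torch.full_like(category, ignore_index))

seg = first_segmentation_map(torch.rand(5, 4, 4), torch.full((5,), 0.5))
```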
  • the second semantic segmentation model is a pre-trained model for identifying the C types of object categories.
  • the unlabeled image includes at least part or all of the C types of object categories.
  • the first semantic segmentation model described in this application has the same model structure as the second semantic segmentation model, including: first, the functions of these two models are the same, that is, both are used to identify the C object categories; Second, the convolutional neural networks used in the two models have the same network structure, including the same number of processing layers, types of processing layers, and the same function of each processing layer.
  • the difference between the two models is that the parameters set by the processing layers in the two models may be different, for example, the weight value of the convolution kernel in the first semantic segmentation model is different from the weight value of the convolution kernel in the second semantic segmentation model.
  • the second output result may include a second semantic segmentation map and P second feature maps, where the resolution of the second feature maps is smaller than the resolution of the target image.
  • the optimization device can perform feature extraction on the unlabeled image to obtain Q third feature maps, the resolution of the third feature maps being H2 × W2; map the Q third feature maps to the P second feature maps, the resolution of the second feature maps being H2 × W2; and map the Q third feature maps to C fourth feature maps, the resolution of the fourth feature maps being H1 × W1, the C fourth feature maps corresponding one-to-one to the C object categories, each fourth feature map including H1 × W1 confidences that correspond one-to-one to the H1 × W1 pixels of the unlabeled image, each confidence indicating the probability that the pixel at the corresponding position in the unlabeled image belongs to the object category corresponding to that fourth feature map; based on the C fourth feature maps and the first credible threshold of each of the C object categories, the second semantic segmentation map is obtained, the resolution of the second semantic segmentation map being H1 × W1.
  • in the case where the first output result includes the first semantic segmentation map and the P first feature maps, and the second output result includes the second semantic segmentation map and the P second feature maps, the optimization device optimizing the first semantic segmentation model based on the target image, the first output result and the second output result may include: the optimization device optimizes the first semantic segmentation model based on the target image, the first semantic segmentation map, the P first feature maps, the second semantic segmentation map and the P second feature maps.
  • the optimization device can iteratively adjust the parameters of the model based on the P first feature maps, the second semantic segmentation map, the P second feature maps and a first loss function, the first loss function being used to shorten the distance between pixels belonging to the same object category and/or lengthen the distance between pixels belonging to different object categories.
  • the second semantic segmentation map output by the teacher model can guide the student model to perform contrastive learning on the P first feature maps and the P second feature maps, so that the distance between pixels of different categories is lengthened and the distance between pixels of the same category is shortened. This ensures that the feature encodings of pixels belonging to the same category are as similar as possible and those of different categories are as dissimilar as possible, which can improve the intra-class compactness and inter-class separability of the student model's segmentation, thereby improving the prediction accuracy of the student model. A sketch of one such loss follows below.
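  • One common way to realize such a loss is a supervised pixel-contrastive objective; the sketch below is a generic formulation under that assumption (the temperature, the pixel sampling and the exact form are not specified by the description):

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(feats, labels, temperature=0.1):
    """feats: (N, D) pixel features sampled from the P first and P second
    feature maps; labels: (N,) categories taken from the teacher's second
    semantic segmentation map. Same-category pairs are pulled together,
    different-category pairs pushed apart."""
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t() / temperature                 # (N, N) similarities
    self_mask = torch.eye(len(feats), dtype=torch.bool)
    pos = labels[:, None].eq(labels[None, :]).float()
    pos.masked_fill_(self_mask, 0)                        # exclude self-pairs
    log_prob = F.log_softmax(sim.masked_fill(self_mask, float('-inf')), dim=1)
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()
```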
  • the optimization device can iteratively adjust the parameters of the first semantic segmentation model based on the target image, the first semantic segmentation map and a second loss function, the second loss function being used to constrain the consistency of the predicted values and label values of the object category to which the same pixel belongs.
  • since the target image input to the student model includes part of the image area of the labeled image, the first semantic segmentation map output by the student model also covers that image area; the prediction accuracy of the student model can therefore be improved by constraining the consistency of the ground-truth and predicted values of the same pixels in the target image and the first semantic segmentation map, as sketched below.
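  • Under the assumption that the second loss function is a cross-entropy restricted to target-image pixels that originate from the labeled image, a minimal sketch is:

```python
import torch.nn.functional as F

def labeled_region_loss(student_logits, label_values, from_labeled, ignore_index=255):
    """student_logits: (B, C, H, W) predictions on the target image.
    label_values: (B, H, W) ground-truth labels copied from the labeled image.
    from_labeled: (B, H, W) bool, True where the target pixel came from
    the labeled image; other pixels carry no ground truth and are skipped."""
    labels = label_values.clone()
    labels[~from_labeled] = ignore_index
    return F.cross_entropy(student_logits, labels, ignore_index=ignore_index)
```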
  • the optimization device can iteratively adjust the parameters of the first semantic segmentation model based on the first semantic segmentation map, the second semantic segmentation map and a third loss function, the third loss function being used to constrain the consistency of the prediction results of the first semantic segmentation model and the second semantic segmentation model for the object category to which the same pixel belongs.
  • since the target image input to the student model includes part of the image area of the unlabeled image, and correspondingly the unlabeled image input to the teacher model also includes the image region corresponding to that area, the prediction accuracy of the student model can be improved by placing consistency constraints on the prediction results of the student model and the teacher model for the object category to which the same pixel belongs; a sketch follows.
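  • Assuming the third loss function is likewise a cross-entropy, with the teacher's segmentation map serving as pseudo-labels on the unlabeled-origin region, a sketch is:

```python
import torch.nn.functional as F

def teacher_consistency_loss(student_logits, teacher_seg, from_unlabeled,
                             ignore_index=255):
    """student_logits: (B, C, H, W) student predictions on the target image.
    teacher_seg: (B, H, W) teacher's (second or third) semantic segmentation
    map of the unlabeled image; only pixels of the target image that came
    from the unlabeled image are constrained."""
    pseudo = teacher_seg.clone()
    pseudo[~from_unlabeled] = ignore_index
    return F.cross_entropy(student_logits, pseudo, ignore_index=ignore_index)
```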
  • because the second semantic segmentation model is trained in an unsupervised manner, the reliability of its prediction results may be poor, and the first semantic segmentation model may consequently be poorly optimized.
  • to address this, the optimization device may obtain a target credible threshold for each object category based on the first credible threshold of each of the C object categories and the second credible threshold of each object category, where the first credible threshold of each object category is the credible threshold used by the second semantic segmentation model in the current iteration and the second credible threshold is the credible threshold used by the second semantic segmentation model in the previous iteration; a third semantic segmentation map is then obtained by screening the second semantic segmentation map with the target credible threshold of each object category, the resolution of the third semantic segmentation map being H1 × W1; and the first semantic segmentation model is optimized based on the target image, the first semantic segmentation map, the P first feature maps, the third semantic segmentation map and the P second feature maps.
  • the target credible threshold Th' can be obtained by the following formula:
  • Th' = α · Th_{t-1} + (1 - α) · Th_t
  • where α represents the weight coefficient, Th_{t-1} represents the credible threshold used by the second semantic segmentation model in the previous iteration (i.e., the second credible threshold), and Th_t represents the credible threshold used by the second semantic segmentation model in the current iteration (i.e., the first credible threshold).
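  • The update is an exponential moving average over the per-category thresholds; a tiny sketch (the value of the weight coefficient α is an assumed example):

```python
def update_thresholds(prev_th, curr_th, alpha=0.9):
    """Th' = alpha * Th_{t-1} + (1 - alpha) * Th_t, per object category."""
    return [alpha * p + (1 - alpha) * c for p, c in zip(prev_th, curr_th)]

print(update_thresholds([0.8, 0.6], [0.7, 0.9]))  # [0.79, 0.63]
```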
  • the optimization device may update the credible threshold used by the second semantic segmentation model in this round from the first credible threshold to the target credible threshold.
  • in this way, the optimization device dynamically updates the credible threshold of each object category based on the credible threshold used by the second semantic segmentation model in the previous iteration and the credible threshold used in the current iteration, ensuring that the credible threshold of each object category always stays within a reasonable numerical range. The prediction results in the second semantic segmentation map can then be screened based on the updated credible threshold of each object category to filter out unreliable predictions in the second semantic segmentation map, yielding the third semantic segmentation map; optimizing the first semantic segmentation model based on the third semantic segmentation map is conducive to improving the reliability of the first semantic segmentation model.
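  • A sketch of this screening step, under the assumption that filtered-out pixels are marked with an ignore_index so that they contribute nothing to later losses:

```python
import torch

def third_segmentation_map(conf, second_seg, target_th, ignore_index=255):
    """conf: (C, H, W) teacher confidences (the C fourth feature maps).
    second_seg: (H, W) second semantic segmentation map (long tensor).
    target_th: (C,) updated target credible threshold per category."""
    pixel_conf = conf.gather(0, second_seg.unsqueeze(0)).squeeze(0)
    keep = pixel_conf >= target_th[second_seg]     # keep reliable predictions only
    return torch.where(keep, second_seg,
                       torch.full_like(second_seg, ignore_index))
```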
  • the optimization device may perform the first semantic segmentation model based on the target image, the first semantic segmentation map, the P first feature maps, the third semantic segmentation map, and the P second feature maps optimization.
  • the optimization method may further include: the optimization device sends the optimized first semantic segmentation model, that is, the first optimized semantic segmentation model, to the semantic segmentation device.
  • the optimization device may send the first optimized semantic segmentation model to the semantic segmentation device in various ways, which is not limited in this application.
  • the optimization device may periodically send the first optimized semantic segmentation model to the semantic segmentation device based on a preset period. That is to say, the optimization device may regularly update the optimized first semantic segmentation model to the semantic segmentation device.
  • the optimization device may receive request information from the semantic segmentation device, where the request information is used to request optimization of the first semantic segmentation model; based on the request information, the optimization device will The first optimized semantic segmentation model is sent to the semantic segmentation device.
  • the present application also provides a semantic segmentation method, which can be used in a semantic segmentation device. The method can include: obtaining an image to be processed; and inputting the image to be processed into the first optimized semantic segmentation model to obtain the semantic segmentation map of the image to be processed.
  • the semantic segmentation device may obtain the image to be processed in various ways, which is not limited in this application.
  • the obtaining the image to be processed by the semantic segmentation device may include: the semantic segmentation device receiving the image to be processed sent by the camera device.
  • the camera device captures the image to be processed and sends it to the semantic segmentation device.
  • the semantic segmentation device may receive the image to be processed from other image acquisition devices, and the other image acquisition device is used to acquire the image to be processed.
  • the semantic segmentation device may obtain the first optimized semantic segmentation model.
  • the semantic segmentation device may obtain the first optimized semantic segmentation model in various ways, which is not limited in this application.
  • the semantic segmentation device may periodically receive the first optimized semantic segmentation model sent from the optimization device based on a preset period. That is to say, the semantic segmentation device may regularly receive the optimized first semantic segmentation model updated by the optimization device.
  • the semantic segmentation device may send request information to the semantic segmentation model optimization device, where the request information is used to request optimization of the first semantic segmentation model; and receive the semantic segmentation model optimization device The first optimized semantic segmentation model sent.
  • the first optimized semantic segmentation model above is obtained by optimizing the first semantic segmentation model using the optimization method provided in the first aspect; therefore, performing semantic segmentation on the image to be processed based on the first optimized semantic segmentation model can improve the accuracy of the semantic segmentation.
  • the present application also provides a semantic segmentation method, which can be used in a semantic segmentation system, the semantic segmentation system including an optimization device and a semantic segmentation device. The method can include: the optimization device obtains a target image, the target image being obtained based on a labeled image and an unlabeled image; the optimization device inputs the target image into the first semantic segmentation model to obtain a first output result; the optimization device inputs the unlabeled image into the second semantic segmentation model to obtain a second output result, the model structure of the second semantic segmentation model being the same as that of the first semantic segmentation model; the optimization device optimizes the first semantic segmentation model based on the target image, the first output result and the second output result to obtain a first optimized semantic segmentation model; the optimization device sends the first optimized semantic segmentation model to the semantic segmentation device; the semantic segmentation device obtains an image to be processed; and the semantic segmentation device inputs the image to be processed into the first optimized semantic segmentation model to obtain the semantic segmentation map of the image to be processed.
  • the semantic segmentation system may further include a display device, and the method may further include the semantic segmentation device sending the semantic segmentation graph of the image to be processed to the display device; correspondingly, the display device displays the semantic segmentation graph.
  • the present application also provides a device for optimizing a semantic segmentation model.
  • the optimization device may include an obtaining module, a first semantic segmentation module, a second semantic segmentation module and an optimization module. The obtaining module is used to obtain a target image, the target image being obtained based on a labeled image and an unlabeled image; the first semantic segmentation module is used to input the target image into the first semantic segmentation model to obtain a first output result; the second semantic segmentation module is used to input the unlabeled image into the second semantic segmentation model to obtain a second output result, the second semantic segmentation model having the same model structure as the first semantic segmentation model; and the optimization module is used to optimize the first semantic segmentation model based on the target image, the first output result and the second output result.
  • the first output result includes the first semantic segmentation map and the P first feature maps, where the value of P is greater than the number of channels of the target image and the resolution of the first feature maps is smaller than the resolution of the target image; the second output result includes the second semantic segmentation map and the P second feature maps, the resolution of the second feature maps being the same as that of the first feature maps, and the resolutions of the first and second semantic segmentation maps being the same as the resolution of the target image; the optimization module is specifically used to optimize the first semantic segmentation model based on the target image, the first semantic segmentation map, the P first feature maps, the second semantic segmentation map and the P second feature maps.
  • the size of the unlabeled image is H1 × W1 × T, and the second semantic segmentation model is used to identify C types of object categories, where H1 and W1 are both integers greater than 1, T is an integer greater than 0, and C is an integer greater than 0. The second semantic segmentation module is specifically used to: perform feature extraction on the unlabeled image to obtain Q third feature maps, the resolution of the third feature maps being H2 × W2, where H2 is smaller than H1, W2 is smaller than W1, and Q is greater than T; map the Q third feature maps to the P second feature maps, the resolution of the second feature maps being H2 × W2, where P is smaller than Q; map the Q third feature maps to C fourth feature maps, the resolution of the fourth feature maps being H1 × W1, the C fourth feature maps corresponding one-to-one to the C object categories, each fourth feature map including H1 × W1 confidences that correspond one-to-one to the H1 × W1 pixels of the unlabeled image, each confidence representing the probability that the pixel at the corresponding position in the unlabeled image belongs to the object category corresponding to that fourth feature map; and obtain the second semantic segmentation map based on the C fourth feature maps and the first credible threshold of each of the C object categories, the resolution of the second semantic segmentation map being H1 × W1.
  • the optimization device further includes a threshold updating module, the threshold updating module being configured to obtain the target credible threshold of each object category based on the first credible threshold of each object category and the second credible threshold of each object category, where the first credible threshold of each object category is the credible threshold used by the second semantic segmentation model in the current iteration and the second credible threshold is the credible threshold used by the second semantic segmentation model in the previous iteration; the third semantic segmentation map is obtained by screening the second semantic segmentation map with the target credible thresholds, the resolution of the third semantic segmentation map being H1 × W1; and the optimization module is specifically used to optimize the first semantic segmentation model based on the target image, the first semantic segmentation map, the P first feature maps, the third semantic segmentation map and the P second feature maps.
  • the optimization module is specifically configured to: iteratively adjust the parameters of the model based on the P first feature maps, the third semantic segmentation map, the P second feature maps and the first loss function, the first loss function being used to shorten the distance between pixels belonging to the same object category and/or lengthen the distance between pixels belonging to different object categories; iteratively adjust the parameters of the first semantic segmentation model based on the target image, the first semantic segmentation map and the second loss function, the second loss function being used to constrain the consistency of the predicted values and label values of the object category to which the same pixel belongs; and iteratively adjust the parameters of the first semantic segmentation model based on the first semantic segmentation map, the third semantic segmentation map and the third loss function, the third loss function being used to constrain the consistency of the prediction results of the first semantic segmentation model and the second semantic segmentation model for the object category to which the same pixel belongs.
  • the target image includes a partial area of the labeled image and a partial area of the unlabeled image.
  • the obtaining module is specifically configured to: crop the labeled image to obtain a first sub-image; crop the unlabeled image to obtain a second sub-image; and splice the first sub-image and the second sub-image to obtain the target image.
  • the present application also provides a semantic segmentation device, which may include an obtaining module and a semantic segmentation module. The obtaining module is used to obtain an image to be processed; the semantic segmentation module is used to input the image to be processed into the first optimized semantic segmentation model to obtain the semantic segmentation map of the image to be processed.
  • the obtaining module may obtain the image to be processed in various ways, which is not limited in this application.
  • the obtaining module is specifically configured to receive the image to be processed sent by the camera device.
  • the camera device captures the image to be processed and sends it to the obtaining module.
  • the obtaining module may receive the image to be processed from another image acquisition device, and the other image acquisition device is used to acquire the image to be processed.
  • the semantic segmentation module may obtain the first optimized semantic segmentation model.
  • the semantic segmentation module may obtain the first optimized semantic segmentation model in various ways, which is not limited in this application.
  • the semantic segmentation module may periodically receive the first optimized semantic segmentation model sent from the optimization device based on a preset period. That is to say, the semantic segmentation module can regularly receive the optimized first semantic segmentation model updated by the optimization device.
  • the semantic segmentation module may send request information to the semantic segmentation model optimization device, where the request information is used to request optimization of the first semantic segmentation model; and receive the semantic segmentation model optimization device The first optimized semantic segmentation model sent.
  • the present application further provides a semantic segmentation system, which may include the device for optimizing a semantic segmentation model described in the first aspect or any possible implementation thereof.
  • the system may further include the semantic segmentation device described in the second aspect or any possible implementation thereof.
  • the system may also include an image acquisition device and a display device.
  • the present application further provides a terminal, which may include the semantic segmentation system described in the sixth aspect.
  • the terminal can be a vehicle.
  • the present application also provides a device for optimizing a semantic segmentation model, which may include a communication interface and a processor, the communication interface being coupled to the processor and used to provide the processor with information and/or data, and the processor being used to run computer program instructions to execute the optimization method described in the above first aspect or any possible implementation thereof.
  • the optimization device may further include at least one memory, the memory being used to store the program code or instructions.
  • the optimization device may be a chip or an integrated circuit.
  • the present application also provides a semantic segmentation device, which may include a communication interface and a processor, the communication interface is coupled to the processor, the communication interface is used to provide information and/or data to the processor, the The processor is configured to execute computer program instructions to execute the method described in the above second aspect or any possible implementation thereof.
  • the device may further include at least one memory, the memory being used to store the program code or instructions.
  • the device may be a chip or an integrated circuit.
  • the present application also provides a computer-readable storage medium, which is used to store a computer program; when the computer program is run by a processor, the optimization method described in the above first aspect or any possible implementation thereof can be implemented.
  • the present application further provides a computer program product; when the computer program product is run on a processor, the optimization method described in the above first aspect or any possible implementation thereof is implemented, and/or the method described in the above second aspect or any possible implementation thereof is implemented.
  • the optimization device, system, computer storage medium, computer program product, chip and terminal for the semantic segmentation model provided by the present application are all used to implement the optimization method for the semantic segmentation model provided above; therefore, for the beneficial effects they can achieve, reference may be made to the beneficial effects of the optimization method for the semantic segmentation model provided above, and details are not repeated here.
  • the semantic segmentation device, computer storage medium, computer program product, chip and terminal provided in this application are all used to implement the semantic segmentation method provided above; therefore, for the beneficial effects they can achieve, reference may be made to the beneficial effects of the semantic segmentation method provided above, and details are not repeated here.
  • FIG. 1 is a schematic diagram of the size of an image;
  • FIG. 2 is a schematic diagram of a convolutional layer implementing a convolution operation;
  • FIG. 3 is a schematic flowchart of extracting a region of interest of an image to be processed through a mask provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of semantic segmentation processing provided by an embodiment of the present application;
  • FIG. 5 is a schematic block diagram of a semantic segmentation system 100 provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of an application scenario provided by an embodiment of the present application;
  • FIG. 7 is a schematic flowchart of a semantic segmentation model optimization method 200 provided by an embodiment of the present application;
  • FIG. 8 is a schematic diagram of a labeled image provided by an embodiment of the present application;
  • FIG. 9 is a schematic diagram of an unlabeled image provided by an embodiment of the present application;
  • FIG. 10 is a schematic diagram of a target image provided by an embodiment of the present application;
  • FIG. 11 is a schematic diagram of another target image provided by an embodiment of the present application;
  • FIG. 12 is a schematic flowchart of extracting the first region of interest of the labeled image through the first mask provided by an embodiment of the present application;
  • FIG. 13 is a schematic flowchart of extracting the second region of interest of the unlabeled image through the second mask provided by an embodiment of the present application;
  • FIG. 14 is a schematic flowchart of semantic segmentation of the target image by the first semantic segmentation model provided by an embodiment of the present application;
  • FIG. 15 is a schematic diagram of the processing flow of the processing layer 4 provided by an embodiment of the present application;
  • FIG. 16 is a schematic flowchart of a semantic segmentation method 300 provided by an embodiment of the present application;
  • FIG. 17 is a schematic block diagram of an optimization device 400 for a semantic segmentation model provided by an embodiment of the present application;
  • FIG. 18 is a schematic flowchart of a method for optimizing a semantic segmentation model provided by an embodiment of the present application;
  • FIG. 19 is a schematic block diagram of an optimization device 500 for a semantic segmentation model provided by an embodiment of the present application;
  • FIG. 20 is a schematic block diagram of a semantic segmentation device 600 provided by an embodiment of the present application;
  • FIG. 21 is a schematic block diagram of a semantic segmentation device 700 provided by an embodiment of the present application.
  • "At least one (item)" means one or more, and "multiple" means two or more.
  • "And/or" describes the association relationship of associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
  • The character "/" generally indicates that the contextual objects are in an "or" relationship.
  • "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items.
  • "At least one item (piece) of a, b or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c can be single or multiple.
  • a pixel is the most basic element of an image and a logical unit of size.
  • the size of the image includes the width, height and depth (depth, D) of the image.
  • the height of an image can be understood as the number of pixels included in the image in the height direction.
  • the width of an image can be understood as the number of pixels included in the image in the width direction.
  • the depth of an image may be understood as the number of channels included in the image, where the width and height of each channel of the image are the same.
  • the size of an image is H × W × M, which means that the image includes M channels, each of the M channels having a height of H pixels and a width of W pixels, where H, W and M are all integers greater than 0.
  • the width and height of an image are also referred to as the resolution of the image: if the height of an image is H pixels and the width is W pixels, the resolution of the image is said to be H × W.
  • Figure 1 shows an image with a size of 5 × 5 × 3. As shown in Figure 1, the image includes 3 channels: the red (R) channel, the green (G) channel and the blue (B) channel, where the resolutions of the R, G and B channels are all 5 × 5, that is, each channel has a width of 5 pixels and a height of 5 pixels.
  • a convolution kernel is a filter used to extract the feature map of an image.
  • the dimensions of a convolution kernel include width, height and depth, where the depth of the convolution kernel is the same as that of the input image. Performing convolution operations on an input image with multiple different convolution kernels extracts a corresponding number of different feature maps.
  • for example, a 5 × 5 × 3 convolution kernel convolved over a 7 × 7 × 3 input image yields one output feature map; convolving the 7 × 7 × 3 input image with multiple different 5 × 5 × 3 convolution kernels yields multiple different output feature maps.
  • the convolution step size refers to the distance the convolution kernel slides between two successive convolution operations in the height and width directions while sliding over the feature map of the input image to extract features. The convolution step size determines the downsampling ratio of the input image: a step size of B in the width (or height) direction downsamples the input feature map by a factor of B in that direction, where B is an integer greater than 1.
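  • For example, in PyTorch (an assumed framework here, with illustrative channel counts), a stride of B = 2 halves the spatial resolution of the feature map:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3,
                 stride=2, padding=1)   # stride 2 in width and height
x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)                    # torch.Size([1, 8, 16, 16]): 2x downsampling
```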
  • the convolution layer mainly performs convolution operation on the input image based on the set convolution kernel, convolution step size and other parameters to extract the features of the input image.
  • multiple convolutions can be performed on the same image by setting convolution kernels of different sizes, different weight values, or with different convolution steps, so as to extract as many features of the image as possible.
  • during a convolution, the K × K image block covered by the convolution kernel as it slides over the image is point-multiplied with the kernel, that is, the gray value at each point of the image block is multiplied by the weight value at the same position of the kernel, giving K × K results in total; these are accumulated and a bias is added to obtain one result, which is output as a pixel of the output image. The coordinate position of that pixel on the output image corresponds to the coordinate position of the center of the image block on the input image, where K is an integer greater than 0.
  • if the depth of the input image is N, the depth of the convolution kernel must also be N, where N is an integer greater than 0.
  • the convolution operation of the input image with the convolution kernel can be decomposed as follows: the input image of depth N and the convolution kernel of depth N are each split into N images of depth 1 and N kernels of depth 1; each depth-1 image is convolved with the corresponding depth-1 kernel, and the results are finally accumulated along the image-depth dimension to obtain one output image.
  • the output image of a convolutional layer usually includes multiple feature maps, and one convolution kernel of depth N convolved with an input image of depth N yields one feature map; therefore, to obtain a given number of feature maps, the same number of convolution kernels of depth N must be used to convolve the input image respectively.
  • Figure 2 shows the process of a convolutional layer performing a convolution operation on an input image.
  • the size of the input image is 5 × 5 × 3; the height and width boundaries of the input image are padded with 1 pixel, giving a 7 × 7 × 3 image.
  • the convolution operation includes convolving with the convolution kernel w0 with a convolution step size of 2 in the width and height directions; the size of the convolution kernel w0 is 3 × 3 × 3.
  • the three channels of the input image (i.e., channel 1, channel 2, and channel 3) are respectively convolved with the three depth slices of the kernel (convolution kernel w0-1, convolution kernel w0-2, and convolution kernel w0-3), and feature map 1 is obtained, the size of feature map 1 being 3 × 3 × 1.
  • for example, the first depth slice of w0 (that is, w0-1) is multiplied element-wise with the corresponding positions in the black box of channel 1 and then summed to get 0.
  • the black box first slides along the width direction of each channel and then along the height direction, performing a convolution operation at each position; each slide covers a distance of 2 (that is, the convolution step sizes in the width and height directions are both 2), until the convolution operation on the input image is completed, yielding the 3 × 3 × 1 feature map 1.
  • the convolution operation also includes convolving with the convolution kernel w1 with a convolution step size of 2 in the width and height directions; based on a process similar to that of the convolution kernel w0, the 3 × 3 × 1 feature map 2 is obtained.
  • the deconvolution layer is also called the transposed convolution layer.
  • the convolution step size of the deconvolution layer determines the upsampling ratio of the input image: a step size of A in the width (or height) direction upsamples the input feature map by a factor of A in that direction, where A is an integer greater than 1.
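  • Mirroring the stride example above, a transposed convolution with stride A = 2 doubles the spatial resolution (PyTorch sketch, illustrative parameters):

```python
import torch
import torch.nn as nn

deconv = nn.ConvTranspose2d(in_channels=8, out_channels=3,
                            kernel_size=2, stride=2)  # stride 2 upsampling
x = torch.randn(1, 8, 16, 16)
print(deconv(x).shape)                  # torch.Size([1, 3, 32, 32]): 2x upsampling
```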
  • deconvolution operation can be understood as the reverse process of the convolution operation as shown in FIG. 2 .
  • a labeled image means that each pixel in the image has a label value, and the label value of a pixel is used to indicate the object category to which the pixel belongs; the label values in a labeled image are manually annotated, that is, they are ground-truth values.
  • An unlabeled image means that each pixel in the image does not have a label value.
  • the mask is used to extract the region of interest in the image to be processed or to block the region of non-interest in the image to be processed.
  • the mask is usually a binary image, that is, the value of each pixel in the mask is "0" or "1", where the value of pixels in the region of interest is "1" and the value of pixels in the non-interest region is "0".
  • the principle of using a mask to extract the region of interest of the image to be processed is: each pixel value in the image to be processed is multiplied by the pixel value at the corresponding position in the mask; the pixel values in the region of interest of the image to be processed remain unchanged, while the pixel values outside the region of interest (that is, in the non-interest region) all become 0, so that the region of interest of the image to be processed is extracted.
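  • A minimal NumPy illustration of this element-wise masking (the 3 × 3 values are made up for the example):

```python
import numpy as np

image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
mask = np.array([[1, 1, 0],
                 [1, 1, 0],
                 [0, 0, 0]])   # 1 = region of interest, 0 = non-interest region

roi = image * mask             # ROI values kept, everything else becomes 0
# roi == [[1, 2, 0],
#         [4, 5, 0],
#         [0, 0, 0]]
```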
  • FIG. 3 shows a schematic flowchart of extracting a region of interest of an image to be processed through a mask.
  • taking the pixel at the first row and first column (that is, position 1) of the image to be processed as an example: the pixel value at position 1 of the image to be processed is 1, and it is multiplied by the pixel value at position 1 of the mask to obtain the value at position 1 of the rendering; similar processing is performed on the pixel values at the other positions of the image to be processed to obtain the rendering, and the region of interest in the rendering is as shown in Figure 3.
  • Semantic segmentation refers to identifying images at the pixel level.
  • the goal of semantic segmentation is to predict the object category to which the pixel at each position in the image to be processed belongs, and to annotate pixels belonging to different object categories in the image to be processed with different label values.
  • the semantic segmentation result of the image to be processed is usually represented by a semantic segmentation map, which has the same resolution as the image to be processed; the label value at each position in the semantic segmentation map represents the object category of the pixel at the corresponding position in the image to be processed.
  • FIG. 4 shows a schematic diagram of semantic segmentation processing.
  • by performing semantic segmentation on the image to be processed, the semantic segmentation map shown in (b) of FIG. 4 can be obtained, where a position with label value 1 in the semantic segmentation map indicates that the pixel at the corresponding position in the image to be processed belongs to the category tree, label value 2 indicates road, label value 3 indicates sky, label value 4 indicates building, and label value 5 indicates cloud.
  • a convolutional neural network model is essentially an input-to-output mapping that can learn a large number of mapping relationships between inputs and outputs without requiring any precise mathematical expression between them; after being trained on known samples, the neural network model acquires the ability to map between input-output pairs.
  • the semantic segmentation model is a neural network model, and the neural network model is used to perform semantic segmentation processing on an input image to obtain an output result, which is a semantic segmentation map of the input image.
  • the semantic segmentation model can use a convolutional neural network, which uses an encoder-decoder architecture.
  • the encoder gradually increases the feature dimension of the input image (that is, the number of feature maps or channels) through convolutional layers; for example, the input is down-sampled one or more times through convolutional layers to extract high-level semantic features of the input image.
  • the decoder performs one or more up-sampling operations on the high-level semantic features through deconvolution layers, gradually recovering the details and spatial dimensions of the input image, and finally outputs a semantic segmentation map whose resolution is consistent with that of the input image.
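  • A toy encoder-decoder of this shape is sketched below in PyTorch; the depths, kernel sizes and class count are assumptions, and real models add skip connections and deeper stacks:

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Encoder downsamples and widens the feature space; decoder upsamples
    back to the input resolution and predicts one score per category."""
    def __init__(self, in_ch=3, num_classes=5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 2, stride=2),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

logits = TinySegNet()(torch.randn(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 5, 64, 64]): same resolution as the input
```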
  • the loss function is used to measure the degree of inconsistency between the predicted value of the model and the real value, and it is a non-negative real-valued function. The smaller the value of the loss function, the better the robustness of the model.
  • the goal of an optimization problem is to minimize the value of the loss function.
  • the process of model optimization refers to iteratively adjusting the parameters of the model to minimize the value of the loss function of the model.
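  • The iterative adjustment is ordinarily gradient descent on the loss; a self-contained sketch with a stand-in one-layer model and dummy data (all values illustrative):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Conv2d(3, 5, kernel_size=1)          # stand-in segmentation model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 3, 16, 16)                         # dummy images
y = torch.randint(0, 5, (4, 16, 16))                  # dummy per-pixel labels

for step in range(100):                               # iterative parameter adjustment
    loss = F.cross_entropy(model(x), y)               # predicted vs. real values
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                  # move parameters to reduce the loss
```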
  • existing schemes propose using semi-supervised learning to train the semantic segmentation model, that is, training the semantic segmentation model by combining a small number of labeled images with a large number of unlabeled images and effectively mining the relationship between labeled and unlabeled images, thereby improving the generalization performance of the semantic segmentation model.
  • for example, if the training samples in the training data set include various family cars but the goal is to train a semantic segmentation model that can identify vans, the model's prediction accuracy for vans will be lower than its accuracy for family cars.
  • the present application provides an optimization method and device for a semantic segmentation model, which reduce the distribution difference between labeled images and unlabeled images by performing data enhancement on the labeled and unlabeled images in the training data set, and optimize the semantic segmentation model based on the enhanced training data set, thereby improving the prediction accuracy of the semantic segmentation model.
  • the present application also provides a semantic segmentation method and device, which can improve the accuracy of semantic segmentation.
  • FIG. 5 shows a schematic block diagram of a semantic segmentation system 100 to which the semantic segmentation method and the semantic segmentation model optimization method provided by the embodiments of the present application are applied.
  • the system 100 may include an optimization device 110 for a semantic segmentation model, and the optimization device 110 includes a first semantic segmentation model.
  • the optimization device 110 is used to optimize the first semantic segmentation model based on the training data set (including a plurality of training samples) using the semantic segmentation model optimization method provided in this application to obtain the first optimized semantic segmentation model.
  • system 100 may further include a semantic segmentation device 120 , and the semantic segmentation device 120 may communicate with the optimization device 110 .
  • the optimization device 110 is also configured to send the first optimized semantic segmentation model to the semantic segmentation device 120 .
  • the semantic segmentation device 120 is configured to input the image to be processed into the first optimized semantic segmentation model to obtain a semantic segmentation map of the image to be processed.
  • the semantic segmentation device 120 and the optimization device 110 can be the same device, and that device can both use the optimization method provided by this application to optimize the first semantic segmentation model and use the resulting first optimized semantic segmentation model to perform semantic segmentation on the image to be processed.
  • the system 100 may further include a camera device 130 and/or a display device 140 , wherein the camera device 130 may communicate with the optimization device 110 and the semantic segmentation device 120 respectively, and the display device 140 may communicate with the semantic segmentation device 120 .
  • the camera device 130 is used to capture sample images in the training data set, and send the sample images to the optimization device 110 .
  • the camera device 130 is also used to capture the image to be processed, and send the image to be processed to the semantic segmentation device 120 .
  • the semantic segmentation device 120 is further configured to send the semantic segmentation map of the image to be processed to the display device 140 .
  • the display device 140 is used for presenting the semantic segmentation map of the image to be processed.
  • the present application does not limit the specific forms of the optimization device 110 , the semantic segmentation device 120 , the camera device 130 and the display device 140 .
  • The optimization device 110, the semantic segmentation device 120, the camera device 130 and the display device 140 may be separate devices (or be respectively set in different devices).
  • Alternatively, one or more of the optimization device 110, the semantic segmentation device 120, the camera device 130 and the display device 140 may be set in the same device, while the remaining one or more devices are separate devices (or are set in different devices).
  • Alternatively, the optimization device 110, the semantic segmentation device 120, the camera device 130 and the display device 140 may all be set in the same device, which is not limited in this embodiment of the present application.
  • the camera device 130 may be a camera or a camera module.
  • the camera device 130 may include a static camera and/or a video camera for collecting sample images and/or images to be processed.
  • the display device 140 may be a display screen.
  • the display device 140 may be a touch screen for interaction between the vehicle and the user.
  • the vehicle can obtain information input by the user through the touch display screen; or, the vehicle can present a display interface (such as a semantic segmentation map) to the user through the touch display screen.
  • the foregoing system 100 may be used in various scenarios or fields, which are not limited in this application.
  • For example, the system 100 can be used in scenarios or fields of automatic driving, assisted driving or unmanned driving, where it can segment the scene graph of the environment well and output a more realistic scene graph, enabling the automatic driving system to perform safer and more reliable driving operations.
  • system 100 can be used in monitoring or security scenarios or fields, and can segment humans in the monitoring area, and perform target tracking, posture analysis and early warning based on the segmentation results.
  • For example, the system 100 can be used in medical scenes or fields, where it can segment various organs in medical images and, based on the segmentation results, present the individual organs using three-dimensional virtual reality (VR) technology.
  • FIG. 6 shows a scene diagram where the system 100 provided by the embodiment of the present application is applied.
  • the semantic segmentation device 120 , the camera device 130 and the display device 140 may be set in the vehicle, and the optimization device 110 may be set in the cloud server.
  • the above-mentioned system 100 may realize semantic segmentation of the image to be processed through the following process.
  • the semantic segmentation device 120 sends request information to the optimization device 110, where the request information is used to request optimization of the first semantic segmentation model.
  • the optimization device 110 optimizes the first semantic segmentation model by using the optimization method provided in this application to obtain a first optimized semantic segmentation model; and sends the first optimized semantic segmentation model to the semantic segmentation device 120 .
  • the camera device 130 collects images to be processed during the driving of the vehicle; and sends them to the semantic segmentation device 120 .
  • the semantic segmentation device 120 inputs the image to be processed into the first optimized semantic segmentation model to obtain a semantic segmentation map of the image to be processed; and sends it to the display device 140 .
  • the display device 140 displays the semantic segmentation map of the image to be processed.
  • In FIG. 6, the server set in the cloud is taken as an example for illustration, but the present application is not limited thereto.
  • the server may also be set on the vehicle, which is not limited in this application.
  • the foregoing devices may communicate with each other in a wired or wireless manner, which is not limited in this embodiment of the present application.
  • the above-mentioned wired manner may be to implement communication through a data line connection or through an internal bus connection.
  • the foregoing wireless manner may be to realize communication through a communication network
  • the communication network may be a local area network, or a wide area network switched through a relay (relay) device, or include a local area network and a wide area network.
  • For example, the communication network can be a wireless fidelity (Wi-Fi) hotspot network, a Wi-Fi peer-to-peer (P2P) network, a Bluetooth network, a ZigBee network, a near field communication (NFC) network, or a possible future general short-distance communication network.
  • Alternatively, the communication network may be a third-generation mobile communication technology (3G) network, a fourth-generation mobile communication technology (4G) network, a fifth-generation mobile communication technology (5G) network, a public land mobile network (PLMN) or the Internet, which is not limited in this embodiment of the present application.
  • FIG. 7 provides a schematic flowchart of a method 200 for optimizing a semantic segmentation model provided by an embodiment of the present application.
  • the method 200 can be applied to the system 100 shown in FIG. 5 , and can be executed by the optimization device 110 in the system 100 .
  • the optimization process of the optimization device may include the following steps. It should be noted that the steps listed below may be executed in various orders and/or simultaneously, and are not limited to the execution order shown in FIG. 7 .
  • step 201 the optimization device obtains a target image, and the target image is obtained based on an annotated image and an unannotated image.
  • the above target image is an image obtained by mixing an annotated image and an unannotated image.
  • the above labeled image means that each pixel included in the image has a label value, and the label value of each pixel is used to indicate the object category to which each pixel belongs.
  • the aforementioned unlabeled image means that each pixel included in the image does not have a label value.
  • the above labeled values are usually the real values labeled manually.
  • the labeled image and the unlabeled image are usually collected in similar application scenarios or similar environments, that is, the labeled image and the unlabeled image include at least some common object categories.
  • For example, the object categories included in the labeled image are cars, people, trees and buildings, and the object categories included in the unlabeled image are cars, trees and buildings.
  • FIG. 8 shows a schematic diagram of an annotated image provided by an embodiment of the present application.
  • Each pixel in the annotated image has an annotated value, and the annotated value is used to indicate the object category to which the pixel belongs.
  • For example, the object category of pixels with label value 1 is tree, the object category of pixels with label value 2 is road, the object category of pixels with label value 3 is sky, the object category of pixels with label value 4 is building, the object category of pixels with label value 5 is cloud, the object category of pixels with label value 6 is car, and the object category of pixels with label value 7 is ground.
  • FIG. 9 shows a schematic diagram of an unlabeled image provided by an embodiment of the present application.
  • the unlabeled image only includes pixels, that is, there is no label value at each pixel position in the unlabeled image.
  • the target image may include a partial area of the labeled image and a partial area of the unlabeled image.
  • FIG. 10 shows a schematic diagram of the target image provided by the embodiment of the present application.
  • The target image may include an annotated sub-image 1 and an unannotated sub-image 2, where sub-image 1 is intercepted from the annotated image and sub-image 2 is intercepted from the unannotated image.
  • FIG. 11 shows a schematic diagram of another target image provided by the embodiment of the present application.
  • The target image may include an annotated sub-image 3 and an unannotated sub-image 4, where sub-image 3 is intercepted from the annotated image and sub-image 4 is intercepted from the unannotated image.
  • the optimization device may obtain the target image in various ways, which are not limited in this application.
  • the optimizing device may receive the target image sent by other devices (such as an image generating device). That is, the target image can be generated by the image generating device.
  • the optimization device may generate the target image based on the labeled image and the unlabeled image.
  • the optimization device may generate the target image based on the labeled image and the unlabeled image in various ways, which is not limited in this application.
  • For example, the optimizing device crops the labeled image to obtain a first sub-image, crops the unlabeled image to obtain a second sub-image, and splices the first sub-image and the second sub-image to obtain the target image.
  • Specifically, the optimization device may extract a first region of interest in the labeled image based on a first mask to obtain the first sub-image, extract a second region of interest in the unlabeled image based on a second mask to obtain the second sub-image, and splice the first sub-image and the second sub-image to obtain the target image, where the position of the first region of interest in the first mask corresponds to the position of a second non-interest region in the second mask, the second non-interest region being the area of the second mask other than the second region of interest.
  • FIG. 12 shows a schematic flowchart of extracting a first region of interest of an annotated image through a first mask provided by an embodiment of the present application.
  • The above-mentioned annotated image is shown in (a) of FIG. 12, the above-mentioned first mask is shown in (b) of FIG. 12, and the above-mentioned first region of interest is shown in (c) of FIG. 12.
  • FIG. 13 shows a schematic flowchart of extracting a second region of interest of an unlabeled image through a second mask provided by the embodiment of the present application.
  • The above-mentioned unlabeled image is shown in (a) of FIG. 13, the above-mentioned second mask is shown in (b) of FIG. 13, and the above-mentioned second region of interest is shown in (c) of FIG. 13.
  • In this way, the optimization device can obtain, from the labeled image, the first sub-image corresponding to the first region of interest shown in (c) of FIG. 12, obtain, from the unlabeled image, the second sub-image corresponding to the second region of interest shown in (c) of FIG. 13, and splice the first sub-image and the second sub-image to obtain the target image shown in FIG. 11.
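  • As an illustration of this mask-based splicing, the following is a minimal sketch (assuming PyTorch tensors; the image and mask names are hypothetical, and the vertical split mirrors the target image of FIG. 10):

        import torch

        def mix_images(labeled_img, unlabeled_img, mask):
            # labeled_img, unlabeled_img: tensors of shape (C, H, W).
            # mask: binary tensor of shape (1, H, W); 1 marks the region taken
            # from the labeled image, 0 marks the complementary region taken
            # from the unlabeled image.
            return mask * labeled_img + (1 - mask) * unlabeled_img

        labeled = torch.rand(3, 1024, 1024)
        unlabeled = torch.rand(3, 1024, 1024)
        mask = torch.zeros(1, 1024, 1024)
        mask[..., :512] = 1.0          # left half from the labeled image
        target = mix_images(labeled, unlabeled, mask)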
  • step 202 the optimization device inputs the target image into the first semantic segmentation model to obtain a first output result.
  • the first semantic segmentation model is a pre-trained model for identifying C types of object categories, where C is an integer greater than 0.
  • the target image includes at least part or all of the C types of object categories.
  • The first output result may include the first semantic segmentation map and P first feature maps, where the value of P is greater than the number of channels of the target image, and the resolution of the first feature maps is smaller than the resolution of the target image.
  • the optimization method of the semantic segmentation model provided by this application, by mixing the labeled image and the unlabeled image, the association between the labeled image and the unlabeled image can be mined to reduce the distribution difference between the labeled image and the unlabeled image, By training the first semantic segmentation model with the mixed target image, the domain adaptability of the first semantic segmentation model can be improved, thereby improving the prediction accuracy of the semantic segmentation model.
  • the first semantic segmentation model may use a convolutional neural network, and the convolutional neural network includes at least a processing layer 1, a processing layer 2, a processing layer 3, and a processing layer 4.
  • In an implementation, the above step 202 may include: the optimization device performs feature extraction on the target image through processing layer 1 to obtain Q feature maps 1, where the resolution of feature map 1 is H2×W2, H2 is smaller than H1, W2 is smaller than W1, and Q is greater than T; maps the Q feature maps 1 to P feature maps 2 (that is, the P first feature maps) through processing layer 2, where the resolution of feature map 2 is H2×W2 and P is smaller than Q; maps the Q feature maps 1 to C feature maps 3 through processing layer 3, where the resolution of feature map 3 is H1×W1, the C feature maps 3 correspond one-to-one to the C object categories, each feature map 3 includes H1×W1 confidence levels corresponding to the H1×W1 pixels of the target image, and each confidence level represents the probability that the pixel at the corresponding position in the target image belongs to the object category corresponding to that feature map 3; and obtains, through processing layer 4, the first semantic segmentation map based on the C feature maps 3 and credible threshold 1 of each of the C object categories, where the resolution of the first semantic segmentation map is H1×W1.
  • In the embodiments of the present application, Y feature maps with a resolution of H×W may be referred to as an H×W×Y feature space, which includes Y channels (that is, the depth is Y), each of the Y channels including H×W pixels; or, Y feature maps with a resolution of H×W may be called an H×W×Y feature matrix, which includes Y feature vectors, each of the Y feature vectors including H×W elements, where Y is an integer greater than 0.
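  • As a concrete illustration of these two views of the same data (a hypothetical sketch using PyTorch tensor shapes):

        import torch

        # Y feature maps of resolution H×W, here Y=256 and H=W=128, form an
        # H×W×Y feature space; the conventional PyTorch layout is (Y, H, W).
        feature_space = torch.rand(256, 128, 128)
        print(feature_space.shape)    # torch.Size([256, 128, 128])

        # Viewed as a feature matrix: Y feature vectors of H*W elements each.
        feature_matrix = feature_space.flatten(1)
        print(feature_matrix.shape)   # torch.Size([256, 16384])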
  • FIG. 14 shows a schematic flowchart of performing semantic segmentation on a target image by the first semantic segmentation model provided by the embodiment of the present application.
  • As shown in FIG. 14, the 1024×1024×3 target image passes through processing layer 1 for feature extraction to obtain a 128×128×1024 feature space 1; feature space 1 is mapped to a 128×128×256 feature space 2 through processing layer 2; feature space 1 is also mapped to a 1024×1024×7 feature space 3 through processing layer 3; and feature space 3 is processed through processing layer 4 to obtain the first semantic segmentation map.
  • the target image and the first semantic segmentation map shown in FIG. 14 are only schematic diagrams, and the specific resolutions of the target image and the first semantic segmentation map are subject to the dimensions marked below the images.
  • The above processing layer 1 is used to down-sample the target image to obtain the Q feature maps 1 (namely feature space 1). The resolution of feature map 1 is lower than the resolution of the target image, that is, processing layer 1 can reduce the image resolution, which reduces the computational load of the model and improves classification efficiency; in addition, Q is greater than the number of channels of the target image, that is, processing layer 1 can increase the dimension of the feature space, thereby extracting high-dimensional spatial features of the image.
  • the processing layer 1 may include at least one convolutional layer 1 .
  • The above processing layer 2 is used to map the Q feature maps 1 to the P feature maps 2 (that is, feature space 2). Feature map 2 and feature map 1 have the same resolution, but P is smaller than Q, that is, processing layer 2 can reduce the dimension of the feature space to remove redundant features in the image, thereby reducing the computational load of the model.
  • the processing layer 2 may include at least one convolutional layer 2 .
  • The above processing layer 3 is used to up-sample the Q feature maps 1 to obtain the C feature maps 3 (namely feature space 3). The resolution of feature map 3 is the same as the resolution of the target image, that is, processing layer 3 can restore the full resolution of the target image, so as to recover more detailed features in the target image.
  • the processing layer 3 may include at least one deconvolution layer and a maximum function (argmax) layer.
  • each processing layer may further include other operation layers capable of implementing respective functions, which is not limited in this embodiment of the present application.
  • the above-mentioned processing layer 1 can also have at least one pooling layer.
  • On the one hand, the pooling layer can reduce the width and height of the feature map, reducing the computational complexity of the convolutional neural network by reducing the amount of feature map data; on the other hand, it can perform feature compression to extract the main features of the image.
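  • To make the processing layers concrete, the following is a minimal sketch of such a network (assuming PyTorch; the layer sizes follow the FIG. 14 example, the sigmoid producing per-category confidences is an assumption, and the whole module is illustrative rather than the patent's actual implementation; the thresholding of processing layer 4 is sketched separately below):

        import torch
        import torch.nn as nn

        class SegNetSketch(nn.Module):
            def __init__(self, t=3, q=1024, p=256, c=7):
                super().__init__()
                # Processing layer 1: down-sample (stride 8, so 1024x1024 -> 128x128
                # as in FIG. 14) and raise the channel dimension from T to Q.
                self.layer1 = nn.Sequential(nn.Conv2d(t, q, 8, stride=8), nn.ReLU())
                # Processing layer 2: 1x1 convolution mapping Q maps to P maps (P < Q).
                self.layer2 = nn.Conv2d(q, p, 1)
                # Processing layer 3: deconvolution restoring the full resolution,
                # one map per object category (C maps).
                self.layer3 = nn.ConvTranspose2d(q, c, 8, stride=8)

            def forward(self, x):
                f1 = self.layer1(x)                  # Q feature maps 1
                f2 = self.layer2(f1)                 # P feature maps 2
                f3 = torch.sigmoid(self.layer3(f1))  # C feature maps 3 (confidences)
                return f2, f3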
  • In an implementation, the optimization device may determine, through processing layer 4, the maximum confidence among the confidences of the pixels at the same position in the C feature maps 3; if the maximum confidence is greater than or equal to credible threshold 1 of the object category corresponding to the feature map 3 to which the maximum confidence belongs, it is determined that the pixel at the corresponding position in the first semantic segmentation map belongs to that object category.
  • FIG. 15 shows a schematic diagram of the processing flow of the processing layer 4 provided by the embodiment of the present application.
  • As shown in (a) of FIG. 15, two feature maps 3 are given. In feature map 3-1, the confidence that the pixel at position 1 belongs to object category 1 is 0.78; in feature map 3-2, the confidence that the pixel at position 1 belongs to object category 2 is 0.32. The maximum confidence corresponding to position 1 is therefore 0.78, and 0.78 (the confidence corresponding to position 1 in feature map 3-1) is greater than 0.6 (that is, credible threshold 1 of object category 1), so the pixel at position 1 in the first semantic segmentation map belongs to object category 1.
  • In feature map 3-1, the confidence that the pixel at position 2 belongs to object category 1 is 0.19; in feature map 3-2, the confidence that the pixel at position 2 belongs to object category 2 is 0.81. The maximum confidence corresponding to position 2 is 0.81, and 0.81 (the confidence corresponding to position 2 in feature map 3-2) is greater than 0.65 (that is, credible threshold 1 of object category 2), so the pixel at position 2 in the first semantic segmentation map belongs to object category 2.
  • In feature map 3-1, the confidence that the pixel at position 3 belongs to object category 1 is 0.44; in feature map 3-2, the confidence that the pixel at position 3 belongs to object category 2 is 0.56. The maximum confidence corresponding to position 3 is 0.56, but 0.56 (the confidence corresponding to position 3 in feature map 3-2) is smaller than 0.65 (that is, credible threshold 1 of object category 2), so the pixel at position 3 is not assigned to any of the C object categories.
  • In the first semantic segmentation map, different object categories are marked with different label values. For example, the pixel at position 1 of the first semantic segmentation map is marked as belonging to object category 1 by the label value "1", the pixel at position 2 is marked as belonging to object category 2 by the label value "2", and the pixel at position 3 is marked as belonging to the default category by the label value "0".
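  • A minimal sketch of this thresholded decision of processing layer 4 (assuming PyTorch; the per-category thresholds follow the FIG. 15 example):

        import torch

        def processing_layer4(f3, thresholds, default_label=0):
            # f3: tensor (C, H, W) of per-category confidences.
            # thresholds: tensor (C,) holding credible threshold 1 of each category.
            # Returns an (H, W) label map: categories are labelled 1..C, and pixels
            # whose maximum confidence falls below the winning category's threshold
            # get the default label 0.
            max_conf, winner = f3.max(dim=0)
            labels = winner + 1
            labels[max_conf < thresholds[winner]] = default_label
            return labels

        # FIG. 15 example: two categories, three positions (as a 1x3 image).
        f3 = torch.tensor([[[0.78, 0.19, 0.44]],   # feature map 3-1
                           [[0.32, 0.81, 0.56]]])  # feature map 3-2
        th = torch.tensor([0.60, 0.65])            # credible threshold 1
        print(processing_layer4(f3, th))           # tensor([[1, 2, 0]])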
  • step 203 the optimization device inputs the unlabeled image into a second semantic segmentation model to obtain a second output result.
  • the second semantic segmentation model has the same model structure as the first semantic segmentation model.
  • The same model structure of the first semantic segmentation model and the second semantic segmentation model described in this application means: first, the functions of the two models are the same, that is, both are used to identify the C object categories; second, the convolutional neural networks used by the two models have the same network structure, including the same number of processing layers, the same types of processing layers and the same function of each processing layer. The difference between the two models lies only in that the parameters set in their processing layers may be different, such as the weight values of the convolution kernels in the first semantic segmentation model and those in the second semantic segmentation model.
  • the second semantic segmentation model is a pre-trained model for identifying the C types of object categories.
  • the unlabeled image includes at least part or all of the C types of object categories.
  • the second output result may include a second semantic segmentation map and P second feature maps, where the resolution of the second feature maps is smaller than the resolution of the target image.
  • Specifically, the optimization device can perform feature extraction on the unlabeled image to obtain Q third feature maps, where the resolution of the third feature maps is H2×W2; map the Q third feature maps to the P second feature maps, where the resolution of the second feature maps is H2×W2; and map the Q third feature maps to C fourth feature maps, where the resolution of the fourth feature maps is H1×W1, the C fourth feature maps correspond one-to-one to the C object categories, each fourth feature map includes H1×W1 confidence levels corresponding to the H1×W1 pixels included in the unlabeled image, and each confidence level indicates the probability that the pixel at the corresponding position in the unlabeled image belongs to the object category corresponding to that fourth feature map. The second semantic segmentation map is then obtained based on the C fourth feature maps and the first credible threshold of each of the C object categories, and the resolution of the second semantic segmentation map is H1×W1.
  • The target image input into the first semantic segmentation model is obtained by mixing the labeled image and the unlabeled image, which makes it possible to dig deeper into the association between the labeled image and the unlabeled image; training the first semantic segmentation model with the target image can improve the domain adaptability of the first semantic segmentation model, thereby improving its generalization performance for semantic segmentation.
  • The input of the second semantic segmentation model is an unlabeled image, and training the second semantic segmentation model with unlabeled images can reduce its dependence on labeled images and reduce the cost of image annotation.
  • Step 204 the optimizing device optimizes the first semantic segmentation model based on the target image, the first output result and the second output result.
  • This application adopts a dual-model structure composed of the first semantic segmentation model and the second semantic segmentation model, where the first semantic segmentation model can serve as a student model and the second semantic segmentation model can serve as a teacher model; the output result of the teacher model can be used to assist and guide the training and optimization of the student model, which can therefore improve the optimization effect of the student model.
  • the first output result may include the first semantic segmentation map and P first feature maps
  • the second output result may include the second semantic segmentation map and P second feature maps
  • In an implementation, step 204 may include: the optimizing device optimizes the first semantic segmentation model based on the target image, the first semantic segmentation map, the P first feature maps, the second semantic segmentation map and the P second feature maps.
  • In an implementation, the optimization device can iteratively adjust the parameters of the first semantic segmentation model based on the P first feature maps, the second semantic segmentation map, the P second feature maps and a first loss function, where the first loss function is used to shorten the distance between pixels belonging to the same object category and/or lengthen the distance between pixels belonging to different object categories.
  • In this way, the second semantic segmentation map output by the teacher model can guide the student model to perform contrastive learning on the P first feature maps and the P second feature maps, so that the distance between pixels of different categories is lengthened and the distance between pixels of the same category is shortened, ensuring that the feature encodings of pixels belonging to the same category are as similar as possible and those of pixels of different categories are as dissimilar as possible; this can improve the intra-class compactness and inter-class separability of the student model's segmentation, thereby improving the prediction accuracy of the student model. A loss of this kind is sketched below.
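  • For illustration only, such a pixel-level contrastive term can be sketched as follows (assuming PyTorch; an InfoNCE-style formulation is used here as one plausible choice, not necessarily the loss actually used in the patent, and the pseudo-label map is assumed to be down-sampled to the feature resolution):

        import torch
        import torch.nn.functional as F

        def pixel_contrastive_loss(stu_feat, tea_feat, labels, tau=0.1):
            # stu_feat, tea_feat: (P, h, w) feature maps from student / teacher.
            # labels: (h, w) pseudo-labels from the teacher's segmentation map;
            # 0 means "default category" and is ignored here.
            p = stu_feat.shape[0]
            s = F.normalize(stu_feat.reshape(p, -1).T, dim=1)  # (h*w, P)
            t = F.normalize(tea_feat.reshape(p, -1).T, dim=1)  # (h*w, P)
            y = labels.reshape(-1)
            keep = y > 0
            s, t, y = s[keep], t[keep], y[keep]
            sim = s @ t.T / tau                       # pairwise similarities
            pos = (y[:, None] == y[None, :]).float()  # same-category pairs
            log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
            # Maximize similarity over same-category (positive) pairs, which
            # implicitly pushes different-category pairs apart.
            loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
            return loss.mean()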
  • In an implementation, the optimization device can iteratively adjust the parameters of the first semantic segmentation model based on the target image, the first semantic segmentation map and a second loss function, where the second loss function is used to constrain the consistency of the predicted value and the labeled value of the object category to which the same pixel belongs.
  • The target image input into the student model includes part of the image area of the labeled image, and the first semantic segmentation map output by the student model also covers that image area; therefore, the prediction accuracy of the student model can be improved by constraining the consistency of the ground-truth and predicted values of the same pixels in the target image and the first semantic segmentation map.
  • In an implementation, the optimization device can iteratively adjust the parameters of the first semantic segmentation model based on the first semantic segmentation map, the second semantic segmentation map and a third loss function, where the third loss function is used to constrain the consistency of the prediction results of the first semantic segmentation model and the second semantic segmentation model for the object category to which the same pixel belongs.
  • The target image input into the student model includes part of the image area of the unlabeled image, and correspondingly, the unlabeled image input into the teacher model also contains the image region corresponding to that part; therefore, the prediction accuracy of the student model can be improved by placing consistency constraints on the prediction results of the student model and the teacher model for the object category to which the same pixel belongs. Both constraints are sketched after this paragraph.
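  • For illustration, the second and third loss functions can both be sketched as cross-entropy terms over the relevant image regions (a hypothetical PyTorch sketch; the patent does not fix the exact loss forms):

        import torch
        import torch.nn.functional as F

        def second_loss(student_logits, gt_labels, labeled_mask):
            # Consistency of prediction and label value: cross-entropy between the
            # student's prediction and the ground-truth labels, evaluated only on
            # the labeled region of the target image.
            # student_logits: (1, C, H, W); gt_labels: (1, H, W) class indices;
            # labeled_mask: (1, H, W) float 0/1 mask of the labeled region.
            ce = F.cross_entropy(student_logits, gt_labels, reduction="none")
            return (ce * labeled_mask).sum() / labeled_mask.sum().clamp(min=1)

        def third_loss(student_logits, teacher_labels, unlabeled_mask):
            # Student/teacher consistency: the teacher's segmentation result serves
            # as the pseudo ground truth on the unlabeled region.
            ce = F.cross_entropy(student_logits, teacher_labels, reduction="none")
            return (ce * unlabeled_mask).sum() / unlabeled_mask.sum().clamp(min=1)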
  • However, since the second semantic segmentation model uses an unsupervised training method, the reliability of its prediction results may be poor, and as a result the optimization effect on the first semantic segmentation model may also be poor.
  • In view of this, the optimization device may obtain a target credible threshold of each object category based on the first credible threshold of each object category in the C object categories and the second credible threshold of each object category, where the first credible threshold of each object category is the credible threshold used by the second semantic segmentation model in the current round of iteration and the second credible threshold of each object category is the credible threshold used by the second semantic segmentation model in the previous round of iteration; obtain a third semantic segmentation map based on the C fourth feature maps and the target credible threshold of each object category, where the resolution of the third semantic segmentation map is H1×W1; and optimize the first semantic segmentation model based on the target image, the first semantic segmentation map, the P first feature maps, the third semantic segmentation map and the P second feature maps.
  • For example, the target credible threshold Th' can be obtained through the following formula (1):
  • Th' = α · Th_{t-1} + (1 − α) · Th_t    formula (1)
  • where α represents a weight coefficient, Th_{t-1} represents the credible threshold used by the second semantic segmentation model in the previous round of iteration (that is, the second credible threshold), and Th_t represents the credible threshold used by the second semantic segmentation model in the current round of iteration (that is, the first credible threshold).
  • In an implementation, the optimization device may update the credible threshold used by the second semantic segmentation model in the current round from the first credible threshold to the target credible threshold.
  • In this way, the optimization device dynamically updates the credible threshold of each object category based on the credible threshold used by the second semantic segmentation model in the previous round of iteration and the credible threshold used in the current round of iteration, which ensures that the credible threshold of each object category always stays within a reasonable numerical range. The prediction results in the second semantic segmentation map can then be screened based on the updated credible threshold of each object category to filter out the prediction results with poor reliability, so as to obtain the third semantic segmentation map; optimizing the first semantic segmentation model based on the third semantic segmentation map is conducive to improving the reliability of the first semantic segmentation model.
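  • A minimal sketch of this threshold update and screening step (assuming PyTorch; α is a hypothetical weight coefficient, and the screening reuses the thresholded-argmax logic sketched earlier):

        import torch

        def update_thresholds(th_prev, th_curr, alpha=0.9):
            # Formula (1): Th' = alpha * Th_{t-1} + (1 - alpha) * Th_t
            return alpha * th_prev + (1 - alpha) * th_curr

        def third_segmentation_map(f4, target_th, default_label=0):
            # f4: (C, H, W) confidences of the C fourth feature maps;
            # target_th: (C,) target credible thresholds. Pixels whose maximum
            # confidence falls below the threshold of the winning category are
            # filtered out (label 0), yielding the third semantic segmentation map.
            max_conf, winner = f4.max(dim=0)
            labels = winner + 1
            labels[max_conf < target_th[winner]] = default_label
            return labels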
  • That is, the optimization device may optimize the first semantic segmentation model based on the target image, the first semantic segmentation map, the P first feature maps, the third semantic segmentation map and the P second feature maps.
  • After the above optimization, an optimized first semantic segmentation model, that is, the first optimized semantic segmentation model, can be obtained.
  • the optimization method 200 may further include: the optimization device sending the optimized first semantic segmentation model, that is, the first optimized semantic segmentation model, to the semantic segmentation device.
  • the optimization device may send the first optimized semantic segmentation model to the semantic segmentation device in various ways, which is not limited in this application.
  • the optimization device may periodically send the first optimized semantic segmentation model to the semantic segmentation device based on a preset period. That is to say, the optimization device may periodically update the optimized first semantic segmentation model to the semantic segmentation device.
  • the optimization device may receive request information from the semantic segmentation device, where the request information is used to request optimization of the first semantic segmentation model; based on the request information, the optimization device will The first optimized semantic segmentation model is sent to the semantic segmentation device.
  • FIG. 16 shows a schematic flowchart of a semantic segmentation method 300 provided by an embodiment of the present application.
  • the method 300 can be applied to the system 100 shown in FIG. 5 , and can be executed by the semantic segmentation device 120 in the system 100 .
  • the semantic segmentation process of the semantic segmentation device may include the following steps. It should be noted that the steps listed below may be executed in various orders and/or simultaneously, and are not limited to the execution order shown in FIG. 16 .
  • step 301 the semantic segmentation device obtains an image to be processed.
  • step 302 the semantic segmentation device inputs the image to be processed into a first optimized semantic segmentation model to obtain a semantic segmentation map of the image to be processed.
  • the semantic segmentation device may obtain the image to be processed in various ways, which is not limited in this application.
  • the obtaining the image to be processed by the semantic segmentation device may include: the semantic segmentation device receiving the image to be processed sent by the camera device.
  • the camera device captures the image to be processed and sends it to the semantic segmentation device.
  • the semantic segmentation device may receive the image to be processed from other image acquisition devices, and the other image acquisition device is used to acquire the image to be processed.
  • the semantic segmentation device may obtain the first optimized semantic segmentation model.
  • the semantic segmentation device may obtain the first optimized semantic segmentation model in various ways, which is not limited in this application.
  • the semantic segmentation device may periodically receive the first optimized semantic segmentation model sent from the optimization device based on a preset period. That is to say, the semantic segmentation device may regularly receive the optimized first semantic segmentation model updated by the optimization device.
  • Alternatively, the semantic segmentation device may send request information to the optimization device of the semantic segmentation model, where the request information is used to request optimization of the first semantic segmentation model, and receive the first optimized semantic segmentation model sent by the optimization device.
  • In the semantic segmentation method provided in the embodiment of the present application, semantic segmentation is performed on the image to be processed using the above optimized first semantic segmentation model, which can improve the accuracy of semantic segmentation.
  • the optimization method of the semantic segmentation model and the semantic segmentation method provided by the embodiment of the present application are introduced above with reference to FIG. 7 to FIG. 16 .
  • the optimization device and the semantic segmentation device provided by the embodiment of the present application will be further introduced below.
  • FIG. 17 shows a schematic block diagram of an optimization device 400 for a semantic segmentation model provided by an embodiment of the present application.
  • The optimization device 400 may include an obtaining module 401, a first semantic segmentation module 402, a second semantic segmentation module 403 and an optimization module 404.
  • the optimization device 400 may be used in the above-mentioned system 100 , further, the optimization device 400 may be the optimization device 110 in the above-mentioned system 100 .
  • the obtaining module 401 is configured to obtain a target image, the target image is obtained based on an annotated image and an unannotated image;
  • the first semantic segmentation module 402 is configured to input the target image into the first semantic segmentation model to obtain a first output result
  • the second semantic segmentation module 403 is configured to input the unlabeled image into a second semantic segmentation model to obtain a second output result, and the second semantic segmentation model has the same model structure as the first semantic segmentation model;
  • the optimization module 404 is configured to optimize the first semantic segmentation model based on the target image, the first output result and the second output result.
  • In an implementation, the first output result includes the first semantic segmentation map and P first feature maps, where the value of P is greater than the number of channels of the target image and the resolution of the first feature maps is smaller than the resolution of the target image; the second output result includes a second semantic segmentation map and P second feature maps, where the resolution of the second feature maps is the same as the resolution of the first feature maps, and the resolutions of the first semantic segmentation map and the second semantic segmentation map are the same as the resolution of the target image; and the optimization module 404 is specifically configured to optimize the first semantic segmentation model based on the target image, the first semantic segmentation map, the P first feature maps, the second semantic segmentation map and the P second feature maps.
  • In an implementation, the size of the unlabeled image is H1×W1×T, and the second semantic segmentation model is used to identify C types of object categories, where H1 and W1 are both integers greater than 1, T is an integer greater than 0, and C is an integer greater than 0. The second semantic segmentation module 403 is specifically configured to: perform feature extraction on the unlabeled image to obtain Q third feature maps, where the resolution of the third feature maps is H2×W2, H2 is smaller than H1, W2 is smaller than W1, and Q is greater than T; map the Q third feature maps to the P second feature maps, where the resolution of the second feature maps is H2×W2 and P is smaller than Q; map the Q third feature maps to C fourth feature maps, where the resolution of the fourth feature maps is H1×W1, the C fourth feature maps correspond one-to-one to the C object categories, each fourth feature map includes H1×W1 confidence levels corresponding to the H1×W1 pixels of the unlabeled image, and each confidence level indicates the probability that the pixel at the corresponding position in the unlabeled image belongs to the object category corresponding to that fourth feature map; and obtain the second semantic segmentation map based on the C fourth feature maps and the first credible threshold of each of the C object categories.
  • In an implementation, the optimization apparatus 400 further includes a threshold updating module 405, which is configured to obtain the target credible threshold of each object category based on the first credible threshold and the second credible threshold of each of the C object categories, where the first credible threshold of each object category is the credible threshold used by the second semantic segmentation model in the current round of iteration and the second credible threshold is the credible threshold used by the second semantic segmentation model in the previous round of iteration, and to obtain a third semantic segmentation map based on the C fourth feature maps and the target credible threshold of each object category, where the resolution of the third semantic segmentation map is H1×W1. The optimization module 404 is specifically configured to optimize the first semantic segmentation model based on the target image, the first semantic segmentation map, the P first feature maps, the third semantic segmentation map and the P second feature maps.
  • In an implementation, the optimization module 404 is specifically configured to: iteratively adjust the parameters of the first semantic segmentation model based on the P first feature maps, the third semantic segmentation map, the P second feature maps and a first loss function, where the first loss function is used to shorten the distance between pixels belonging to the same object category and/or lengthen the distance between pixels belonging to different object categories; iteratively adjust the parameters of the first semantic segmentation model based on the target image, the first semantic segmentation map and a second loss function, where the second loss function is used to constrain the consistency of the predicted value and the labeled value of the object category to which the same pixel belongs; and iteratively adjust the parameters of the first semantic segmentation model based on the first semantic segmentation map, the third semantic segmentation map and a third loss function, where the third loss function is used to constrain the consistency of the prediction results of the first semantic segmentation model and the second semantic segmentation model for the object category to which the same pixel belongs.
  • the target image includes a partial area of the labeled image and a partial area of the unlabeled image.
  • In an implementation, the obtaining module 401 is specifically configured to: crop the labeled image to obtain a first sub-image; crop the unlabeled image to obtain a second sub-image; and splice the first sub-image with the second sub-image to obtain the target image.
  • the optimization device 400 may specifically be the optimization device in the above-mentioned optimization method 200 embodiment, and the optimization device 400 may be used to execute the various processes and/or steps corresponding to the optimization device in the above-mentioned optimization method 200 embodiment, To avoid repetition, details are not repeated here.
  • One or more of the various modules in the embodiment shown in FIG. 17 may be implemented by software, hardware, firmware or a combination thereof.
  • the software or firmware includes but is not limited to computer program instructions or codes, and can be executed by a hardware processor.
  • the hardware includes but is not limited to various integrated circuits, such as a central processing unit (CPU, Central Processing Unit), a digital signal processor (DSP, Digital Signal Processor), a field programmable gate array (FPGA, Field Programmable Gate Array) or Application Specific Integrated Circuit (ASIC).
  • FIG. 18 shows a schematic flowchart of a method for optimizing a semantic segmentation model provided by an embodiment of the present application.
  • The steps in the process may be executed by the optimization apparatus 400 described in FIG. 17. It should be noted that the steps listed below may be executed in various orders and/or concurrently, and are not limited to the execution order shown in FIG. 18.
  • the process includes the following steps:
  • Obtaining module 401 obtains labeled images and unlabeled images.
  • Obtaining module 401 obtains the target image based on the labeled image and the unlabeled image. For details, reference may be made to relevant introductions in step 201 of the above method.
  • the obtaining module 401 sends the target image to the first semantic segmentation module 402 and the optimization module 404 .
  • The first semantic segmentation module 402 inputs the target image into the first semantic segmentation model to obtain the first semantic segmentation map and the P first feature maps, where the value of P is greater than the number of channels of the target image and the resolution of the first feature maps is smaller than the resolution of the target image.
  • the first semantic segmentation module 402 sends the first semantic segmentation map and the P first feature maps to the optimization module 404 .
  • the second semantic segmentation module 403 obtains the unlabeled image.
  • The second semantic segmentation module 403 inputs the unlabeled image into the second semantic segmentation model to obtain the second semantic segmentation map and the P second feature maps, where the resolution of the second feature maps is the same as the resolution of the first feature maps, and the resolutions of the first semantic segmentation map and the second semantic segmentation map are the same as the resolution of the target image. For details, please refer to the relevant introduction in step 203 of the above method.
  • the second semantic segmentation model has the same model structure as the first semantic segmentation model, both of which are used to identify C types of object categories, where C is an integer greater than 0.
  • the second semantic segmentation module 403 sends the second semantic segmentation map to the threshold update module 405 .
  • the second semantic segmentation module 403 sends the P second feature maps to the optimization module 404 .
  • The threshold update module 405 obtains the target credible threshold of each object category based on the first credible threshold of each of the C object categories used by the second semantic segmentation model in the current round of iteration and the second credible threshold of each object category used in the previous round of iteration.
  • the threshold updating module 405 obtains a third semantic segmentation map based on the second semantic segmentation map and the target credible threshold of each object category.
  • the threshold update module 405 sends the third semantic segmentation map to the optimization module 404 .
  • the threshold update module 405 sends the target credible threshold to the second semantic segmentation module 403 .
  • the second semantic segmentation module 403 updates the credibility threshold used by the second semantic segmentation model in the current iteration process from the first credibility threshold to the target credibility threshold.
  • Optimization module 404 optimizes the first semantic segmentation model based on the target image, the first semantic segmentation map, the P first feature maps, the third semantic segmentation map, and the P second feature maps , such as iteratively adjusting the model parameters of the first semantic segmentation model. For details, reference may be made to the relevant introduction in the above-mentioned step 204 .
  • FIG. 19 shows a schematic block diagram of an optimization apparatus 500 for a semantic segmentation model provided by an embodiment of the present application.
  • the optimization apparatus 500 may include a processor 501 and a communication interface 502 , and the processor 501 and the communication interface 502 are coupled.
  • the communication interface 502 is used to input image data to the processor 501, and/or output image data from the processor 501; the processor 501 runs computer programs or instructions, so that the optimization device 500 implements the optimization method described in the embodiment of the method 200 above .
  • The processor 501 in the embodiment of the present application includes but is not limited to a central processing unit (Central Processing Unit, CPU), a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA), discrete gate or transistor logic devices, or discrete hardware components.
  • a general-purpose processor may be a microprocessor, a microcontroller, or any conventional processor or the like.
  • the processor 501 is configured to obtain a target image through the communication interface 502, and the target image is obtained based on an annotated image and an unlabeled image; input the target image into the first semantic segmentation model to obtain a first output result; The image is input into the second semantic segmentation model to obtain a second output result.
  • The second semantic segmentation model has the same model structure as the first semantic segmentation model; the first semantic segmentation model is optimized based on the target image, the first output result and the second output result.
  • The optimization device 500 may specifically be the optimization device in the above-mentioned optimization method 200 embodiment, and may be used to execute the various processes and/or steps corresponding to the optimization device in that embodiment; to avoid repetition, details are not repeated here.
  • the optimization apparatus 500 may further include a memory 503 .
  • Memory 503 may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory.
  • The non-volatile memory can be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM) or a flash memory.
  • the volatile memory can be Random Access Memory (RAM), which acts as external cache memory.
  • By way of example and not limitation, many forms of RAM are available, such as static random access memory (Static Random Access Memory, SRAM), dynamic random access memory (Dynamic Random Access Memory, DRAM), synchronous dynamic random access memory (Synchronous Dynamic Random Access Memory, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (Synchlink DRAM, SLDRAM) and direct rambus random access memory (Direct Rambus RAM, DR RAM).
  • the memory 503 is used to store program codes and instructions of the optimization device.
  • the memory 503 is also used to store the image data obtained by the processor 501 during the execution of the optimization method 200 described above, such as the target image obtained through the communication interface 502 .
  • the memory 503 may be an independent device or integrated in the processor 501 .
  • FIG. 19 only shows a simplified design of the optimization device 500 .
  • In practice, the optimization device 500 can also include other necessary components, including but not limited to any number of communication interfaces, processors, controllers, memories, etc., and all optimization devices 500 that can implement this application fall within the scope of protection of this application.
  • the optimization device 500 may be a chip.
  • the chip may also include one or more memories for storing computer-executable instructions.
  • the processor may execute the computer-executable instructions stored in the memory, so that the chip performs the optimization method described above.
  • In an implementation, the chip device can be a field programmable gate array, an ASIC, a system chip, a central processing unit, a network processor, a digital signal processing circuit, a microcontroller, a programmable controller or other integrated chips for realizing related functions.
  • the embodiment of the present application also provides a computer-readable storage medium, in which computer instructions are stored, and when the computer instructions are run on the computer, the optimization method described in the foregoing method embodiments is implemented.
  • the embodiment of the present application further provides a computer program product, when the computer program product is run on a processor, the optimization method described in the foregoing method embodiments is implemented.
  • The optimization device, computer-readable storage medium, computer program product or chip provided in the embodiments of the present application are all used to execute the corresponding optimization method provided above; therefore, for the beneficial effects they can achieve, reference may be made to the beneficial effects of the corresponding optimization method, which are not repeated here.
  • FIG. 20 shows a schematic block diagram of a semantic segmentation device 600 provided by an embodiment of the present application.
  • the device 600 may include an obtaining module 601 and a semantic segmentation module 602 .
  • the apparatus 600 may be used in the above-mentioned system 100 , further, the apparatus 600 may be the semantic segmentation apparatus 120 in the above-mentioned system 100 .
  • the obtaining module 601 is used to obtain images to be processed.
  • the semantic segmentation module 602 is configured to input the image to be processed into the first optimized semantic segmentation model to obtain a semantic segmentation map of the image to be processed.
  • The first optimized semantic segmentation model is obtained by optimizing the first semantic segmentation model through the optimization method 200 provided in the embodiment of the present application, and the specific optimization method is not repeated here.
  • One or more of the various modules in the embodiment shown in FIG. 20 may be implemented by software, hardware, firmware or a combination thereof.
  • the software or firmware includes but is not limited to computer program instructions or codes, and can be executed by a hardware processor.
  • the hardware includes but is not limited to various integrated circuits, such as a central processing unit (CPU, Central Processing Unit), a digital signal processor (DSP, Digital Signal Processor), a field programmable gate array (FPGA, Field Programmable Gate Array) or Application Specific Integrated Circuit (ASIC).
  • FIG. 21 shows a schematic block diagram of a semantic segmentation device 700 provided by an embodiment of the present application.
  • the device 700 may include a processor 701 and a communication interface 702 , and the processor 701 and the communication interface 702 are coupled.
  • the communication interface 702 is used to input image data to the processor 701, and/or output image data from the processor 701; the processor 701 runs computer programs or instructions, so that the device 700 implements the semantic segmentation method described in the embodiment of the above-mentioned method 300 .
  • The processor 701 in the embodiment of the present application includes but is not limited to a central processing unit (Central Processing Unit, CPU), a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA), discrete gate or transistor logic devices, or discrete hardware components.
  • a general-purpose processor may be a microprocessor, a microcontroller, or any conventional processor or the like.
  • the processor 701 is configured to obtain an image to be processed through the communication interface 702; input the image to be processed into a first optimized semantic segmentation model to obtain a semantic segmentation map of the image to be processed.
  • The device 700 may specifically be the semantic segmentation device in the above-mentioned embodiment of the method 300, and may be used to execute each process and/or step corresponding to the semantic segmentation device in that embodiment; to avoid repetition, no more details are given here.
  • the device 700 may further include a memory 703 .
  • Memory 703 may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory.
  • The non-volatile memory can be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM) or a flash memory.
  • the volatile memory can be Random Access Memory (RAM), which acts as external cache memory.
  • By way of example and not limitation, many forms of RAM are available, such as static random access memory (Static Random Access Memory, SRAM), dynamic random access memory (Dynamic Random Access Memory, DRAM), synchronous dynamic random access memory (Synchronous Dynamic Random Access Memory, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (Synchlink DRAM, SLDRAM) and direct rambus random access memory (Direct Rambus RAM, DR RAM).
  • The memory 703 is used to store the program code and instructions of the device.
  • The memory 703 is also used to store the image data obtained by the processor 701 during execution of the embodiment of the method 300 above, such as the image to be processed obtained through the communication interface 702.
  • The memory 703 may be an independent device or may be integrated in the processor 701.
  • FIG. 21 shows only a simplified design of the device 700.
  • In practical applications, the device 700 may also include other necessary components, including but not limited to any number of communication interfaces, processors, controllers, memories, and the like; all devices 700 that can implement the present application fall within the protection scope of the present application.
  • In one possible design, the device 700 may be a chip.
  • The chip may also include one or more memories for storing computer-executable instructions.
  • The processor may execute the computer-executable instructions stored in the memory, so that the chip performs the semantic segmentation method described above.
  • The chip may be a field programmable gate array, an application specific integrated circuit, a system-on-chip, a central processing unit, a network processor, a digital signal processing circuit, or a microcontroller that implements the related functions, or may use a programmable controller or another integrated chip.
  • An embodiment of the present application also provides a computer-readable storage medium storing computer instructions; when the computer instructions are run on a computer, the semantic segmentation method described in the foregoing method embodiments is implemented.
  • An embodiment of the present application also provides a computer program product; when the computer program product is run on a processor, the semantic segmentation method described in the foregoing method embodiments is implemented.
  • The semantic segmentation device, computer-readable storage medium, computer program product, and chip provided in the embodiments of the present application are all used to implement the corresponding semantic segmentation method provided above; therefore, for the beneficial effects they can achieve, reference may be made to the beneficial effects of the corresponding semantic segmentation method provided above, which are not repeated here.
  • In the several embodiments provided in the present application, it should be understood that the disclosed devices and methods may be implemented in other ways.
  • The device embodiments described above are merely illustrative.
  • The division into modules is only a division by logical function; in actual implementation, there may be other ways of division.
  • For example, multiple modules or components may be combined or integrated into another device, or some features may be ignored or not implemented.
  • The mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces; the indirect coupling or communication connection between devices may be electrical, mechanical, or in other forms.
  • A unit described as a separate component may or may not be physically separate, and a component shown as a unit may be one physical unit or multiple physical units, which may be located in one place or distributed across multiple different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Abstract

A method and apparatus for optimizing a semantic segmentation model, capable of improving the prediction accuracy of the semantic segmentation model. The optimization method may include: obtaining a target image, the target image being obtained on the basis of an annotated image and an unannotated image (201); inputting the target image into a first semantic segmentation model to obtain a first output result (202); inputting the unannotated image into a second semantic segmentation model to obtain a second output result (203), the second semantic segmentation model having the same model structure as the first semantic segmentation model; and optimizing the first semantic segmentation model on the basis of the target image, the first output result, and the second output result (204).

Description

语义分割模型的优化方法和装置 技术领域
本申请涉及图像处理技术领域,并且更具体地,涉及语义分割模型的优化方法和装置。
背景技术
语义分割技术是对图像像素级别的理解,是在图像上对物体进行像素级的分类,即将属于同一类物体的像素归为一类,使用指定的标签(label)进行标记。目前,语义分割技术广泛应用于无人驾驶、辅助驾驶、自动驾驶、安防、监控等场景。
现有技术中,通常利用大量标注图像作为训练样本对语义分割模型进行训练,训练样本越多,得到的语义分割模型的准确度越高。
然而,由于目前标注图像都是由人工标注的,需要耗费大量的人力和财力,因此,人工标注的训练样本的数量非常有限,这样就会导致训练得到的语义分割模型的泛化性能较差,从而降低图像语义分割模型的预测准确度。
发明内容
本申请提供一种语义分割模型的优化方法和装置,能够提高图像语义分割模型的预测准确度。
第一方面,本申请提供一种语义分割模型的优化方法,该方法可以用于语义分割模型的优化装置,该方法可以包括:优化装置获得目标图像,该目标图像是基于标注图像和无标注图像得到的;该优化装置将该目标图像输入第一语义分割模型,得到第一输出结果;该优化装置将该无标注图像输入第二语义分割模型,得到第二输出结果,该第二语义分割模型与该第一语义分割模型的模型结构相同;该优化装置基于该目标图像、该第一输出结果和该第二输出结果,对该第一语义分割模型进行优化。
需要说明的是,上述目标图像是对标注图像和无标注图像进行混合后得到的图像。上述标注图像是指图像中包括的每个像素具有标注值,该每个像素的标注值用于指示该每个像素所属的对象类别。上述无标注图像是指图像中包括的每个像素不具有标注值。上述标注值通常为人工标注的真实值。
采用本申请实施例提供的语义分割模型的优化方法,采用第一语义分割模型和第二语义分割模型构成的双模型结构,其中,该第一语义分割模型可以作为学生模型,该第二语义分割模型可以作为教师模型,教师模型的输出结果可以用于辅助和指导学生模型的训练和优化,能够提高学生模型的优化效果,其中,第一语义分割模型输入的目标图像是经过对标注图像和无标注图像进行混合后得到的,能够更深入的挖掘标注图像和无标注图像之间的关联(即能够增强标注图像和无标注图像之间的关联),以降低无标注图像和标注图像之间的分布差异,因此,通过该目标图像优化该第一语义分割模型,可以提高该第一语义分割模型的域适应能力,从而提高该第一语义分割模型的预测准确度。此外,第二语义 分割模型输入的是无标注图像,通过该无标注图像训练该第二语义分割模型,能够减少该第二语义分割模型对标注图像的依赖,并能够降低标注图像的成本。
还需要说明的是,该标注图像、该无标注图像和该目标图像的分辨率相同。
可选地,该标注图像和该无标注图像通常是在类似应用场景或类似环境下采集得到的,即该标注图像和该无标注图像中至少包括共同的部分对象类别。如标注图像中包括的对象类别为车、人、树和楼,未标注图像中包括的对象为车、树和楼。
在一种可能的实现方式中,上述目标图像可以包括该标注图像的部分区域和该无标注图像的部分区域。
可选地,该优化装置可以通过多种方式获得该目标图像,本申请对此不作限定。
在一种可能的实现方式中,该优化装置可以接收其它装置(如图像生成装置)发送的该目标图像。也就是说,该目标图像可以由该图像生成装置生成。
在另一种可能的实现方式中,该优化装置可以基于该标注图像和该无标注图像,生成该目标图像。
可选地,该优化装置可以通过多种方式基于该标注图像和该无标注图像,生成该目标图像,本申请对此不做限定。
在一种可能的实现方式中,该优化装置对该标注图像进行裁剪,得到第一子图像;对该无标注图像进行裁剪,得到第二子图像;对该第一子图像和该第二子图像进行拼接,得到该目标图像。
在另一种可能的实现方式中,该优化装置可以基于第一掩膜提取该标注图像中的第一感兴趣区,得到该第一子图像;基于第二掩膜提取该无标注图像中的第二感兴趣区,得到该第二子图像;对该第一子图像和该第二子图像进行拼接,得到该目标图像,该第一感兴趣区在该第一掩膜中的位置与第二非感兴趣区在该第二掩膜中的位置对应,其中,该第二非感兴趣区为该第二掩膜中除该第二感兴趣区外的区域。
需要说明的是,该第一语义分割模型为预先通过训练好的模型、用于识别C种对象类别,C为大于0的整数。相应地,该目标图像中至少包括该C种对象类别中的部分或全部。
在一种可能的实现方式中,该第一输出结果可以包括第一语义分割图和P个第一特征图,P的取值大于该目标图像的通道数量,该第一特征图的分辨率小于该目标图像的分辨率。
可选地,该第一语义分割模型可以采用卷积神经网络,该卷积神经网络至少包括处理层1、处理层2、处理层3和处理层4。
在一种可能的实现方式中,以该目标图像的尺寸可以为H 1×W 1×T,其中,H 1和W 1均为大于1的整数,T为大于0的整数为例,上述步骤202可以包括:该优化装置通过该处理层1对该目标图像进行特征提取,得到Q个特征图1,该特征图1的分辨率为H 2×W 2,其中,H 2小于H 1,W 2小于W 1,Q大于T;通过该处理层2将Q个特征图1映射至P个特征图2(即P个第一特征图),该特征图2的分辨率为H 2×W 2,其中,P小于Q;通过该处理层3将所述Q个特征图1映射至C个特征图3,该特征图3的分辨率为H 1×W 1,该C个特征图3和该C种对象类别一一对应,该特征图3包括H 1×W 1个置信度,该H 1×W 1个置信度与该目标图像包括的H 1×W 1个像素一一对应,该置信度用于表示该目标图像中对应位置的像素属于该特征图3对应的对象类别的概率;通过处理层4基于该 C个特征图3和该C种对象类别中的每种对象类别的可信阈值1,得到该第一语义分割图,该第一语义分割图的分辨率为H 1×W 1
需要说明的是,在本申请中,Y个分辨率为H×W的特征图可以被称为一个H×W×Y的特征空间,该特征空间包括Y个通道(也即是该特征空间的深度为Y),该Y个通道中的每个通道包括H×W个像素;或者,Y个分辨率为H×W的特征图可以被称为一个H×W×Y的特征矩阵,该特征矩阵包括Y个特征向量,该Y个特征向量中的每个特征向量包括H×W个元素,其中,Y为大于0的整数。
上述处理层1用于对目标图像进行下采样,得到Q个特征图1(即特征空间1),该特征图1的分辨率相比于目标图像的分辨率变低,即处理层1能够减小图像的分辨率,从而减小模型的计算量,提高分类效率;此外,Q大于目标图像的通道数量,即处理层1能够提升特征空间的维度,从而提取图像的高维空间特征。
在一种可能的实现方式中,该处理层1可以包括至少一个卷积层1。
上述处理层2用于将Q个特征图1映射至P个特征图2(即特征空间2),特征图2与特征图1的分辨率相同,但P小于Q,即处理层2能够降低特征空间的维度,以去除图像中的冗余特征,从而减少模型的计算量。
在一种可能的实现方式中,该处理层2可以包括至少一个卷积层2。
上述处理层3用于对Q个特征图1进行上采样,得到C个特征图3(即特征空间3),该特征图3的分辨率与目标图像的分辨率相同,即处理层3能够还原出该目标图像的全分辨率,从而恢复出目标图像中更多的细节特征。
在一种可能的实现方式中,该处理层3可以包括至少一个反卷积层和最大值函数层。
需要说明的是,上面仅示意性介绍各处理层的结构,但本申请不限于此。可选地,各处理层还可以包括能够实现各自功能的其他操作层,本申请实施例对此不做限定。
例如，上述处理层1还可以包括至少一个池化层，池化层一方面可以使特征图的宽度和高度变小，通过减少特征图数据量降低卷积神经网络的计算复杂度；另一方面可以进行特征压缩，提取图像的主要特征。
在一种可能的实现方式中,该优化装置可以通过该处理层4确定该C个特征图3中同一位置的像素的最大置信度,若该最大置信度大于或等于该最大置信度所属的特征图3所对应的对象类别的可信阈值1,则确定该第一语义分割图中对应位置的像素属于该最大置信度所属的特征图3所对应的对象类别。
还需要说明的是,该第二语义分割模型为预先通过训练好的模型、用于识别该C种对象类别。相应地,该无标注图像中至少包括该C种对象类别中的部分或全部。
需要说明的是,本申请中所述的第一语义分割模型与第二语义分割模型的模型结构相同包括:第一,这两个模型的功能相同,即都用于识别该C种对象类别;第二,这两个模型使用的卷积神经网络的网络结构相同,即包括相同的处理层数量、处理层种类以及每个处理层的功能都相同。这两个模型的区别在于这两个模型中的处理层设置的参数可能不一样,如第一语义分割模型中卷积核的权重值和第二语义分割模型中卷积核的权重值不同。
在一种可能的实现方式中,该第二输出结果可以包括第二语义分割图和P个第二特征图,该第二特征图的分辨率小于该目标图像的分辨率。
在一种可能的实现方式中,该优化装置可以对该无标注图像进行特征提取,得到Q个 第三特征图,该第三特征图的分辨率为H 2×W 2;将该Q个第三特征图映射至该P个第二特征图,该第二特征图的分辨率为H 2×W 2;将该Q个第三特征图映射至C个第四特征图,该第四特征图的分辨率为H 1×W 1,该C个第四特征图和该C种对象类别一一对应,该第四特征图包括H 1×W 1个置信度,该H 1×W 1个置信度与该无标注图像包括的H 1×W 1个像素一一对应,该置信度用于表示该无标注图像中对应位置的像素属于该第四特征图对应的对象类别的概率;基于该C个第四特征图和该C种对象类别中的每种对象类别的第一可信阈值,得到该第二语义分割图,该第二语义分割图的分辨率为H 1×W 1
可选地,该第一输出结果可以包括第一语义分割图和P个第一特征图,该第二输出结果可以包括第二语义分割图和P个第二特征图。
相应地,该优化装置基于目标图像,该第一输出结果和该第二输出结果,对该第一语义分割模型进行优化可以包括:该优化装置基于该目标图像、该第一语义分割图、该P个第一特征图、该第二语义分割图和该P个第二特征图,对该第一语义分割模型进行优化。
在一种可能的实现方式中,该优化装置可以基于该P个第一特征图、该第二语义分割图、该P个第二特征图和第一损失函数,迭代调整模型的参数,该第一损失函数用于缩小属于相同对象类别的像素之间的距离和/或拉长属于不同对象类别的像素之间的距离。
采用本申请实施例提供的语义分割模型的优化方法,通过教师模型输出的第二语义分割图,可以指导学生模型对该P个第一特征图和该P个第二特征图进行对比学习,从而拉近不同类别的像素之间的距离,并拉远相同类别的像素之间的距离,以保证属于同一个类别的像素特征编码尽可能的相似,不同类别像素特征的编码尽可能不相似,因此,可以提高该学生模型分割类内的紧致性以及类间的差异性,从而提高学生模型的预测准确度。
在另一种可能的实现方式中,该优化装置可以基于该目标图像、该第一语义分割图和第二损失函数,迭代调整该第一语义分割模型的参数,该第二损失函数用于约束相同像素所属的对象类别的预测值和标注值的一致性。
采用本申请实施例提供的语义分割模型的优化方法,学生模型输入的目标图像中包括标注图像的部分图像区域,相应地,该学生模型输出的第一语义分割图中也包括与该部分图像区域对应的图像区域,通过对目标图像和第一语义分割图中相同像素的真实值和预测值进行一致性约束,能够提高学生模型的预测准确度。
在又一种可能的实现方式中,该优化装置可以基于该第一语义分割图、该第二语义分割图和第三损失函数,迭代调整该第一语义分割模型的参数,该第三损失函数用于约束该第一语义分割模型和该第二语义分割模型对相同像素所属的对象类别的预测结果的一致性。
采用本申请实施例提供的语义分割模型的优化方法,学生模型输入的目标图像中包括无标注图像的部分图像区域,相应地,教师模型输入的无标注图像也包括与该部分图像区域对应的图像区域,通过对学生模型和教师模型对相同像素所属的对象类别的预测结果进行一致性约束,能够提高学生模型的预测准确度。
需要说明的是,由于该第二语义分割模型采用的是无监督学习的训练方法,预测结果的可靠性较差,基于该第二语义分割模型输出的第二语义分割图对该第一语义分割模型进行优化的效果较差。
因此,为提高该第二语义分割模型的预测结果可靠性,该优化装置可以基于该C种对 象类别中的每种对象类别的第一可信阈值和该每种对象类别的第二可信阈值,得到该每种对象类别的目标可信阈值,其中,该每种对象类别的第一可信阈值为本轮迭代过程中该第二语义分割模型使用的可信阈值,该每种对象类别的第二可信阈值为上一轮迭代过程中该第二语义分割模型使用的可信阈值;基于该C个第四特征图和该每种对象类别的目标可信阈值,得到第三语义分割图,该第三语义分割图的分辨率为H 1×W 1;基于该目标图像、该第一语义分割图、该P个第一特征图、该第三语义分割图和该P个第二特征图,对该第一语义分割模型进行优化。
示例的，可以通过如下公式得到该目标可信阈值Th′：
Th′ = α·Th_{t-1} + (1-α)·Th_t
其中，α表示权重系数，Th_{t-1}表示上一轮迭代过程中该第二语义分割模型使用的可信阈值（即第二可信阈值），Th_t表示本轮迭代过程中该第二语义分割模型使用的可信阈值（即第一可信阈值）。
进一步地,该优化装置可以将该第二语义分割模型本轮使用的可信阈值由该第一可信阈值更新为该目标可信阈值。
采用本申请实施例提供的语义分割模型的优化方法,由于该第二语义分割模型采用的是无监督学习的训练方法,预测结果的可靠性较差,因此,该优化装置基于上一轮迭代过程中该第二语义分割模型使用的可信阈值和本轮迭代过程中该第二语义分割模型使用的可信阈值动态更新各对象类别的可信阈值,以保证各对象类别的可信阈值始终在一个合理的数值范围内,进一步地,可以基于更新后各对象类别的可信阈值对该第二语义分割图中的预测结果进行筛查以筛除掉该第二语义分割图中的可靠性较差的预测结果,得到第三语义分割图,基于该第三语义分割图对该第一语义分割模型进行优化,有利于提高该第一语义分割模型的可靠性。
进一步地,该优化装置可以基于该目标图像、该第一语义分割图、该P个第一特征图、该第三语义分割图和该P个第二特征图,对该第一语义分割模型进行优化。具体可以参考上述基于该目标图像、该第一语义分割图、该P个第一特征图、该第三语义分割图和该P个第二特征图,对该第一语义分割模型进行优化的介绍,此处不再赘述。
可选地,该优化方法还可以包括:该优化装置向语义分割装置发送优化后的第一语义分割模型,即第一优化语义分割模型。
可选地,该优化装置可以通过多种方式向该语义分割装置发送该第一优化语义分割模型,本申请对此不做限定。
在一种可能的实现方式中,该优化装置可以基于预设的周期,周期性向该语义分割装置发送该第一优化语义分割模型。也就是说,该优化装置可以定期向该语义分割装置更新优化后的第一语义分割模型。
在另一种可能的实现方式中,该优化装置可以接收来自该语义分割装置的请求信息,该请求信息用于请求对该第一语义分割模型进行优化;该优化装置基于该请求信息,将该第一优化语义分割模型发送至该语义分割装置。
第二方面,本申请还提供一种语义分割方法,该方法可以用于语义分割装置,该方法可以包括:获得待处理图像;将该待处理图像输入第一优化语义分割模型,得到该待处理图像的语义分割图。
可选地,该语义分割装置可以通过多种方式获得该待处理图像,本申请对此不做限定。
在一种可能的实现方式中,该语义分割装置获得该待处理图像可以包括:该语义分割装置接收摄像装置发送的该待处理图像。相应地,摄像装置采集该待处理图像,并发送至该语义分割装置。
在另一种可能的实现方式中，该语义分割装置可以接收来自其他图像采集装置的该待处理图像，该其他图像采集装置用于采集该待处理图像。
可选地,在该将该待处理图像输入第一优化语义分割模型之前,该语义分割装置可以获得该第一优化语义分割模型。
可选地,该语义分割装置可以通过多种方式获得该第一优化语义分割模型,本申请对此不做限定。
在一种可能的实现方式中,该语义分割装置可以基于预设的周期,周期性接收来自该优化装置发送的该第一优化语义分割模型。也就是说,该语义分割装置可以定期接收该优化装置更新的优化后的第一语义分割模型。
在另一种可能的实现方式中,该语义分割装置可以向语义分割模型的优化装置发送请求信息,该请求信息用于请求对第一语义分割模型进行优化;并接收该语义分割模型的优化装置发送的该第一优化语义分割模型。
需要说明的是,上述第一优化语义分割模型为采用第一方面提供的优化方法对该第一语义分割模型进行优化后得到的,因此,通过基于该第一优化语义分割模型对待处理图像进行语义分割,能够提高语义分割的准确度。
第三方面，本申请还提供一种语义分割方法，该语义分割方法可以用于语义分割系统，该语义分割系统可以包括：优化装置和语义分割装置；该方法可以包括：优化装置获得目标图像，该目标图像是基于标注图像和无标注图像得到的；该优化装置将该目标图像输入第一语义分割模型，得到第一输出结果；该优化装置将该无标注图像输入第二语义分割模型，得到第二输出结果，该第二语义分割模型与该第一语义分割模型的模型结构相同；该优化装置基于该目标图像、该第一输出结果和该第二输出结果，对该第一语义分割模型进行优化，得到第一优化语义分割模型；该优化装置向该语义分割装置发送该第一优化语义分割模型；该语义分割装置获得待处理图像；该语义分割装置将该待处理图像输入该第一优化语义分割模型，得到该待处理图像的语义分割图。
可选地,该语义分割系统还可以包括显示装置,该方法还可以包括该语义分割装置向该显示装置发送该待处理图像的语义分割图;相应地,该显示装置显示该语义分割图。
需要说明的是,上述优化装置执行的步骤可以参考第一方面中的相关介绍,上述语义分割装置执行的步骤可以参考第二方面中的相关介绍,此处不再赘述。
第四方面，本申请还提供一种语义分割模型的优化装置，该优化装置可以包括获得模块、第一语义分割模块、第二语义分割模块和优化模块；该获得模块，用于获得目标图像，该目标图像是基于标注图像和无标注图像得到的；该第一语义分割模块，用于将该目标图像输入第一语义分割模型，得到第一输出结果；该第二语义分割模块，用于将该无标注图像输入第二语义分割模型，得到第二输出结果，该第二语义分割模型与该第一语义分割模型的模型结构相同；该优化模块，用于基于该目标图像、该第一输出结果和该第二输出结果，对该第一语义分割模型进行优化。
在一种可能的实现方式中,该第一输出结果包括第一语义分割图和P个第一特征图,P的取值大于该目标图像的通道数量,该第一特征图的分辨率小于该目标图像的分辨率,该第二输出结果包括第二语义分割图和P个第二特征图,该第二特征图的分辨率与该第一特征图的分辨率相同,该第一语义分割图的分辨率和该第二语义分割图的分辨率均与该目标图像的分辨率相同;该优化模块具体用于基于该目标图像、该第一语义分割图、该P个第一特征图、该第二语义分割图和该P个第二特征图,对该第一语义分割模型进行优化。
在一种可能的实现方式中,该无标注图像的尺寸为H 1×W 1×T,该第二语义分割模型用于识别C种对象类别,其中,H 1和W 1均为大于1的整数,T为大于0的整数,C为大于0的整数,该第二语义分割模块具体用于:对该无标注图像进行特征提取,得到Q个第三特征图,该第三特征图的分辨率为H 2×W 2,其中,H 2小于H 1,W 2小于W 1,Q大于T;将该Q个第三特征图映射至该P个第二特征图,该第二特征图的分辨率为H 2×W 2,其中,P小于Q;将该Q个第三特征图映射至C个第四特征图,该第四特征图的分辨率为H 1×W 1,该C个第四特征图和该C种对象类别一一对应,该第四特征图包括H 1×W 1个置信度,该H 1×W 1个置信度与该无标注图像包括的H 1×W 1个像素一一对应,该置信度用于表示该无标注图像中对应位置的像素属于该第四特征图对应的对象类别的概率;基于该C个第四特征图和该C种对象类别中的每种对象类别的第一可信阈值,得到该第二语义分割图,该第二语义分割图的分辨率为H 1×W 1
在一种可能的实现方式中,该优化装置还包括阈值更新模块,该阈值更新模块用于基于该每种对象类别的第一可信阈值和该每种对象类别的第二可信阈值,得到该每种对象类别的目标可信阈值,其中,该每种对象类别的第一可信阈值为本轮迭代过程中该第二语义分割模型使用的可信阈值,该每种对象类别的第二可信阈值为上一轮迭代过程中该第二语义分割模型使用的可信阈值;基于该C个第四特征图和该每种对象类别的目标可信阈值,得到第三语义分割图,该第三语义分割图的分辨率为H 1×W 1;该优化模块具体用于基于该目标图像、该第一语义分割图、该P个第一特征图、该第三语义分割图和该P个第二特征图,对该第一语义分割模型进行优化。
在一种可能的实现方式中,该优化模块具体用于:基于该P个第一特征图、该第三语义分割图、该P个第二特征图和第一损失函数,迭代调整模型的参数,该第一损失函数用于缩小属于相同对象类别的像素之间的距离和/或拉长属于不同对象类别的像素之间的距离;基于该目标图像、该第一语义分割图和第二损失函数,迭代调整该第一语义分割模型的参数,该第二损失函数用于约束相同像素所属的对象类别的预测值和标注值的一致性;基于该第一语义分割图、该第三语义分割图和第三损失函数,迭代调整该第一语义分割模型的参数,该第三损失函数用于约束该第一语义分割模型和该第二语义分割模型对相同像素所属的对象类别的预测结果的一致性。
在一种可能的实现方式中,该目标图像包括该标注图像的部分区域和该无标注图像的部分区域。
在一种可能的实现方式中,该获得模块具体用于:对该标注图像进行裁剪,得到第一子图像;对该无标注图像进行裁剪,得到第二子图像;对该第一子图像和该第二子图像进行拼接,得到该目标图像。
第五方面,本申请还提供一种语义分割装置,该装置可以包括获得模块和语义分割模 块,该获得模块用于获得待处理图像;该语义分割模块用于将该待处理图像输入第一优化语义分割模型,得到该待处理图像的语义分割图。
可选地,该获得模块可以通过多种方式获得该待处理图像,本申请对此不做限定。
在一种可能的实现方式中,该获得模块具体用于接收摄像装置发送的该待处理图像。相应地,摄像装置采集该待处理图像,并发送至该获得模块。
在另一种可能的实现方式中,该获得模块可以接收来自其他图像采集装置的该待处理图像,该其他图像采集装置用于采集该待处理图像。
可选地,在该语义分割模块将该待处理图像输入第一优化语义分割模型之前,该语义分割模块可以获得该第一优化语义分割模型。
可选地,该语义分割模块可以通过多种方式获得该第一优化语义分割模型,本申请对此不做限定。
在一种可能的实现方式中,该语义分割模块可以基于预设的周期,周期性接收来自该优化装置发送的该第一优化语义分割模型。也就是说,该语义分割模块可以定期接收该优化装置更新的优化后的第一语义分割模型。
在另一种可能的实现方式中,该语义分割模块可以向语义分割模型的优化装置发送请求信息,该请求信息用于请求对第一语义分割模型进行优化;并接收该语义分割模型的优化装置发送的该第一优化语义分割模型。
第六方面,本申请还提供一种语义分割系统,该系统可以包括上述第一方面或其任意可能的实现方式中所述的语义分割模型的优化装置。
可选地,该系统还可以包括上述第二方面或其任意可能的实现方式中所述的语义分割装置。
可选地,该系统还可以包括图像采集装置和显示装置。
第七方面,本申请还提供一种终端,该终端可以包括上述第六方面中所述的语义分割系统。
可选地,该终端可以为车辆。
第八方面,本申请还提供一种语义分割模型的优化装置,该优化装置可以包括通信接口和处理器,该通信接口与该处理器耦合,该通信接口用于为该处理器提供信息和/或数据,该处理器用于运行计算机程序指令以执行上述第一方面或其任意可能的实现方式中所述的优化方法。
可选地,该优化装置还可以包括至少一个存储器,所述存储区用于存储该程序代码或指令。
可选地,该优化装置可以为芯片或集成电路。
第九方面,本申请还提供一种语义分割装置,该装置可以包括通信接口和处理器,该通信接口与该处理器耦合,该通信接口用于为该处理器提供信息和/或数据,该处理器用于运行计算机程序指令以执行上述第二方面或其任意可能的实现方式中所述的方法。
可选地,该装置还可以包括至少一个存储器,所述存储区用于存储该程序代码或指令。
可选地,该装置可以为芯片或集成电路。
第十方面,本申请还提供一种计算机可读存储介质,其特征在于,用于存储计算机程序,该计算机程序被处理器运行时,实现上述第一方面及其任意可能的实现方式中所述的 优化方法,和/或,实现上述第二方面及其任意可能的实现方式中所述的方法。
第十一方面,本申请还提供一种计算机程序产品,其特征在于,当该计算机程序产品在处理器上运行时,实现上述第一方面及其任意可能的实现方式中所述的优化方法,和/或,实现上述第二方面及其任意可能的实现方式中所述的方法。
本申请提供的语义分割模型的优化装置、系统、计算机存储介质、计算机程序产品、芯片和终端均用于执行上文所提供的语义分割模型的优化方法,因此,其所能达到的有益效果可参考上文所提供的语义分割模型的优化方法中的有益效果,此处不再赘述。
本申请提供的语义分割装置、计算机存储介质、计算机程序产品、芯片和终端均用于执行上文所提供的语义分割方法,因此,其所能达到的有益效果可参考上文所提供的语义分割方法中的有益效果,此处不再赘述。
附图说明
图1是图像的尺寸示意图;
图2是卷积层实现卷积操作过程的示意图;
图3是本申请实施例提供的通过掩码提取待处理图像的感兴趣区的流程示意图;
图4是本申请实施例提供的语义分割处理的示意图;
图5是本申请实施例提供的语义分割系统100的示意性框图;
图6是本申请实施例提供的应用场景示意图;
图7是本申请实施例提供的语义分割模型的优化方法200的示意性流程图;
图8是本申请实施例提供的标注图像的示意图;
图9是本申请实施例提供的无标注图像的示意图;
图10是本申请实施例提供的目标图像的示意图;
图11是本申请实施例提供的另一目标图像的示意图;
图12是本申请实施例提供的通过第一掩膜提取标注图像的第一感兴趣区的流程示意图;
图13是本申请实施例提供的通过第二掩膜提取无标注图像的第二感兴趣区的流程示意图;
图14是本申请实施例提供的第一语义分割模型对目标图像进行语义分割的流程示意图;
图15是本申请实施例提供的处理层4的处理流程示意图;
图16是本申请实施例提供的语义分割方法300的示意性流程图;
图17是本申请实施例提供的语义分割模型的优化装置400的示意性框图;
图18是本申请实施例提供的语义分割模型的优化方法的流程示意图;
图19是本申请实施例提供的语义分割模型的优化装置500的示意性框图;
图20是本申请实施例提供的语义分割装置600的示意性框图;
图21是本申请实施例提供的语义分割装置700的示意性框图。
具体实施方式
下面将结合本申请中的附图,对本申请中的技术方案进行描述。
本申请的说明书实施例和权利要求书及附图中的术语“第一”、“第二”等仅用于区分描述的目的,而不能理解为指示或暗示相对重要性,也不能理解为指示或暗示顺序。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元。方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
应当理解,在本申请中,“至少一个(项)”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,用于描述关联对象的关联关系,表示可以存在三种关系,例如,“A和/或B”可以表示:只存在A,只存在B以及同时存在A和B三种情况,其中A,B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项(个)”或其类似表达,是指这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b或c中的至少一项(个),可以表示:a、b、c、“a和b”、“a和c”、“b和c”、或“a和b和c”,其中a,b,c可以是单个,也可以是多个。
本申请的实施方式部分使用的术语仅用于对本申请的具体实施例进行解释,而非旨在限定本申请。下面先对本申请实施例可能涉及的一些概念进行简单介绍。
1.像素(pixel)
像素是组成图像的最基本的元素,是一种逻辑尺寸单位。
2.图像的尺寸
图像的尺寸包括图像的宽度、高度和深度(depth,D)。
图像的高度可以理解为该图像在高度方向上包括的像素的数量。
图像的宽度可以理解为该图像在宽度方向上包括的像素的数量。
图像的深度可以理解为该图像包括的通道的数量,其中,图像的各通道的高度和高度都相同。
示例的,一个图像的尺寸为H×W×M,是指该图像包括M个通道,该M个通道中的每个通道的高度为H个像素、宽度为W个像素,其中,H、W和M均为大于0的整数。
还需要说明的是,图像的宽度和高度也称为图像的分辨率。
示例的,一个图像的高度为H个像素、宽度为W个像素,也被称为该图像的分辨率为H×W。
示例的,图1示出了一个尺寸为5×5×3的图像,如图1所示,该图像包括3通道,如图1中所示的红色(red,R)通道、绿色(green,G)通道和蓝色(blue,B)通道,其中,R通道、G通道和B通道的分辨率均为5×5,即每个通道的宽度为5个像素,高度为5个像素。
需要说明的是,图1中仅以深度为3的RGB图像为例进行描述,图像的深度还可以为其它取值,示例的,灰度图像的深度为1,RGB-D图像的深度为4。
3.卷积核
卷积核是一种滤波器,用于提取图像的特征图。卷积核的尺寸包括宽度、高度和深度,其中,卷积核的深度与输入图像的深度相同。对一个输入图像使用多少种不同的卷积核进行卷积操作,就可以提取多少个不同的特征图。
例如,采用一个5×5×3的卷积核对7×7×3的输入图像进行卷积操作,可以得到一个输出特征图,采用多个不同的5×5×3的卷积核对7×7×3的输入图像进行卷积操作,可以 得到多个不同的输出特征图。
4.卷积步长
卷积步长是指卷积核在输入图像的特征图上滑动提取该输入图像的特征图的过程中,该卷积核在高度方向和宽度方向上执行两次卷积操作之间滑动的距离。
应理解,卷积步长可以决定输入图像的下采样倍率,例如,在宽度(或高度)方向上的卷积步长为B,可以使输入特征图在宽度(或高度)方向上实现B倍的下采样,B为大于1的整数。
5.卷积层(convolutional layer)
卷积层主要是基于设定的卷积核、卷积步长等参数,对输入图像进行卷积操作,以提取该输入图像的特征。
可选地,可以通过设置不同大小的卷积核、不同权重值或以不同的卷积步长对同一个图像进行多次卷积,以尽可能多的抽取该图像的特征。
需要说明的是,在使用一个K×K的卷积核对一个深度为1的输入图像进行卷积操作时,将卷积核在该图像上滑动时覆盖的K×K的图像块与卷积核做点乘,即图像块上每个点的灰度值与卷积核上相同位置的权重值相乘,共得到K×K个结果,累加后加上偏置,得到一个结果,输出为输出图像的单一像素,该像素在该输出图像上的坐标位置对应该图像块的中心在该输入图像上的坐标位置,其中,K为大于0的整数。
还需要说明的是,在使用卷积核对一个深度为N的输入图像进行卷积操作时,该卷积核的深度也需为N,其中,N为大于0的整数。该输入图像与卷积核的卷积操作,可以转化为将深度为N的输入图像和深度为N的卷积核在深度维度拆分为N个深度为1的图像分别与N个深度为1的卷积核进行卷积操作,最终在图像深度这一维度进行累加,最终获得一个输出图像。
还应理解,在卷积神经网络中,卷积层的输出图像通常包括多个特征图,一个深度为N的卷积核对深度为N的输入图像进行卷积操作后得到一个特征图,因此,如果想要获得多少个特征图就需要通过多少个深度为N的卷积核分别对输入图像进行卷积操作。
示例的,图2示出了卷积层实现对输入图像进行卷积操作的过程,输入图像的尺寸为5×5×3,为该输入图像的高度边界和宽度边界均填充1个像素后得到7×7×3的图像,卷积操作包括在宽度方向和高度方向上采用卷积核w0进行卷积步长为2的卷积,卷积核w0的尺寸为3×3×3,将该输入图像的3个通道(即通道1、通道2和通道3)分别与卷积核的三层深度(卷积核w0-1、卷积核w0-2和卷积核w0-3)进行卷积,得到特征图1,该特征图1的尺寸为3×3×1。
具体地，w0的第一层深度（即w0-1）和通道1黑色方框中对应位置的元素相乘再求和得到0，同理，卷积核w0的其他两个深度（即w0-2和w0-3）分别与通道2和通道3进行卷积操作，得到2和0，则图2中输出特征图1的第一个元素为0+2+0=2。经过卷积核w0的第一次卷积操作后，黑色方框先沿着各通道的宽度方向上滑动，再沿着高度方向上滑动，每滑动一次进行一次卷积操作，其中，每次滑动的距离为2（即宽度和高度方向上的卷积步长均为2），直到完成对该输入图像的卷积操作，得到3×3×1的特征图1。
可选地,若卷积操作还包括在宽度方向和高度方向上采用卷积核w1进行卷积步长为2的卷积,基于与卷积核w0类似的流程,可以得到3×3×1的特征图2。
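As a minimal illustrative sketch (not part of the original disclosure), the NumPy snippet below reproduces the multiply-accumulate and sliding-window behaviour of the stride-2 convolution described around FIG. 2. The input values, kernel weights, and bias are placeholder assumptions; only the shapes (5×5×3 input, 1-pixel zero padding, one 3×3×3 kernel, stride 2, 3×3×1 output) follow the text.

```python
# Illustrative sketch only: direct NumPy implementation of the strided
# convolution of FIG. 2 (5x5x3 input, 1-pixel zero padding, one 3x3x3
# kernel w0, stride 2 in height and width -> 3x3x1 output feature map).
# Input values, kernel weights and bias are placeholders.
import numpy as np

def conv2d_single_kernel(x, w, bias=0.0, stride=2, pad=1):
    """x: (H, W, D) input image; w: (K, K, D) kernel; returns (H_out, W_out)."""
    x = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))          # zero-pad H and W only
    K = w.shape[0]
    H_out = (x.shape[0] - K) // stride + 1
    W_out = (x.shape[1] - K) // stride + 1
    out = np.zeros((H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            # element-wise product over the KxKxD block covered by the kernel,
            # summed over all positions and all D depth slices, plus the bias
            block = x[i * stride:i * stride + K, j * stride:j * stride + K, :]
            out[i, j] = np.sum(block * w) + bias
    return out

x = np.random.randint(0, 3, size=(5, 5, 3)).astype(float)   # toy 5x5x3 input
w0 = np.random.randn(3, 3, 3)                                # one 3x3x3 kernel
feat1 = conv2d_single_kernel(x, w0, stride=2, pad=1)
print(feat1.shape)  # (3, 3) -- one 3x3x1 feature map per kernel
```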
6.反卷积层(deconvolution layer)
反卷积层也称反置卷积层(transposed convolution layer),通过设定反卷积步长可以决定输入图像的上采样倍率,例如,在宽度(或高度)方向上的卷积步长为A,可以使输入特征图在宽度(或高度)方向上实现A倍的上采样,A为大于1的整数。
应理解,反卷积操作可以理解为如图2所示的卷积操作的逆过程。
7.标注图像
标注图像是指图像中的每个像素具有标注值,像素的标注值用于表示该像素所属的对象类别,标注图像中的标注值为人工标注的,也即是真实值。
8.无标注图像
无标注图像是指图像中的每个像素不具有标注值。
9.掩膜(mask)
掩膜用于提取待处理图像中的感兴趣区或遮挡该待处理图像中的非感兴趣区。掩膜通常为一个二值化图像,即掩膜中的每个像素值为“0”或1,其中,感兴趣区内的像素值为“1”,非感兴趣区内的像素值为“0”。
用掩膜提取待处理图像的感兴趣区的原理是:将待处理图像中的每个像素值与掩膜中对应位置的像素值相乘,该待处理图像的感兴趣区内的像素值保持不变,而感兴趣区外(即非感兴趣区内)的像素值都为0,这样就可以提取待处理图像的感兴趣区。
请参考图3，图3示出了通过掩膜提取待处理图像的感兴趣区的流程示意图。以待处理图像的第1行第1列的位置（即位置1）处的像素值为例，如图3中的“■”所示，待处理图像的位置1处的像素值为1，掩膜的位置1处的像素值为0，处理后位置1处的像素值为1×0=0。同理，可以对该待处理图像中其他位置的像素值进行类似处理，得到效果图，效果图中的感兴趣区如图3中所示。
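As an illustrative sketch (not part of the original disclosure), the NumPy snippet below shows the mask-multiplication principle described above; the image and mask contents are placeholder assumptions.

```python
# Illustrative sketch only: extracting a region of interest by element-wise
# multiplication with a binary mask. Pixels outside the ROI become 0.
import numpy as np

image = np.arange(1, 26).reshape(5, 5)   # toy 5x5 single-channel image
mask = np.zeros((5, 5), dtype=np.uint8)
mask[1:4, 1:4] = 1                       # region of interest = 1, elsewhere = 0

roi = image * mask                       # per-pixel product keeps only the ROI
print(roi)
```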
10.语义分割
语义分割是指像素级地识别图像,语义分割的目标是预测出待处理图像中每个位置处的像素所属的对象类别,并通过不同的标签值对该待处理图像中属于不同对象类别的像素进行标注。
待处理图像的语义分割结果通常通过语义分割图表示,该语义分割图与该待处理图像的分辨率相同,该语义分割图中每个位置处的标签值用于表示该待处理图像中对应位置处的像素所属的对象类别,其中,语义分割图中的标签值为预测值。
请参考图4,图4示出了语义分割处理的示意图,对如图4中的(a)所示的待处理图像进行语义分割处理之后,可以得到如图4中的(b)所示的语义分割图,其中,该语义分割图中标签值为1的位置表示待处理图像中对应位置的像素所属的对象类别为树,标签值为2的位置表示待处理图像中对应位置的像素所属的对象类别为道路,标签值为3的位置表示待处理图像中对应位置的像素所属的对象类别为天空,标签值为4的位置表示待处理图像中对应位置的像素所属的对象类别为楼房,标签值为5的位置表示待处理图像中对应位置的像素所属的对象类别为云,标签值为6的位置表示待处理图像中对应位置的像素所属的对象类别为汽车,标签值为7的位置表示待处理图像中对应位置的像素所属的对象类别为地面,标签值为8的位置表示待处理图像中对应位置的像素所属的对象类别为人。
11.卷积神经网络模型
卷积神经网络模型本质上是一种输入到输出的映射,它能够学习大量的输入与输出之间的映射关系,而不需要任何输入和输出之间的精确的数学表达式,在收集好训练样本后,对神经网络模型加以训练,神经网络模型就具有输入输出对之间的映射能力。
12.语义分割模型
语义分割模型是一种神经网络模型,该神经网络模型用于对输入图像进行语义分割处理,得到输出结果,该输出结果为该输入图像的语义分割图。
语义分割模型可以采用卷积神经网络,该卷积神经网络采用编码器-解码器架构,编码器通过卷积层逐渐增加输入图像的空间维度(即图像的特征图数量或通道数量),如可以通过卷积层对输入进行一次或多次下采样,提取输入图像的高层语义特征。相应地,解码器在高层语义特征上通过反卷积层进行一次或多次上采样,逐渐恢复输入图像的细节和空间维度,最终输出与输入图像分辨率一致的语义分割图。
13.损失函数(loss function)
损失函数是用来估量模型的预测值与真实值的不一致程度,它是一个非负实值函数。损失函数值越小,模型的鲁棒性就越好。一个最佳化问题的目标是将损失函数值最小化。模型优化的过程是指,通过迭代调整模型的参数,使得模型的损失函数值最小化。
现有技术中,通常利用大量标注图像作为训练样本对语义分割模型进行训练,训练样本越多,得到的语义分割模型的准确度越高。
然而,由于目前标注图像都是由人工标注的,需要耗费大量的人力和财力,因此,人工标注的训练样本的数量非常有限,这样就会导致训练得到的语义分割模型的泛化性能较差,从而降低图像语义分割模型的预测准确度。
基于此,已有方案中提出采用半监督学习的方式训练语义分割模型,即通过少量标注图像结合大量无标注图像对语义分割模型进行训练,有效的挖掘标注图像与无标注图像之间的关联,从而提高语义分割模型的泛化性能。
然而,在已有方案中,若训练数据集(包括训练样本)和测试集(包括实际待处理图像)有着巨大差异时,很容易出现过拟合的现象,使得语义分割模型在测试集上表现不理想。也就是说,当训练数据集和测试数据集分布不一致的情况下,通过在训练数据集上按经验误差最小准则训练得到的模型在测试数据集上性能不佳,因此,在一个场景下的训练数据集训练得到的语义分割模型,并不能很好的适应另一个场景的数据,即语义分割模型的域适应能力较差。
示例的,假设训练数据集中的训练样本包括各种家用小轿车,而想训练得到可以识别厢货车的语义分割模型,该语义分割模型相比于家用小轿车的识别来说,预测的准确度较低。
综上所述,已有方案中采用半监督学习方法训练得到的语义分割模型的域适应性较差,导致预测准确度较低,从而语义分割模型的泛化性能较差。
基于此,本申请提供一种语义分割模型的优化方法和装置,通过对训练数据集中标注图像和无标注图像进行数据增强,以降低标注图像和无标注图像之间的分布差异,并基于数据增强后的训练数据集对语义分割模型进行优化,能够提高语义分割模型的预测准确度。此外,本申请还提供一种语义分割方法和装置,能够提高语义分割的准确度。
请参考图5,图5示出了本申请实施例提供的语义分割方法和语义分割模型的优化方 法所应用的语义分割系统100的示意性框图。如图5所示,系统100可以包括语义分割模型的优化装置110,优化装置110中包括第一语义分割模型。
优化装置110用于采用本申请提供的语义分割模型的优化方法,基于训练数据集(包括多个训练样本),对该第一语义分割模型进行优化,得到第一优化语义分割模型。
可选地,系统100还可以包括语义分割装置120,语义分割装置120可以与优化装置110通信。
优化装置110还用于将该第一优化语义分割模型发送至语义分割装置120。
语义分割装置120用于将待处理图像输入该第一优化语义分割模型,得到该待处理图像的语义分割图。
可选地,语义分割装置120和优化装置110可以为同一个装置,该装置既可以采用本申请提供优化方法对该第一语义分割模型进行优化,也可以通过优化后的第一优化语义分割模型对待处理图像进行语义分割。
可选地,系统100还可以包括摄像装置130和/或显示装置140,其中,摄像装置130可以分别与优化装置110和语义分割装置120通信,显示装置140可以与语义分割装置120通信。
摄像装置130用于拍摄该训练数据集中的样本图像,将该样本图像发送至优化装置110。
摄像装置130还用于拍摄该待处理图像,并将该待处理图像发送至语义分割装置120。
语义分割装置120还用于将该待处理图像的语义分割图发送至显示装置140。
显示装置140用于呈现该待处理图像的语义分割图。
可选地,本申请对优化装置110、语义分割装置120、摄像装置130和显示装置140的具体形态不作限定。
在一种可能的实现方式中,优化装置110、语义分割装置120、摄像装置130和显示装置140可以分别为单独的设备(或分别设置在不同的设备中)。
在另一种可能的实现方式中,优化装置110、语义分割装置120、摄像装置130和显示装置140中的一个或多个装置可以设置在同一个设备中,剩余一个或多个装置分别为单独的设备(或分别设置在不同的设备中)。
在又一种可能的实现方式中,优化装置110、语义分割装置120、摄像装置130和显示装置140均设置在同一个设备中,本申请实施例对此不做限定。
可选地,摄像装置130可以为摄像头或摄像头模组。示例的,摄像装置130可以包括静态摄像头和/或视频摄像头,用于采集样本图像和/或待处理图像。
可选地,显示装置140可以为显示屏。示例的,显示装置140可以为触摸显示屏,用于车辆与用户交互。如该车辆可以通过该触摸显示屏获得用户输入的信息;或者,该车辆可以通过该触摸显示屏向用户呈现显示界面(如语义分割图)。
可选地,上述系统100可以用于多种场景或领域,本申请对此不做限定。
在一种可能的实现方式中,系统100可以用于自动驾驶、辅助驾驶或无人驾驶的场景或领域,能够很好地对所在环境的场景图进行分割,输出更加真实的场景图,并使得自动驾驶系统可以做出更加安全可靠的行驶操作。
在另一种可能的实现方式中,系统100可以用于监控或安防的场景或领域,能够对监 控区域内的人类进行分割,并基于分割结果进行目标跟踪、姿态分析预警等。
在又一种可能的实现方式中,系统100可以用于医疗的场景或领域,能够对医学图像中的各种器官进行分割,并基于分割结果进行对应独立器官三维的虚拟现实技术(virtual reality,VR)显示,以进行手术导航。
示例的,图6示出了本申请实施例提供的系统100所应用的场景图。如图6所示,语义分割装置120、摄像装置130和显示装置140可以设置在车辆中,优化装置110可以设置在云端的服务器中。
示例的,上述系统100可以通过以下流程实现对待处理图像进行语义分割。
语义分割装置120向优化装置110发送请求信息,该请求信息用于请求对第一语义分割模型进行优化。
该优化装置110基于该请求信息,采用本申请提供的优化方法对该第一语义分割模型进行优化,得到第一优化语义分割模型;向该语义分割装置120发送该第一优化语义分割模型。
该摄像装置130采集车辆行驶过程中的待处理图像;并发送至该语义分割装置120。
该语义分割装置120将该待处理图像输入该第一优化语义分割模型,得到该待处理图像的语义分割图;并发送至该显示装置140。
显示装置140显示该待处理图像的语义分割图。
需要说明的是,图6中仅以服务器设置在云端为例进行绘示,但本申请不限于此。可选地,该服务器也可以设置在该车辆上,本申请对此不做限定。
可选地,上述各装置之间可以通过有线方式或无线方式进行通信,本申请实施例对此不作限定。
示例的,上述有线方式可以为通过数据线连接、或通过内部总线连接实现通信。
示例的,上述无线方式可以为通过通信网络实现通信,该通信网络可以是局域网,也可以是通过中继(relay)设备转接的广域网,或者包括局域网和广域网。当该通信网络为局域网时,该通信网络可以是无线保真(wireless fidelity,Wifi)热点网络、wifi对等(peer-to-peer,P2P)网络、蓝牙(bluetooth)网络、zigbee网络、近场通信(near field communication,NFC)网或者未来可能的通用短距离通信网络等。当该通信网络为广域网时,示例性的,该通信网络可以是第三代移动通信技术(3rd-generation wireless telephone technology,3G)网络、第四代移动通信技术(the 4th generation mobile communication technology,4G)网络、第五代移动通信技术(5th-generation mobile communication technology,5G)网络、公共陆地移动网络(public land mobile network,PLMN)或因特网(Internet)等,本申请实施例对此不作限定。
上面介绍了本申请实施例提供的语义分割方法和语义分割模型的优化方法所应用的系统和场景,下面将进一步介绍上述语义分割模型的优化方法和语义分割方法。
请参考图7,图7提供了本申请实施例提供的语义分割模型的优化方法200的示意性流程图。如图7所示,该方法200可以应用于如图5所示的系统100中,并可以由系统100中的优化装置110执行。该优化装置的优化流程可以包括以下步骤,需要说明的是,以下所列步骤可以以各种顺序执行和/或同时发生,不限于图7所示的执行顺序。
步骤201,优化装置获得目标图像,该目标图像是基于标注图像和无标注图像得到的。
需要说明的是,上述目标图像是对标注图像和无标注图像进行混合后得到的图像。上述标注图像是指图像中包括的每个像素具有标注值,该每个像素的标注值用于指示该每个像素所属的对象类别。上述无标注图像是指图像中包括的每个像素不具有标注值。上述标注值通常为人工标注的真实值。
还需要说明的是,该标注图像、该无标注图像和该目标图像的分辨率相同。
可选地,该标注图像和该无标注图像通常是在类似应用场景或类似环境下采集得到的,即该标注图像和该无标注图像中至少包括共同的部分对象类别。如标注图像中包括的对象类别为车、人、树和楼,未标注图像中包括的对象为车、树和楼。
示例的,图8示出了本申请实施例提供的标注图像的示意图,该标注图像中每个像素都具有标注值,该标注值用于表示该像素所属的对象类别。如图8所示,该标注图像中标注值为1的像素所属的对象类别为树,标注值为2的像素所属的对象类别为道路,标注值为3的像素所属的对象类别为天空,标注值为4像素所属的对象类别为楼房,标注值为5的像素所属的对象类别为云,标注值为6的像素所属的对象类别为汽车,标注值为7的像素所属的对象类别为地面。
示例的,图9示出了本申请实施例提供的无标注图像的示意图,该无标注图像仅包括像素,即该无标注图像中每个像素位置处不具有标注值。
在一种可能的实现方式中,上述目标图像可以包括该标注图像的部分区域和该无标注图像的部分区域。
示例的,以图8中所示的标注图像和图9中所示的无标注图像为例,图10示出了本申请实施例提供的目标图像的示意图,如图10所示,该目标图像可以包括标注的子图像1和无标注的子图像2,其中,该子图像1截取自该标注图像,该子图像2截取自该无标注图像。
示例的,以图8中所示的标注图像和图9中所示的无标注图像为例,图11示出了本申请实施例提供的另一目标图像的示意图,如图11所示,该目标图像可以包括标注的子图像3和无标注的子图像4,其中,该子图像3截取自该标注图像,该子图像4截取自该无标注图像。
可选地,该优化装置可以通过多种方式获得该目标图像,本申请对此不作限定。
在一种可能的实现方式中,该优化装置可以接收其它装置(如图像生成装置)发送的该目标图像。也就是说,该目标图像可以由该图像生成装置生成。
在另一种可能的实现方式中,该优化装置可以基于该标注图像和该无标注图像,生成该目标图像。
可选地,该优化装置可以通过多种方式基于该标注图像和该无标注图像,生成该目标图像,本申请对此不做限定。
在一种可能的实现方式中,该优化装置对该标注图像进行裁剪,得到第一子图像;对该无标注图像进行裁剪,得到第二子图像;对该第一子图像和该第二子图像进行拼接,得到该目标图像。
在另一种可能的实现方式中,该优化装置可以基于第一掩膜提取该标注图像中的第一感兴趣区,得到该第一子图像;基于第二掩膜提取该无标注图像中的第二感兴趣区,得到该第二子图像;对该第一子图像和该第二子图像进行拼接,得到该目标图像,该第一感兴 趣区在该第一掩膜中的位置与第二非感兴趣区在该第二掩膜中的位置对应,其中,该第二非感兴趣区为该第二掩膜中除该第二感兴趣区外的区域。
示例的,图12示出了本申请实施例提供的通过第一掩膜提取标注图像的第一感兴趣区的流程示意图,上述标注图像如图12中的(a)所示,上述第一掩膜如图12中的(b)所示,上述第一感兴趣区如图12中的(c)所示。
示例的,图13示出了本申请实施例提供的通过第二掩膜提取无标注图像的第二感兴趣区的流程示意图,上述无标注图像如图13中的(a)所示,上述第二掩膜如图13中的(b)所示,上述第二感兴趣区如图13中的(c)所示。
进一步地,该优化装置可以从该标注图像中裁取得到如图12中的(c)所示的第一感兴趣区对应的第一子图像,从该无标注图像中裁取得到如图13中的(c)所示的第二感兴趣区对应的第二子图像,并对该第一子图像和该第二子图像进行拼接,得到如图11中所示的目标图像。
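As an illustrative sketch (not part of the original disclosure), the snippet below mixes an annotated image and an unannotated image with a pair of complementary binary masks, matching the correspondence between the first region of interest and the second non-region-of-interest described above. The half-and-half mask layout and all array contents are assumptions for illustration; any complementary pair of masks works the same way.

```python
# Illustrative sketch only: stitching a target image from an annotated image
# and an unannotated image using complementary binary masks, so the region
# taken from the annotated image fills exactly the region not taken from the
# unannotated image.
import numpy as np

def mix_images(labeled_img, unlabeled_img, mask1):
    """labeled_img, unlabeled_img: (H, W, C); mask1: (H, W) of 0/1 values."""
    mask2 = 1 - mask1                                 # complementary second mask
    m1 = mask1[..., None]                             # broadcast over channels
    m2 = mask2[..., None]
    return labeled_img * m1 + unlabeled_img * m2      # the target image

H, W = 8, 8
labeled = np.random.rand(H, W, 3)
unlabeled = np.random.rand(H, W, 3)
mask1 = np.zeros((H, W), dtype=np.uint8)
mask1[:, : W // 2] = 1                                # left half from the annotated image
target = mix_images(labeled, unlabeled, mask1)
print(target.shape)  # (8, 8, 3) -- same resolution as both source images
```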
步骤202,该优化装置将该目标图像输入第一语义分割模型,得到第一输出结果。
需要说明的是,该第一语义分割模型为预先通过训练好的模型、用于识别C种对象类别,C为大于0的整数。相应地,该目标图像中至少包括该C种对象类别中的部分或全部。
在一种可能的实现方式中,该第一输出结果可以包括第一语义分割图和P个第一特征图,P的取值大于该目标图像的通道数量,该第一特征图的分辨率小于该目标图像的分辨率。
采用本申请提供的语义分割模型的优化方法,通过对标注图像和无标注图像进行混合,能够挖掘标注图像和无标注图像之间的关联,以降低标注图像和无标注图像之间的分布差异,通过混合后的目标图像训练该第一语义分割模型,能够提高该第一语义分割模型的域适应性,从而提高语义分割模型的预测准确度。
可选地,该第一语义分割模型可以采用卷积神经网络,该卷积神经网络至少包括处理层1、处理层2、处理层3和处理层4。
在一种可能的实现方式中,以该目标图像的尺寸可以为H 1×W 1×T,其中,H 1和W 1均为大于1的整数,T为大于0的整数为例,上述步骤202可以包括:该优化装置通过该处理层1对该目标图像进行特征提取,得到Q个特征图1,该特征图1的分辨率为H 2×W 2,其中,H 2小于H 1,W 2小于W 1,Q大于T;通过该处理层2将Q个特征图1映射至P个特征图2(即P个第一特征图),该特征图2的分辨率为H 2×W 2,其中,P小于Q;通过该处理层3将所述Q个特征图1映射至C个特征图3,该特征图3的分辨率为H 1×W 1,该C个特征图3和该C种对象类别一一对应,该特征图3包括H 1×W 1个置信度,该H 1×W 1个置信度与该目标图像包括的H 1×W 1个像素一一对应,该置信度用于表示该目标图像中对应位置的像素属于该特征图3对应的对象类别的概率;通过处理层4基于该C个特征图3和该C种对象类别中的每种对象类别的可信阈值1,得到该第一语义分割图,该第一语义分割图的分辨率为H 1×W 1
需要说明的是,在本申请中,Y个分辨率为H×W的特征图可以被称为一个H×W×Y的特征空间,该特征空间包括Y个通道(也即是该特征空间的深度为Y),该Y个通道中的每个通道包括H×W个像素;或者,Y个分辨率为H×W的特征图可以被称为一个H×W×Y的特征矩阵,该特征矩阵包括Y个特征向量,该Y个特征向量中的每个特 征向量包括H×W个元素,其中,Y为大于0的整数。
示例的,以1024×1024×3的目标图像为例,图14示出了本申请实施例提供的第一语义分割模型对目标图像进行语义分割的流程示意图。如图14所示,1024×1024×3的目标图像通过处理层1提取特征后得到128×128×1024的特征空间1,该特征空间1通过处理层2映射至128×128×256的特征空间2,该特征空间1通过处理层3映射至1024×1024×7的特征空间3,该特征空间3通过处理层4处理得到该第一语义分割图。
需要说明的是,为清楚起见,图14中所示的目标图像和第一语义分割图仅为示意图,该目标图像和该第一语义分割图的具体分辨率以图像下方标注的尺寸为准。
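As an illustrative sketch (not part of the original disclosure), the PyTorch module below reproduces only the tensor shapes of the FIG. 14 example (1024×1024×3 target image → 128×128×1024 feature space 1 → 128×128×256 feature space 2, and → 1024×1024×7 feature space 3). The exact layer composition (kernel sizes, number of convolution and deconvolution layers) is an assumption; the text fixes only the shape of each feature space.

```python
# Illustrative sketch only: a minimal module matching the shapes of FIG. 14.
# Layer composition is an assumption; only the feature-space shapes follow
# the text (processing layer 4, the thresholded argmax, is shown separately).
import torch
import torch.nn as nn

class SegmentationBackboneSketch(nn.Module):
    def __init__(self, in_ch=3, mid_ch=1024, proj_ch=256, num_classes=7):
        super().__init__()
        # processing layer 1: 8x downsampling (three stride-2 convs), 3 -> 1024 channels
        self.layer1 = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(512, mid_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # processing layer 2: 1x1 conv projecting 1024 -> 256 channels (P < Q)
        self.layer2 = nn.Conv2d(mid_ch, proj_ch, 1)
        # processing layer 3: 8x upsampling back to full resolution, C class maps
        self.layer3 = nn.Sequential(
            nn.ConvTranspose2d(mid_ch, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, x):
        f1 = self.layer1(x)    # feature space 1: (N, 1024, 128, 128)
        f2 = self.layer2(f1)   # feature space 2: (N, 256, 128, 128)
        f3 = self.layer3(f1)   # feature space 3: (N, 7, 1024, 1024)
        return f2, f3

model = SegmentationBackboneSketch()
f2, f3 = model(torch.randn(1, 3, 1024, 1024))
print(f2.shape, f3.shape)
```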
上述处理层1用于对目标图像进行下采样,得到Q个特征图1(即特征空间1),该特征图1的分辨率相比于目标图像的分辨率变低,即处理层1能够减小图像的分辨率,从而减小模型的计算量,提高分类效率;此外,Q大于目标图像的通道数量,即处理层1能够提升特征空间的维度,从而提取图像的高维空间特征。
在一种可能的实现方式中,该处理层1可以包括至少一个卷积层1。
上述处理层2用于将Q个特征图1映射至P个特征图2(即特征空间2),特征图2与特征图1的分辨率相同,但P小于Q,即处理层2能够降低特征空间的维度,以去除图像中的冗余特征,从而减少模型的计算量。
在一种可能的实现方式中,该处理层2可以包括至少一个卷积层2。
上述处理层3用于对Q个特征图1进行上采样,得到C个特征图3(即特征空间3),该特征图3的分辨率与目标图像的分辨率相同,即处理层3能够还原出该目标图像的全分辨率,从而恢复出目标图像中更多的细节特征。
在一种可能的实现方式中,该处理层3可以包括至少一个反卷积层和最大值函数(argmax)层。
需要说明的是,上面仅示意性介绍各处理层的结构,但本申请不限于此。可选地,各处理层还可以包括能够实现各自功能的其他操作层,本申请实施例对此不做限定。
例如，上述处理层1还可以包括至少一个池化层，池化层一方面可以使特征图的宽度和高度变小，通过减少特征图数据量降低卷积神经网络的计算复杂度；另一方面可以进行特征压缩，提取图像的主要特征。
在一种可能的实现方式中,该优化装置可以通过该处理层4确定该C个特征图3中同一位置的像素的最大置信度,若该最大置信度大于或等于该最大置信度所属的特征图3所对应的对象类别的可信阈值1,则确定该第一语义分割图中对应位置的像素属于该最大置信度所属的特征图3所对应的对象类别。
示例的，图15示出了本申请实施例提供的处理层4的处理流程示意图。如图15所示，以C取值为2为例，2个特征图3分别为图15中的(a)所示的特征图3-1和图15中的(b)所示的特征图3-2，其中，特征图3-1对应对象类别1，该对象类别1的可信阈值1为0.6；特征图3-2对应对象类别2，该对象类别2的可信阈值1为0.65。对于第一行第一列(即位置1)的像素，由该特征图3-1中位置1处的像素属于该对象类别1的置信度为0.78，特征图3-2中位置1处的像素属于该对象类别2的置信度为0.32，得到位置1对应的最大置信度为0.78，且0.78(特征图3-1中位置1对应的置信度)大于0.6(即该对象类别1的可信阈值1)，因此，该第一语义分割图中位置1处的像素属于该对象类别1。
类似地，对于第一行第五列(即位置2)的像素，由该特征图3-1中位置2处的像素属于该对象类别1的置信度为0.19，特征图3-2中位置2处的像素属于该对象类别2的置信度为0.81，得到位置2对应的最大置信度为0.81，且0.81(特征图3-2中位置2对应的置信度)大于0.65(即该对象类别2的可信阈值1)，因此，该第一语义分割图中位置2处的像素属于该对象类别2。
类似地，对于第四行第五列(即位置3)的像素，由该特征图3-1中位置3处的像素属于该对象类别1的置信度为0.44，特征图3-2中位置3处的像素属于该对象类别2的置信度为0.56，得到位置3对应的最大置信度为0.56，且0.56(特征图3-2中位置3对应的置信度)小于0.65(即该对象类别2的可信阈值1)，因此，该第一语义分割图中位置3处的像素既不属于对象类别1也不属于对象类别2。
同理可以采用类似流程得到2个特征图3中其它位置处的像素所属的对象类别,此处不再赘述。
需要说明的是,在该第一语义分割图中通过不同的标签值标注不同的对象类别。示例的,如图15中的(c)所示,通过标签值“1”标注该第一语义分割图的位置1处的像素属于对象类别1,通过标签值“2”标注该第一语义分割图的位置2处的像素属于对象类别2,通过标签值“0”标注该第一语义分割图的位置3处的像素属于缺省对象。
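As an illustrative sketch (not part of the original disclosure), the snippet below implements the per-pixel decision rule of processing layer 4 and reproduces the worked positions above: the winning class is kept only if its confidence reaches that class's credibility threshold, otherwise the default label 0 is assigned.

```python
# Illustrative sketch only: per-pixel argmax over C class confidence maps,
# followed by a per-class credibility-threshold check (label 0 = default).
import numpy as np

def layer4_decision(class_maps, thresholds):
    """class_maps: (C, H, W) confidences; thresholds: (C,) per-class thresholds.
    Returns an (H, W) label map with labels 1..C, or 0 for the default class."""
    best = np.argmax(class_maps, axis=0)           # (H, W) winning class index
    best_conf = np.max(class_maps, axis=0)         # (H, W) winning confidence
    accepted = best_conf >= thresholds[best]       # compare to that class's threshold
    return np.where(accepted, best + 1, 0)         # labels 1..C, 0 = default

class_maps = np.stack([
    np.array([[0.78, 0.19], [0.44, 0.30]]),        # class 1 confidences (toy)
    np.array([[0.32, 0.81], [0.56, 0.20]]),        # class 2 confidences (toy)
])
thresholds = np.array([0.6, 0.65])                 # credibility thresholds 1
print(layer4_decision(class_maps, thresholds))
# [[1 2]
#  [0 0]]  -- positions 1, 2 accepted; position 3 falls below the threshold
```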
步骤203,该优化装置将该无标注图像输入第二语义分割模型,得到第二输出结果,该第二语义分割模型与该第一语义分割模型的模型结构相同。
需要说明的是,本申请中所述的第一语义分割模型与第二语义分割模型的模型结构相同是指:第一,这两个模型的功能相同,即都用于识别该C种对象类别;第二,这两个模型使用的卷积神经网络的网络结构相同,即包括相同的处理层数量、处理层种类以及每个处理层的功能都相同。这两个模型的区别仅在于这两个模型中的处理层设置的参数可能不一样,如第一语义分割模型中卷积核的权重值和第二语义分割模型中卷积核的权重值不同。
需要说明的是,该第二语义分割模型为预先通过训练好的模型、用于识别该C种对象类别。相应地,该无标注图像中至少包括该C种对象类别中的部分或全部。
在一种可能的实现方式中,该第二输出结果可以包括第二语义分割图和P个第二特征图,该第二特征图的分辨率小于该目标图像的分辨率。
在一种可能的实现方式中,该优化装置可以对该无标注图像进行特征提取,得到Q个第三特征图,该第三特征图的分辨率为H 2×W 2;将该Q个第三特征图映射至该P个第二特征图,该第二特征图的分辨率为H 2×W 2;将该Q个第三特征图映射至C个第四特征图,该第四特征图的分辨率为H 1×W 1,该C个第四特征图和该C种对象类别一一对应,该第四特征图包括H 1×W 1个置信度,该H 1×W 1个置信度与该无标注图像包括的H 1×W 1个像素一一对应,该置信度用于表示该无标注图像中对应位置的像素属于该第四特征图对应的对象类别的概率;基于该C个第四特征图和该C种对象类别中的每种对象类别的第一可信阈值,得到该第二语义分割图,该第二语义分割图的分辨率为H 1×W 1。步骤203具体可以参考上述步骤202,此处不再赘述。
采用本申请实施例提供的语义分割模型的优化方法，第一语义分割模型输入的目标图像是经过对标注图像和无标注图像进行混合后得到的，能够更深入的挖掘标注图像和无标注图像之间的关联，以降低无标注图像和标注图像之间的分布差异，因此，通过该目标图像训练该第一语义分割模型，可以提高该第一语义分割模型的域适应能力，从而提高该第一语义分割模型的泛化性能。此外，第二语义分割模型输入的是无标注图像，通过该无标注图像训练该第二语义分割模型，能够减少该第二语义分割模型对标注图像的依赖，并能够降低标注图像的成本。
步骤204,该优化装置基于该目标图像、该第一输出结果和该第二输出结果,对该第一语义分割模型进行优化。
本申请采用第一语义分割模型和第二语义分割模型构成的双模型结构,其中,该第一语义分割模型可以作为学生模型,该第二语义分割模型可以作为教师模型,教师模型的输出结果可以用于辅助和指导学生模型的训练和优化,因此,能够提高学生模型的优化效果。
可选地,该第一输出结果可以包括第一语义分割图和P个第一特征图,该第二输出结果可以包括第二语义分割图和P个第二特征图。
相应地,步骤204可以包括:该优化装置基于该目标图像、该第一语义分割图、该P个第一特征图、该第二语义分割图和该P个第二特征图,对该第一语义分割模型进行优化。
在一种可能的实现方式中,该优化装置可以基于该P个第一特征图、该第二语义分割图、该P个第二特征图和第一损失函数,迭代调整模型的参数,该第一损失函数用于缩小属于相同对象类别的像素之间的距离和/或拉长属于不同对象类别的像素之间的距离。
采用本申请实施例提供的语义分割模型的优化方法,通过教师模型输出的第二语义分割图,可以指导学生模型对该P个第一特征图和该P个第二特征图进行对比学习,从而拉近不同类别的像素之间的距离,并拉远相同类别的像素之间的距离,以保证属于同一个类别的像素特征编码尽可能的相似,不同类别像素特征的编码尽可能不相似,因此,可以提高该学生模型分割类内的紧致性以及类间的差异性,从而提高学生模型的预测准确度。
在另一种可能的实现方式中,该优化装置可以基于该目标图像、该第一语义分割图和第二损失函数,迭代调整该第一语义分割模型的参数,该第二损失函数用于约束相同像素所属的对象类别的预测值和标注值的一致性。
采用本申请实施例提供的语义分割模型的优化方法,学生模型输入的目标图像中包括标注图像的部分图像区域,相应地,该学生模型输出的第一语义分割图中也包括与该部分图像区域对应的图像区域,通过对目标图像和第一语义分割图中相同像素的真实值和预测值进行一致性约束,能够提高学生模型的预测准确度。
在又一种可能的实现方式中,该优化装置可以基于该第一语义分割图、该第二语义分割图和第三损失函数,迭代调整该第一语义分割模型的参数,该第三损失函数用于约束该第一语义分割模型和该第二语义分割模型对相同像素所属的对象类别的预测结果的一致性。
采用本申请实施例提供的语义分割模型的优化方法,学生模型输入的目标图像中包括无标注图像的部分图像区域,相应地,教师模型输入的无标注图像也包括与该部分图像区域对应的图像区域,通过对学生模型和教师模型对相同像素所属的对象类别的预测结果进行一致性约束,能够提高学生模型的预测准确度。
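As an illustrative sketch (not part of the original disclosure), the snippet below shows one possible way to combine the three losses. The concrete loss forms (cross-entropy for the supervised and consistency terms, an InfoNCE-style term for the contrastive objective) are assumptions; the text specifies the goal of each loss, not its formula.

```python
# Illustrative sketch only: one way to realize the three losses. Pixels that
# a term should not supervise (unannotated pixels for the second loss, or
# pixels filtered out by the credibility thresholds for the third loss) are
# marked 255 here, which is an assumed convention.
import torch
import torch.nn.functional as F

def supervised_loss(student_logits, labels):
    # second loss: student prediction vs. ground-truth label on annotated pixels
    return F.cross_entropy(student_logits, labels, ignore_index=255)

def consistency_loss(student_logits, teacher_pseudo_labels):
    # third loss: student prediction vs. teacher pseudo-label on unannotated pixels
    return F.cross_entropy(student_logits, teacher_pseudo_labels, ignore_index=255)

def contrastive_loss(anchor, positive, negatives, tau=0.1):
    # first loss (InfoNCE-style): pull the anchor towards a same-class pixel
    # feature and push it away from K different-class pixel features
    pos = torch.exp(F.cosine_similarity(anchor, positive, dim=-1) / tau)
    neg = torch.exp(F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=-1) / tau).sum()
    return -torch.log(pos / (pos + neg))

logits = torch.randn(1, 7, 16, 16)            # toy student output, C = 7 classes
labels = torch.randint(0, 7, (1, 16, 16))     # toy ground-truth labels
pseudo = torch.randint(0, 7, (1, 16, 16))     # toy teacher pseudo-labels
total = (supervised_loss(logits, labels)
         + consistency_loss(logits, pseudo)
         + contrastive_loss(torch.randn(256), torch.randn(256), torch.randn(8, 256)))
```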
需要说明的是,由于该第二语义分割模型采用的是无监督学习的训练方法,预测结果 的可靠性较差,基于该第二语义分割模型输出的第二语义分割图对该第一语义分割模型进行优化的效果较差。
因此,为提高该第二语义分割模型的预测结果可靠性,该优化装置可以基于该C种对象类别中的每种对象类别的第一可信阈值和该每种对象类别的第二可信阈值,得到该每种对象类别的目标可信阈值,其中,该每种对象类别的第一可信阈值为本轮迭代过程中该第二语义分割模型使用的可信阈值,该每种对象类别的第二可信阈值为上一轮迭代过程中该第二语义分割模型使用的可信阈值;基于该C个第四特征图和该每种对象类别的目标可信阈值,得到第三语义分割图,该第三语义分割图的分辨率为H 1×W 1;基于该目标图像、该第一语义分割图、该P个第一特征图、该第三语义分割图和该P个第二特征图,对该第一语义分割模型进行优化。
示例的，可以通过如下公式（1）得到该目标可信阈值Th′。
Th′ = α·Th_{t-1} + (1-α)·Th_t        公式（1）
其中，α表示权重系数，Th_{t-1}表示上一轮迭代过程中该第二语义分割模型使用的可信阈值（即第二可信阈值），Th_t表示本轮迭代过程中该第二语义分割模型使用的可信阈值（即第一可信阈值）。
进一步地,该优化装置可以将该第二语义分割模型本轮使用的可信阈值由该第一可信阈值更新为该目标可信阈值。
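As an illustrative sketch (not part of the original disclosure), the snippet below applies the moving-average update of formula (1) per object class; the weight α and the threshold values are placeholders.

```python
# Illustrative sketch only: per-class moving-average update of formula (1).
import numpy as np

def update_thresholds(prev_th, curr_th, alpha=0.9):
    """prev_th: thresholds of the previous iteration (Th_{t-1});
    curr_th: thresholds of this iteration (Th_t); returns Th'."""
    return alpha * prev_th + (1.0 - alpha) * curr_th

prev_th = np.array([0.60, 0.65, 0.70])   # per-class thresholds, last iteration
curr_th = np.array([0.62, 0.60, 0.75])   # per-class thresholds, this iteration
target_th = update_thresholds(prev_th, curr_th)
print(target_th)                          # used by the second model next round
```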
采用本申请实施例提供的语义分割模型的优化方法,由于该第二语义分割模型采用的是无监督学习的训练方法,预测结果的可靠性较差,因此,该优化装置基于上一轮迭代过程中该第二语义分割模型使用的可信阈值和本轮迭代过程中该第二语义分割模型使用的可信阈值动态更新各对象类别的可信阈值,以保证各对象类别的可信阈值始终在一个合理的数值范围内,进一步地,可以基于更新后各对象类别的可信阈值对该第二语义分割图中的预测结果进行筛查以筛除掉该第二语义分割图中的可靠性较差的预测结果,得到第三语义分割图,基于该第三语义分割图对该第一语义分割模型进行优化,有利于提高该第一语义分割模型的可靠性。
进一步地,该优化装置可以基于该目标图像、该第一语义分割图、该P个第一特征图、该第三语义分割图和该P个第二特征图,对该第一语义分割模型进行优化。具体可以参考上述基于该目标图像、该第一语义分割图、该P个第一特征图、该第三语义分割图和该P个第二特征图,对该第一语义分割模型进行优化的介绍,此处不再赘述。
通过上述步骤201~步骤204,可以得到经优化的第一语义分割模型,即第一优化语义分割模型。
可选地,该优化方法200还可以包括:该优化装置向语义分割装置发送优化后的第一语义分割模型,即第一优化语义分割模型。
可选地,该优化装置可以通过多种方式向该语义分割装置发送该第一优化语义分割模型,本申请对此不做限定。
在一种可能的实现方式中,该优化装置可以基于预设的周期,周期性向该语义分割装置发送该第一优化语义分割模型。也就是说,该优化装置可以定期向该语义分割装置更新优化后的第一语义分割模型。
在另一种可能的实现方式中,该优化装置可以接收来自该语义分割装置的请求信息, 该请求信息用于请求对该第一语义分割模型进行优化;该优化装置基于该请求信息,将该第一优化语义分割模型发送至该语义分割装置。
请参考图16,图16示出了本申请实施例提供的语义分割方法300的示意性流程图。如图16所示,该方法300可以应用于如图5所示的系统100中,并可以由系统100中的语义分割装置120执行。该语义分割装置的语义分割流程可以包括以下步骤,需要说明的是,以下所列步骤可以以各种顺序执行和/或同时发生,不限于图16所示的执行顺序。
步骤301,语义分割装置获得待处理图像。
步骤302,该语义分割装置将该待处理图像输入第一优化语义分割模型,得到该待处理图像的语义分割图。
可选地,该语义分割装置可以通过多种方式获得该待处理图像,本申请对此不做限定。
在一种可能的实现方式中,该语义分割装置获得该待处理图像可以包括:该语义分割装置接收摄像装置发送的该待处理图像。相应地,摄像装置采集该待处理图像,并发送至该语义分割装置。
在另一种可能的实现方式中，该语义分割装置可以接收来自其他图像采集装置的该待处理图像，该其他图像采集装置用于采集该待处理图像。
可选地,在该将该待处理图像输入第一优化语义分割模型之前,该语义分割装置可以获得该第一优化语义分割模型。
可选地,该语义分割装置可以通过多种方式获得该第一优化语义分割模型,本申请对此不做限定。
在一种可能的实现方式中,该语义分割装置可以基于预设的周期,周期性接收来自该优化装置发送的该第一优化语义分割模型。也就是说,该语义分割装置可以定期接收该优化装置更新的优化后的第一语义分割模型。
在另一种可能的实现方式中,该语义分割装置可以向语义分割模型的优化装置发送请求信息,该请求信息用于请求对第一语义分割模型进行优化;并接收该语义分割模型的优化装置发送的该第一优化语义分割模型。
本申请实施例提供的语义分割方法,通过上述优化后的第一语义分割模型对待处理图像进行语义分割,能够提高语义分割的准确度。
上面结合图7至图16介绍了本申请实施例提供的语义分割模型的优化方法以及语义分割方法,下面将进一步介绍本申请实施例提供的语义分割装置的优化装置以及语义分割装置。
请参考图17,图17示出了本申请实施例提供的语义分割模型的优化装置400的示意性框图,该优化装置400可以包括获得模块401、第一语义分割模块402、第二语义分割模块403和优化模块404。
可选地,该优化装置400可以用于上述系统100,进一步地,该优化装置400可以为上述系统100中的优化装置110。
该获得模块401,用于获得目标图像,该目标图像是基于标注图像和无标注图像得到的;
该第一语义分割模块402,用于将该目标图像输入第一语义分割模型,得到第一输出结果;
该第二语义分割模块403,用于将该无标注图像输入第二语义分割模型,得到第二输出结果,该第二语义分割模型与该第一语义分割模型的模型结构相同;
该优化模块404,用于基于该目标图像、该第一输出结果和该第二输出结果,对该第一语义分割模型进行优化。
在一种可能的实现方式中,该第一输出结果包括第一语义分割图和P个第一特征图,P的取值大于该目标图像的通道数量,该第一特征图的分辨率小于该目标图像的分辨率,该第二输出结果包括第二语义分割图和P个第二特征图,该第二特征图的分辨率与该第一特征图的分辨率相同,该第一语义分割图的分辨率和该第二语义分割图的分辨率均与该目标图像的分辨率相同;该优化模块404具体用于基于该目标图像、该第一语义分割图、该P个第一特征图、该第二语义分割图和该P个第二特征图,对该第一语义分割模型进行优化。
在一种可能的实现方式中,该无标注图像的尺寸为H 1×W 1×T,该第二语义分割模型用于识别C种对象类别,其中,H 1和W 1均为大于1的整数,T为大于0的整数,C为大于0的整数,该第二语义分割模块403具体用于:对该无标注图像进行特征提取,得到Q个第三特征图,该第三特征图的分辨率为H 2×W 2,其中,H 2小于H 1,W 2小于W 1,Q大于T;将该Q个第三特征图映射至该P个第二特征图,该第二特征图的分辨率为H 2×W 2,其中,P小于Q;将该Q个第三特征图映射至C个第四特征图,该第四特征图的分辨率为H 1×W 1,该C个第四特征图和该C种对象类别一一对应,该第四特征图包括H 1×W 1个置信度,该H 1×W 1个置信度与该无标注图像包括的H 1×W 1个像素一一对应,该置信度用于表示该无标注图像中对应位置的像素属于该第四特征图对应的对象类别的概率;基于该C个第四特征图和该C种对象类别中的每种对象类别的第一可信阈值,得到该第二语义分割图,该第二语义分割图的分辨率为H 1×W 1
在一种可能的实现方式中,该优化装置400还包括阈值更新模块405,该阈值更新模块405用于基于该每种对象类别的第一可信阈值和该每种对象类别的第二可信阈值,得到该每种对象类别的目标可信阈值,其中,该每种对象类别的第一可信阈值为本轮迭代过程中该第二语义分割模型使用的可信阈值,该每种对象类别的第二可信阈值为上一轮迭代过程中该第二语义分割模型使用的可信阈值;基于该C个第四特征图和该每种对象类别的目标可信阈值,得到第三语义分割图,该第三语义分割图的分辨率为H 1×W 1;该优化模块404具体用于基于该目标图像、该第一语义分割图、该P个第一特征图、该第三语义分割图和该P个第二特征图,对该第一语义分割模型进行优化。
在一种可能的实现方式中,该优化模块404具体用于:基于该P个第一特征图、该第三语义分割图、该P个第二特征图和第一损失函数,迭代调整模型的参数,该第一损失函数用于缩小属于相同对象类别的像素之间的距离和/或拉长属于不同对象类别的像素之间的距离;基于该目标图像、该第一语义分割图和第二损失函数,迭代调整该第一语义分割模型的参数,该第二损失函数用于约束相同像素所属的对象类别的预测值和标注值的一致性;基于该第一语义分割图、该第三语义分割图和第三损失函数,迭代调整该第一语义分割模型的参数,该第三损失函数用于约束该第一语义分割模型和该第二语义分割模型对相同像素所属的对象类别的预测结果的一致性。
在一种可能的实现方式中,该目标图像包括该标注图像的部分区域和该无标注图像的 部分区域。
在一种可能的实现方式中,该获得模块401具体用于:对该标注图像进行裁剪,得到第一子图像;对该无标注图像进行裁剪,得到第二子图像;对该第一子图像和该第二子图像进行拼接,得到该目标图像。
需要说明的是,上述装置之间的信息交互、执行过程等内容,由于与本申请方法实施例基于同一构思,其具体功能及带来的技术效果,具体可参见方法实施例部分,此处不再赘述。在一个可选例子中,优化装置400可以具体为上述优化方法200实施例中的优化装置,优化装置400可以用于执行上述优化方法200实施例中与优化装置对应的各个流程和/或步骤,为避免重复,在此不再赘述。
图17所示实施例中的各个模块中的一个或多个可以通过软件、硬件、固件或其结合实现。所述软件或固件包括但不限于计算机程序指令或代码,并可以被硬件处理器所执行。所述硬件包括但不限于各类集成电路,如中央处理单元(CPU,Central Processing Unit)、数字信号处理器(DSP,Digital Signal Processor)、现场可编程门阵列(FPGA,Field Programmable Gate Array)或专用集成电路(ASIC,Application Specific Integrated Circuit)。
示例的，图18示出了本申请实施例提供的语义分割模型的优化方法的流程示意图。可选地，该流程中的步骤可以由图17中所述的优化装置400执行。需要说明的是，以下所列步骤可以以各种顺序执行和/或同时发生，不限于图18所示的执行顺序。该流程包括以下步骤：
(1)获得模块401获得标注图像和无标注图像。
(2)获得模块401基于该标注图像和该无标注图像,得到目标图像。具体可以参考上述方法步骤201中的相关介绍。
(3)获得模块401将该目标图像发送至第一语义分割模块402和优化模块404。
(4)第一语义分割模块402将该目标图像输入第一语义分割模型,得到第一语义分割图和P个特征图,P的取值大于该目标图像的通道数量,该第一特征图的分辨率小于该目标图像的分辨率。具体可以参考上述方法步骤202中的相关介绍。
(5)第一语义分割模块402将该第一语义分割图和该P个第一特征图发送至优化模块404。
(6)第二语义分割模块403获得该无标注图像。
(7)第二语义分割模块403将该无标注图像输入第二语义分割模型,得到第二语义分割图和P个第二特征图,该第二特征图的分辨率与该第一特征图的分辨率相同,该第一语义分割图的分辨率和该第二语义分割图的分辨率均与该目标图像的分辨率相同,具体可以参考上述方法步骤203中的相关介绍。
其中,该第二语义分割模型与该第一语义分割模型的模型结构相同,均用于识别C种对象类别,C为大于0的整数。
(8)第二语义分割模块403将第二语义分割图发送至阈值更新模块405。
(9)第二语义分割模块403将该P个第二特征图发送至优化模块404。
(10)阈值更新模块405获得该第二语义分割模型上一轮迭代过程中使用该C种对象类别中的每个种对象类别的第一可信阈值和本轮迭代过程中使用的该每个种对象类别的第二可信阈值,得到该每个种对象类别的目标可信阈值。
(11)阈值更新模块405基于该第二语义分割图和该每个种对象类别的目标可信阈值,得到第三语义分割图。
(12)阈值更新模块405将该第三语义分割图发送至优化模块404。
(13)阈值更新模块405将该目标可信阈值发送至该第二语义分割模块403。
(14)第二语义分割模块403将本轮迭代过程中该第二语义分割模型使用的可信阈值由该第一可信阈值更新为该目标可信阈值。
(15)优化模块404基于该目标图像、该第一语义分割图、该P个第一特征图、该第三语义分割图和该P个第二特征图,对该第一语义分割模型进行优化,如迭代调整该第一语义分割模型的模型参数。具体可以参考上述所述步骤204中的相关介绍。
请参见图19,图19示出了本申请实施例提供的语义分割模型的优化装置500的示意性框图,优化装置500可以包括处理器501和通信接口502,处理器501和通信接口502耦合。
通信接口502,用于向处理器501输入图像数据,和/或从处理器501输出图像数据;处理器501运行计算机程序或指令,以使优化装置500实现上述方法200实施例所描述的优化方法。
本申请实施例中的处理器501包括但不限于中央处理单元(Central Processing Unit,CPU)、通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field Programmable Gate Array,FPGA)、分立门或者晶体管逻辑器件或分立硬件组件等。通用处理器可以是微处理器、微控制器或者是任何常规的处理器等。
例如,处理器501用于通过通信接口502获得目标图像,该目标图像是基于标注图像和无标注图像得到的;将该目标图像输入第一语义分割模型,得到第一输出结果;将该无标注图像输入第二语义分割模型,得到第二输出结果,该第二语义分割模型与该第一语义分割模型的模型结构相同;基于该目标图像、该第一输出结果和该第二输出结果,对该第一语义分割模型进行优化。在一个可选例子中,本领域技术人员可以理解,优化装置500可以具体为上述优化方法200实施例中的优化装置,优化装置500可以用于执行上述优化方法200实施例中与优化装置对应的各个流程和/或步骤,为避免重复,在此不再赘述。
可选地,优化装置500还可以包括存储器503。
存储器503可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(Synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DR RAM)。
具体地,存储器503用于存储优化装置的程序代码和指令。可选地,存储器503还用于存储处理器501执行上述优化优化方法200实施例过程中获得的图像数据,如通过通信接口502获得的目标图像。
可选地,存储器503可以为单独的器件或集成在处理器501中。
需要说明的是,图19仅仅示出了优化装置500的简化设计。在实际应用中,优化装置500还可以分别包含必要的其他元件,包含但不限于任意数量的通信接口、处理器、控制器、存储器等,而所有可以实现本申请的优化装置500都在本申请的保护范围之内。
在一种可能的设计中,优化装置500可以为芯片。可选地,该芯片还可以包括一个或多个存储器,用于存储计算机执行指令,当该芯片装置运行时,处理器可执行存储器存储的计算机执行指令,以使芯片执行上述优化方法。
可选地,该芯片装置可以为实现相关功能的现场可编程门阵列,专用集成芯片,系统芯片,中央处理器,网络处理器,数字信号处理电路,微控制器,还可以采用可编程控制器或其他集成芯片。
本申请实施例还提供一种计算机可读存储介质,该计算机可读存储介质中存储有计算机指令,当该计算机指令在计算机上运行时,实现上述方法实施例描述的优化方法。
本申请实施例还提供一种计算机程序产品,当该计算机程序产品在处理器上运行时,实现上述方法实施例描述的优化方法。
本申请实施例提供的优化装置、计算机可读存储介质、计算机程序产品或芯片均用于执行上文所提供的对应的优化方法,因此,其所能达到的有益效果可参考上文所提供的对应的优化方法中的有益效果,此处不再赘述。
请参考图20,图20示出了本申请实施例提供的语义分割装置600的示意性框图,该装置600可以包括获得模块601和语义分割模块602。
可选地,该装置600可以用于上述系统100,进一步地,该装置600可以为上述系统100中的语义分割装置120。
该获得模块601用于获得待处理图像。
该语义分割模块602用于将该待处理图像输入第一优化语义分割模型,得到该待处理图像的语义分割图。
需要说明的是上述第一优化语义分割模型是通过本申请实施例提供的优化方法200对第一语义分割模型进行优化后得到的,具体优化方法再次不再赘述。
图20所示实施例中的各个模块中的一个或多个可以通过软件、硬件、固件或其结合实现。所述软件或固件包括但不限于计算机程序指令或代码,并可以被硬件处理器所执行。所述硬件包括但不限于各类集成电路,如中央处理单元(CPU,Central Processing Unit)、数字信号处理器(DSP,Digital Signal Processor)、现场可编程门阵列(FPGA,Field Programmable Gate Array)或专用集成电路(ASIC,Application Specific Integrated Circuit)。
请参见图21,图21示出了本申请实施例提供的语义分割装置700的示意性框图,装置700可以包括处理器701和通信接口702,处理器701和通信接口702耦合。
通信接口702,用于向处理器701输入图像数据,和/或从处理器701输出图像数据;处理器701运行计算机程序或指令,以使装置700实现上述方法300实施例所描述的语义分割方法。
本申请实施例中的处理器701包括但不限于中央处理单元(Central Processing Unit,CPU)、通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field Programmable Gate Array,FPGA)、分立门或者晶体管逻辑器件或分立硬件组件等。通用处理器可以是微处理器、微控制器或者是任何常规的处理器等。
例如,处理器701用于通过通信接口702获得待处理图像;将该待处理图像输入第一优化语义分割模型,得到该待处理图像的语义分割图。在一个可选例子中,本领域技术人员可以理解,装置700可以具体为上述方法300实施例中的语义分割装置,装置700可以用于执行上述方法300实施例中与语义分割装置对应的各个流程和/或步骤,为避免重复,在此不再赘述。
可选地,装置700还可以包括存储器703。
存储器703可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(Synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DR RAM)。
具体地,存储器703用于存储装置的程序代码和指令。可选地,存储器703还用于存储处理器701执行上述方法300实施例过程中获得的图像数据,如通过通信接口702获得的待处理图像。
可选地,存储器703可以为单独的器件或集成在处理器701中。
需要说明的是,图21仅仅示出了装置700的简化设计。在实际应用中,装置700还可以分别包含必要的其他元件,包含但不限于任意数量的通信接口、处理器、控制器、存储器等,而所有可以实现本申请的装置700都在本申请的保护范围之内。
在一种可能的设计中,装置700可以为芯片。可选地,该芯片还可以包括一个或多个存储器,用于存储计算机执行指令,当该芯片装置运行时,处理器可执行存储器存储的计算机执行指令,以使芯片执行上述语义分割方法。
可选地,该芯片装置可以为实现相关功能的现场可编程门阵列,专用集成芯片,系统芯片,中央处理器,网络处理器,数字信号处理电路,微控制器,还可以采用可编程控制器或其他集成芯片。
本申请实施例还提供一种计算机可读存储介质,该计算机可读存储介质中存储有计算机指令,当该计算机指令在计算机上运行时,实现上述方法实施例描述的语义分割方法。
本申请实施例还提供一种计算机程序产品,当该计算机程序产品在处理器上运行时,实现上述方法实施例描述的语义分割方法。
本申请实施例提供的语义分割装置、计算机可读存储介质、计算机程序产品或芯片均用于执行上文所提供的对应的语义分割方法,因此,其所能达到的有益效果可参考上文所提供的对应的语义分割方法中的有益效果,此处不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个模块或组件可以结合或者可以集成到另一个装置,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置的间接耦合或通信连接,可以是电性,机械或其它的形式。
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是一个物理单元或多个物理单元,即可以位于一个地方,或者也可以分布到多个不同地方。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (17)

  1. 一种语义分割模型的优化方法,其特征在于,包括:
    获得目标图像,所述目标图像是基于标注图像和无标注图像得到的;
    将所述目标图像输入第一语义分割模型,得到第一输出结果;
    将所述无标注图像输入第二语义分割模型,得到第二输出结果,所述第二语义分割模型与所述第一语义分割模型的模型结构相同;
    基于所述目标图像、所述第一输出结果和所述第二输出结果,对所述第一语义分割模型进行优化。
  2. 根据权利要求1所述的方法,其特征在于,所述第一输出结果包括第一语义分割图和P个第一特征图,P的取值大于所述目标图像的通道数量,所述第一特征图的分辨率小于所述目标图像的分辨率,所述第二输出结果包括第二语义分割图和P个第二特征图,所述第二特征图的分辨率与所述第一特征图的分辨率相同,所述第一语义分割图的分辨率和所述第二语义分割图的分辨率均与所述目标图像的分辨率相同;
    其中,所述基于所述目标图像、所述第一输出结果和所述第二输出结果,对所述第一语义分割模型进行优化,包括:
    基于所述目标图像、所述第一语义分割图、所述P个第一特征图、所述第二语义分割图和所述P个第二特征图,对所述第一语义分割模型进行优化。
  3. 根据权利要求2所述的方法,其特征在于,所述无标注图像的尺寸为H 1×W 1×T,所述第二语义分割模型用于识别C种对象类别,其中,H 1和W 1均为大于1的整数,T为大于0的整数,C为大于0的整数,所述将所述无标注图像输入第二语义分割模型,得到第二输出结果,包括:
    对所述无标注图像进行特征提取,得到Q个第三特征图,所述第三特征图的分辨率为H 2×W 2,其中,H 2小于H 1,W 2小于W 1,Q大于T;
    将所述Q个第三特征图映射至所述P个第二特征图,所述第二特征图的分辨率为H 2×W 2,其中,P小于Q;
    将所述Q个第三特征图映射至C个第四特征图,所述第四特征图的分辨率为H 1×W 1,所述C个第四特征图和所述C种对象类别一一对应,所述第四特征图包括H 1×W 1个置信度,所述H 1×W 1个置信度与所述无标注图像包括的H 1×W 1个像素一一对应,所述置信度用于表示所述无标注图像中对应位置的像素属于所述第四特征图对应的对象类别的概率;
    基于所述C个第四特征图和所述C种对象类别中的每种对象类别的第一可信阈值,得到所述第二语义分割图,所述第二语义分割图的分辨率为H 1×W 1
  4. 根据权利要求3所述的方法,其特征在于,所述基于所述目标图像、所述第一语义分割图、所述P个第一特征图、所述第二语义分割图和所述P个第二特征图,对所述第一语义分割模型进行优化,包括:
    基于所述每种对象类别的第一可信阈值和所述每种对象类别的第二可信阈值,得到所述每种对象类别的目标可信阈值,其中,所述每种对象类别的第一可信阈值为本轮迭代过程中所述第二语义分割模型使用的可信阈值,所述每种对象类别的第二可信阈值为上一轮 迭代过程中所述第二语义分割模型使用的可信阈值;
    基于所述C个第四特征图和所述每种对象类别的目标可信阈值,得到第三语义分割图,所述第三语义分割图的分辨率为H 1×W 1
    基于所述目标图像、所述第一语义分割图、所述P个第一特征图、所述第三语义分割图和所述P个第二特征图,对所述第一语义分割模型进行优化。
  5. 根据权利要求4所述的方法,其特征在于,所述基于所述目标图像、所述第一语义分割图、所述P个第一特征图、所述第三语义分割图和所述P个第二特征图,对所述第一语义分割模型进行优化,包括:
    基于所述P个第一特征图、所述第三语义分割图、所述P个第二特征图和第一损失函数,迭代调整模型的参数,所述第一损失函数用于缩小属于相同对象类别的像素之间的距离和/或拉长属于不同对象类别的像素之间的距离;
    基于所述目标图像、所述第一语义分割图和第二损失函数,迭代调整所述第一语义分割模型的参数,所述第二损失函数用于约束相同像素所属的对象类别的预测值和标注值的一致性;
    基于所述第一语义分割图、所述第三语义分割图和第三损失函数,迭代调整所述第一语义分割模型的参数,所述第三损失函数用于约束所述第一语义分割模型和所述第二语义分割模型对相同像素所属的对象类别的预测结果的一致性。
  6. 根据权利要求1-5任一项所述的方法,其特征在于,所述目标图像包括所述标注图像的部分区域和所述无标注图像的部分区域。
  7. 根据权利要求1-6任一项所述的方法,其特征在于,所述获得目标图像,包括:
    对所述标注图像进行裁剪,得到第一子图像;
    对所述无标注图像进行裁剪,得到第二子图像;
    对所述第一子图像和所述第二子图像进行拼接,得到所述目标图像。
  8. 一种语义分割模型的优化装置,其特征在于,包括:
    获得模块,用于获得目标图像,所述目标图像是基于标注图像和无标注图像得到的;
    第一语义分割模块,用于将所述目标图像输入第一语义分割模型,得到第一输出结果;
    第二语义分割模块,用于将所述无标注图像输入第二语义分割模型,得到第二输出结果,所述第二语义分割模型与所述第一语义分割模型的模型结构相同;
    优化模块,用于基于所述目标图像、所述第一输出结果和所述第二输出结果,对所述第一语义分割模型进行优化。
  9. 根据权利要求8所述的装置,其特征在于,所述第一输出结果包括第一语义分割图和P个第一特征图,P的取值大于所述目标图像的通道数量,所述第一特征图的分辨率小于所述目标图像的分辨率,所述第二输出结果包括第二语义分割图和P个第二特征图,所述第二特征图的分辨率与所述第一特征图的分辨率相同,所述第一语义分割图的分辨率和所述第二语义分割图的分辨率均与所述目标图像的分辨率相同;
    所述优化模块具体用于基于所述目标图像、所述第一语义分割图、所述P个第一特征图、所述第二语义分割图和所述P个第二特征图,对所述第一语义分割模型进行优化。
  10. 根据权利要求9所述的装置,其特征在于,所述无标注图像的尺寸为H 1×W 1×T,所述第二语义分割模型用于识别C种对象类别,其中,H 1和W 1均为大于1的整数,T为 大于0的整数,C为大于0的整数,所述第二语义分割模块具体用于:
    对所述无标注图像进行特征提取,得到Q个第三特征图,所述第三特征图的分辨率为H 2×W 2,其中,H 2小于H 1,W 2小于W 1,Q大于T;
    将所述Q个第三特征图映射至所述P个第二特征图,所述第二特征图的分辨率为H 2×W 2,其中,P小于Q;
    将所述Q个第三特征图映射至C个第四特征图,所述第四特征图的分辨率为H 1×W 1,所述C个第四特征图和所述C种对象类别一一对应,所述第四特征图包括H 1×W 1个置信度,所述H 1×W 1个置信度与所述无标注图像包括的H 1×W 1个像素一一对应,所述置信度用于表示所述无标注图像中对应位置的像素属于所述第四特征图对应的对象类别的概率;
    基于所述C个第四特征图和所述C种对象类别中的每种对象类别的第一可信阈值,得到所述第二语义分割图,所述第二语义分割图的分辨率为H 1×W 1
  11. 根据权利要求10所述的装置,其特征在于,所述优化装置还包括阈值更新模块,
    所述阈值更新模块用于基于所述每种对象类别的第一可信阈值和所述每种对象类别的第二可信阈值,得到所述每种对象类别的目标可信阈值,其中,所述每种对象类别的第一可信阈值为本轮迭代过程中所述第二语义分割模型使用的可信阈值,所述每种对象类别的第二可信阈值为上一轮迭代过程中所述第二语义分割模型使用的可信阈值;基于所述C个第四特征图和所述每种对象类别的目标可信阈值,得到第三语义分割图,所述第三语义分割图的分辨率为H 1×W 1
    所述优化模块具体用于基于所述目标图像、所述第一语义分割图、所述P个第一特征图、所述第三语义分割图和所述P个第二特征图,对所述第一语义分割模型进行优化。
  12. 根据权利要求11所述的装置,其特征在于,所述优化模块具体用于:
    基于所述P个第一特征图、所述第三语义分割图、所述P个第二特征图和第一损失函数,迭代调整模型的参数,所述第一损失函数用于缩小属于相同对象类别的像素之间的距离和/或拉长属于不同对象类别的像素之间的距离;
    基于所述目标图像、所述第一语义分割图和第二损失函数,迭代调整所述第一语义分割模型的参数,所述第二损失函数用于约束相同像素所属的对象类别的预测值和标注值的一致性;
    基于所述第一语义分割图、所述第三语义分割图和第三损失函数,迭代调整所述第一语义分割模型的参数,所述第三损失函数用于约束所述第一语义分割模型和所述第二语义分割模型对相同像素所属的对象类别的预测结果的一致性。
  13. 根据权利要求8-12任一项所述的装置,其特征在于,所述目标图像包括所述标注图像的部分区域和所述无标注图像的部分区域。
  14. 根据权利要求8-13任一项所述的装置,其特征在于,所述获得模块具体用于:
    对所述标注图像进行裁剪,得到第一子图像;
    对所述无标注图像进行裁剪,得到第二子图像;
    对所述第一子图像和所述第二子图像进行拼接,得到所述目标图像。
  15. 一种语义分割模型的优化装置,其特征在于,包括:处理器和通信接口,所述处理器和所述通信接口耦合,所述通信接口用于为所述处理器提供信息和/或数据,所述处 理器用于运行计算机程序指令以执行上述权利要求1-7中任一项所述的方法。
  16. 一种计算机可读存储介质,其特征在于,用于存储计算机程序,所述计算机程序被处理器运行时,实现如权利要求1-7任一项所述的方法。
  17. 一种计算机程序产品,其特征在于,当所述计算机程序产品在处理器上运行时,实现如权利要求1-7任一项所述的方法。
PCT/CN2021/113095 2021-08-17 2021-08-17 语义分割模型的优化方法和装置 WO2023019444A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/113095 WO2023019444A1 (zh) 2021-08-17 2021-08-17 语义分割模型的优化方法和装置
CN202180100913.9A CN117693768A (zh) 2021-08-17 2021-08-17 语义分割模型的优化方法和装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/113095 WO2023019444A1 (zh) 2021-08-17 2021-08-17 语义分割模型的优化方法和装置

Publications (1)

Publication Number Publication Date
WO2023019444A1 true WO2023019444A1 (zh) 2023-02-23

Family

ID=85239923

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/113095 WO2023019444A1 (zh) 2021-08-17 2021-08-17 语义分割模型的优化方法和装置

Country Status (2)

Country Link
CN (1) CN117693768A (zh)
WO (1) WO2023019444A1 (zh)
