CN114677512A - Training method of semantic segmentation model, semantic segmentation method and semantic segmentation device - Google Patents


Info

Publication number
CN114677512A
CN114677512A
Authority
CN
China
Prior art keywords: image, semantic segmentation, inputting, branch, feature
Prior art date
2022-03-25
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210301937.7A
Other languages
Chinese (zh)
Inventor
唐月标
叶泽锐
黄镜澄
张丹枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2022-03-25
Publication date
2022-06-28
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202210301937.7A
Publication of CN114677512A
Legal status: Pending

Classifications

    • G06F18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches (G — Physics; G06 — Computing; G06F — Electric digital data processing)
    • G06N3/045 — Computing arrangements based on biological models; neural networks; combinations of networks
    • G06N3/047 — Neural networks; probabilistic or stochastic networks
    • G06N3/048 — Neural networks; activation functions
    • G06N3/08 — Neural networks; learning methods

Abstract

Embodiments of the present application provide a training method for a semantic segmentation model, a semantic segmentation method, and corresponding apparatuses, relating to the technical field of image processing. The method comprises the following steps: inputting a plurality of acquired sample images into a semantic segmentation model to obtain a target feature image, calculating a loss value for the target feature image using a cross-entropy loss function, and training the model according to the loss value, wherein the semantic segmentation model comprises a plurality of attention models. The attention models can fully extract the feature information of the sample images, and segmenting images with the trained semantic segmentation model can effectively improve the accuracy of image segmentation.

Description

Training method of semantic segmentation model, semantic segmentation method and semantic segmentation device
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a training method for a semantic segmentation model, a semantic segmentation method, and an apparatus thereof.
Background
At present, more and more application scenarios require semantic segmentation of video images, for example autonomous driving, indoor navigation, virtual reality, and image recognition.
UNet, also known as the U-Net convolutional neural network, is a widely used semantic segmentation model. It is based on a fully convolutional network, can perform convolution operations on images of arbitrary shape and size, and performs well on complex image segmentation tasks.
However, UNet extracts image features with plain convolutional layers only, which limits the accuracy of feature extraction and, to some extent, reduces the accuracy of image segmentation.
Disclosure of Invention
The embodiment of the application provides a training method of a semantic segmentation model, a semantic segmentation method and a semantic segmentation device, so as to improve the accuracy of image segmentation.
In a first aspect, an embodiment of the present application provides a training method for a semantic segmentation model, including:
acquiring a plurality of sample images; inputting the sample image into a semantic segmentation model to obtain a target characteristic image corresponding to the sample image; the semantic segmentation model comprises a plurality of attention models, each attention model comprises a left branch and a right branch, the left branch is used for obtaining characteristic parameters of the layer number dimension of the sample image, and the right branch is used for obtaining characteristic parameters of the width dimension and the height dimension of the sample image; calculating a loss value of the target characteristic image by adopting a cross entropy loss function; and training the semantic segmentation model according to the loss value.
Optionally, acquiring a plurality of sample images includes:
acquiring a plurality of original images shot by different users in different shooting scenes; and randomly rotating the original images to obtain a plurality of sample images, wherein the number of the sample images is larger than that of the original images.
Optionally, the semantic segmentation model includes: a feature extraction layer, a down-sampling processing layer, and an up-sampling processing layer; the down-sampling processing layer includes a down-sampling layer and a first attention model; the up-sampling processing layer includes an up-sampling layer and a second attention model, where the first attention model and the second attention model have the same structure;
inputting the sample image into a semantic segmentation model, and obtaining a target feature image corresponding to the sample image, wherein the method comprises the following steps:
inputting the sample image to the feature extraction layer to obtain a first feature image; inputting the first feature image into the first attention model to obtain a second feature image, and inputting the second feature image into the down-sampling layer to obtain a third feature image; and inputting the third feature image into the up-sampling layer to obtain a fourth feature image, and inputting the fourth feature image into the second attention model to obtain the target feature image.
Optionally, inputting the first feature image into the first attention model to obtain a second feature image, where the method includes:
inputting the first feature image to a left branch in the first attention model, and obtaining a left feature image output by the left branch; inputting the first feature image to a right branch in the first attention model to obtain a right feature image output by the right branch; and performing dot multiplication on the left characteristic image and the right characteristic image to obtain a second characteristic image.
Optionally, the left branch of the attention model comprises: left 1 branch, left 2 branch, left 3 branch; the left 1 branch comprises a first processing layer and a second processing layer, and the left 2 branch comprises a third processing layer;
inputting the first feature image into a left branch in the first attention model, and obtaining a left feature image output by the left branch, wherein the left feature image comprises:
inputting the first feature image into the first processing layer to obtain a first dimension-reduction feature image; inputting the first feature image into the third processing layer to obtain a second dimension-reduction feature image; performing matrix multiplication on the first dimension-reduction feature image and the second dimension-reduction feature image, and inputting the result into the second processing layer to obtain a fifth feature image; and performing dot multiplication on the fifth feature image and the first feature image carried on the left 3 branch to obtain the left feature image.
Optionally, the right branch of the attention model comprises: right 1 branch, right 2 branch, right 3 branch; the right 1 branch comprises a fourth processing layer and a fifth processing layer, and the right 2 branch comprises a sixth processing layer;
inputting the first feature image into a right branch in the first attention model, and obtaining a right feature image output by the right branch, wherein the right feature image comprises:
inputting the first feature image into the fourth processing layer to obtain a third dimension-reduction feature image; inputting the first feature image into the sixth processing layer to obtain a fourth dimension-reduction feature image; performing matrix multiplication on the third dimension-reduction feature image and the fourth dimension-reduction feature image, and inputting the result into the fifth processing layer to obtain a sixth feature image; and performing dot multiplication on the sixth feature image and the first feature image carried on the right 3 branch to obtain the right feature image.
In a second aspect, an embodiment of the present application provides a semantic segmentation method, including:
acquiring an image to be processed; and inputting the image to be processed into the semantic segmentation model to obtain a semantic segmentation result output by the semantic segmentation model.
In a third aspect, an embodiment of the present application provides a training apparatus for a semantic segmentation model, including:
an acquisition module for acquiring a plurality of sample images; a first training module for inputting the sample image into the semantic segmentation model to obtain a feature image corresponding to the sample image; a calculation module for calculating a loss value for the feature image using a cross-entropy loss function; and a second training module for training the semantic segmentation model according to the loss value.
In a fourth aspect, an embodiment of the present application provides a semantic segmentation apparatus, including:
the acquisition module is used for acquiring an image to be processed; and the processing module is used for inputting the image to be processed into the semantic segmentation model and obtaining a semantic segmentation result output by the semantic segmentation model.
In a fifth aspect, an embodiment of the present application provides a terminal device, including: a memory and a processor;
the memory is used for storing computer instructions; the processor is configured to execute the computer instructions stored by the memory to implement the method of any one of the first or second aspects.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium on which a computer program is stored, the computer program being executed by a processor to implement the method of any one of the first aspect or the second aspect.
In a seventh aspect, an embodiment of the present application provides a computer program product, which includes a computer program that, when executed by a processor, implements the method of any one of the first aspect or the second aspect.
Drawings
Fig. 1 is a schematic view of a scenario provided in an embodiment of the present application;
FIG. 2 is a schematic flowchart of a training method of a semantic segmentation model according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a semantic segmentation model according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an attention model provided in an embodiment of the present application;
FIG. 5 is a schematic flow chart of a semantic segmentation method according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a training apparatus for a semantic segmentation model according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a semantic segmentation apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to clearly describe the technical solutions of the embodiments of the present application, some terms and technologies referred to in the embodiments of the present application are briefly described below:
1) Semantic segmentation refers to classifying an image at the pixel level, grouping pixels that belong to the same class into one category.
2) The U-Net convolutional neural network is a classic fully convolutional network. It has no fully connected operations and consists mainly of a down-sampling process followed by an up-sampling process; the overall network is shaped like the letter "U", hence the name U-Net. All convolutional layers in the network, except the final output layer, use 3 x 3 convolutions.
3) An attention model can selectively extract a series of regions from a picture, process only the extracted regions at each step, and combine the processed information to build the corresponding scene information.
4) Other terms
In the embodiments of the present application, the terms "first" and "second" are used to distinguish identical or similar items having substantially the same functions and effects. Those skilled in the art will appreciate that these terms do not denote any order, quantity, or importance.
It should be noted that in the embodiments of the present application, words such as "exemplary" or "for example" are used to indicate examples, illustrations or descriptions. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
The following describes in detail a training method of a semantic segmentation model provided in an embodiment of the present application with reference to the accompanying drawings. It should be noted that the expression "when ..." in the embodiments of the present application may refer to the instant at which a condition occurs, or to a period of time after it occurs; the embodiments of the present application are not particularly limited in this respect.
At present, more and more application scenarios require semantic segmentation of video images, for example autonomous driving, indoor navigation, virtual reality, and image recognition.
UNet, also known as the U-Net convolutional neural network, is a widely used semantic segmentation model. It is based on a fully convolutional network, can perform convolution operations on images of arbitrary shape and size, and performs well on complex image segmentation tasks.
However, UNet extracts image features with plain convolutional layers only, which limits the accuracy of feature extraction and, to some extent, reduces the accuracy of image segmentation.
In view of the above, the present application provides a training method for a semantic segmentation model and a semantic segmentation method, which introduce attention models into a U-Net backbone and redesign the model structure, so that the feature parameters of the image to be segmented can be extracted and processed more comprehensively, yielding a more accurate segmentation result.
The following describes the technical solutions of the present invention and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings, and for convenience of description, the embodiments of the present invention are described by taking face recognition as an example, which does not limit application scenarios of the present invention.
Fig. 1 is a schematic view of a scenario of an embodiment of the present application. As shown in fig. 1, it includes a server and a terminal device, where the terminal device is connected to the server through a network for data transmission. The terminal device runs a client that collects face images of users, and the server hosts a model that semantically segments those face images; the terminal device sends the collected face images to the server, and the server segments them and returns the segmentation results.
It is understood that the server may also use a plurality of acquired face images as sample images, and train the semantic segmentation model to obtain an expected output result.
The application scenario provided by the embodiment of the present application is briefly described above, and the following takes the server applied in fig. 1 as an example to describe in detail the training method of the semantic segmentation model provided by the embodiment of the present application.
Fig. 2 is a schematic flow chart of a training method of a semantic segmentation model provided in an embodiment of the present application, including the following steps:
s201, obtaining a plurality of sample images.
The server can acquire face images shot by a plurality of users in different shooting scenes through interaction with the terminal equipment, and the face images are used as sample images for training the semantic segmentation model.
Optionally, in order to further enrich sample images required for training, a plurality of acquired face images shot by different users in different shooting scenes can be randomly rotated to obtain richer sample images.
Illustratively, the acquired face images can be randomly rotated by 90 degrees, flipped horizontally, and flipped vertically to obtain a plurality of sample images.
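As a minimal sketch of this augmentation step (the helper below and its parameters are illustrative, not from the patent; in practice the same transform must also be applied to the segmentation mask):

```python
import random

import numpy as np

def augment(image: np.ndarray) -> np.ndarray:
    """Randomly rotate an H x W x C image by a multiple of 90 degrees and randomly flip it."""
    k = random.randint(0, 3)                 # number of 90-degree rotations
    image = np.rot90(image, k, axes=(0, 1))
    if random.random() < 0.5:                # horizontal flip
        image = np.fliplr(image)
    if random.random() < 0.5:                # vertical (up-down) flip
        image = np.flipud(image)
    return image.copy()

# Expand each original image into several augmented sample images,
# so the number of sample images exceeds the number of original images.
originals = [np.zeros((256, 256, 3), dtype=np.uint8)]  # stand-in for the real face images
samples = [augment(img) for img in originals for _ in range(4)]
```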
S202, inputting the sample image into a semantic segmentation model, and obtaining a target characteristic image corresponding to the sample image.
The semantic segmentation model of the embodiment of the application adopts a U-net model as a basic structure, an attention model is introduced into the U-net model, correspondingly, the semantic segmentation model comprises a plurality of attention models, each attention model comprises a left branch and a right branch, the left branch is used for obtaining characteristic parameters of the layer number dimension of the sample image, and the right branch is used for obtaining characteristic parameters of the width dimension and the height dimension of the sample image.
An image is typically a three-dimensional array whose dimensions are the height, the width, and the number of layers (channels); the layers hold the red, green, and blue values of the image pixels. This three-dimensional array can also be referred to as the feature parameters of the image. Through the different branch processing of the attention model, the feature parameters of the sample image can be better extracted.
Optionally, the semantic segmentation model may further include a feature extraction layer, configured to extract feature parameters of the sample image, and obtain a corresponding feature image.
And S203, calculating a cross-entropy loss value for the target feature image using a cross-entropy loss function.
Specifically, the cross entropy loss value may be obtained according to a true value of a pixel of the sample image and a pixel value of the sample image output by the model.
For example, the cross entropy loss value can be calculated according to the following formula:
$$\text{Loss} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_{\text{true},i}\log\left(y_{\text{pred},i}\right) + \left(1-y_{\text{true},i}\right)\log\left(1-y_{\text{pred},i}\right)\right]$$
where y_true is the true value of a sample image pixel, i indexes the ith sample image, n is the number of sample images, and y_pred is the pixel value of the sample image output by the model.
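A minimal NumPy sketch of this computation, assuming binary per-pixel labels and sigmoid-style model outputs (the formula above reads as binary cross entropy; the function name and the clipping epsilon are illustrative):

```python
import numpy as np

def cross_entropy_loss(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-7) -> float:
    """Mean cross entropy between ground-truth pixel labels and predicted pixel values."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    per_pixel = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(per_pixel.mean())

# Example over n = 4 sample images of size 64 x 64
y_true = np.random.randint(0, 2, size=(4, 64, 64)).astype(np.float32)
y_pred = np.random.rand(4, 64, 64).astype(np.float32)
print(cross_entropy_loss(y_true, y_pred))
```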
And S204, training the semantic segmentation model according to the loss value.
And after the cross entropy loss value is obtained, judging whether the cross entropy loss value meets a preset value or not, if the cross entropy loss value does not meet the preset value, adjusting parameters of the semantic segmentation model based on the cross entropy loss value, and repeatedly executing the processes shown in S201-S203 until the cross entropy loss value meets the preset value.
For example, the preset value of the cross-entropy loss may be 0.2. When the loss value of the target feature image calculated by the server is greater than 0.2, the parameters of each layer of the semantic segmentation model are adjusted; for example, the size of the convolution kernels may be adjusted. The processes of S201 to S203 are then repeated until the loss value of the target feature image is less than or equal to the preset value.
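A hedged PyTorch sketch of the S201-S204 loop (the 0.2 threshold comes from the example above; the optimizer, learning rate, and data loader are assumptions, and only the gradient-based parameter adjustment is shown — the kernel-size adjustment mentioned above is an architecture-level change outside this sketch):

```python
import torch
from torch import nn

def train(model: nn.Module, loader, max_epochs: int = 100, threshold: float = 0.2) -> None:
    criterion = nn.BCELoss()  # cross-entropy loss on per-pixel values (assumes sigmoid outputs)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer/lr are assumptions
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for images, masks in loader:        # S201: sample images with ground-truth pixel labels
            preds = model(images)           # S202: target feature image from the model
            loss = criterion(preds, masks)  # S203: cross-entropy loss value
            optimizer.zero_grad()
            loss.backward()                 # S204: adjust model parameters from the loss
            optimizer.step()
            epoch_loss += loss.item() * images.size(0)
        epoch_loss /= len(loader.dataset)
        if epoch_loss <= threshold:         # stop once the loss meets the preset value
            break
```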
According to the method for training the semantic segmentation model, the obtained multiple sample images are input into the semantic segmentation model to obtain the target feature image, the cross entropy loss function is adopted for loss value calculation of the target feature image, and the model is trained according to the loss value, wherein the semantic segmentation model comprises multiple attention models. The attention model can fully extract the characteristic information of the sample image, and the trained semantic segmentation model is used for segmenting the image, so that the accuracy of image segmentation can be effectively improved.
The following will describe the structure of the semantic segmentation model in detail by taking the example of inputting the sample image into the semantic segmentation model on the basis of the embodiment shown in fig. 2.
As shown in fig. 3, the semantic segmentation model includes a feature extraction layer, a downsampling processing layer, an upsampling processing layer, and a convolutional layer.
The down-sampling processing layers comprise a first, a second, and a third down-sampling processing layer. Each down-sampling processing layer includes an attention model and a down-sampling layer, and each down-sampling layer includes a convolution layer (conv), an activation layer (relu), and a pooling layer (maxpool); the convolution kernel size of each convolution layer is 3 x 3, and the pooling window size of each pooling layer is 2 x 2.
The up-sampling processing layers comprise a first, a second, and a third up-sampling processing layer. Each up-sampling processing layer includes an attention model and an up-sampling layer, and each up-sampling layer includes a convolution layer, an activation layer, and an up-sampling operation layer; the convolution kernel size of each convolution layer is 3 x 3, and the kernel size of each up-sampling operation layer is 2 x 2.
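A minimal PyTorch sketch of one down-sampling layer and one up-sampling layer as just described (3 x 3 convolutions, ReLU activations, 2 x 2 max pooling and 2x up-sampling; the channel counts and the choice of bilinear up-sampling are assumptions):

```python
import torch
from torch import nn

def down_layer(c_in: int) -> nn.Sequential:
    """conv 3x3 -> relu -> maxpool 2x2: doubles the channels, halves H and W."""
    return nn.Sequential(
        nn.Conv2d(c_in, 2 * c_in, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2),
    )

def up_layer(c_in: int) -> nn.Sequential:
    """conv 3x3 -> relu -> 2x up-sampling: halves the channels, doubles H and W."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in // 2, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    )

x = torch.randn(1, 64, 128, 128)                # C x H x W = 64 x 128 x 128
print(down_layer(64)(x).shape)                  # -> (1, 128, 64, 64): 2C x H/2 x W/2
print(up_layer(128)(down_layer(64)(x)).shape)   # -> (1, 64, 128, 128)
```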
Correspondingly, the step of inputting the sample image into the semantic segmentation model and obtaining the target characteristic image corresponding to the sample image comprises the following steps.
S301, inputting the sample image to a feature extraction layer to obtain a first feature image.
The characteristic extraction layer is used for extracting characteristic parameters of the sample image, and the characteristic parameters comprise characteristic parameters of three dimensions of layer number, height and width.
For example, the first feature image output by the feature extraction layer can be represented by its feature parameters, denoted C x H x W, where C is the number of layers, H is the height, and W is the width.
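For instance, a quick shape check on a hypothetical 256 x 256 RGB image:

```python
import numpy as np

image = np.zeros((3, 256, 256))  # C x H x W: 3 layers (R, G, B), height 256, width 256
C, H, W = image.shape
print(C, H, W)  # -> 3 256 256
```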
S302, inputting the first feature image into the first attention model to obtain a second feature image, and inputting the second feature image into the down-sampling layer to obtain a third feature image.
The semantic segmentation model may include a plurality of down-sampling processing layers; this embodiment takes 3 down-sampling processing layers as an example. Each down-sampling processing layer includes a first attention model and a down-sampling layer. The detailed process of obtaining the second feature image from the first feature image, and then the third feature image from the second, is as follows; the intermediate feature images produced along the way are numbered the Nth, N+1th, and so on, to distinguish them.
And S3021, inputting the first characteristic image into the first downsampling processing layer, and obtaining an Nth characteristic image.
The first feature image is first input into the first attention model in the first down-sampling processing layer. The first attention model weights the feature parameters of the image without changing its size, so the feature image output by the first attention model is still C x H x W. That feature image is then input into the down-sampling layer to obtain the Nth feature image, whose feature parameter is 2C x H/2 x W/2.
S3022, the nth feature image is input to the second downsampling processing layer, and an N +1 th feature image is obtained.
The Nth feature image is input into the first attention model in the second down-sampling processing layer to obtain a corresponding feature image, which is then input into the down-sampling layer in the second down-sampling processing layer to obtain the N+1th feature image, whose feature parameter can be represented as 4C x H/4 x W/4.
And S3023, inputting the (N + 1) th feature image into the third downsampling processing layer to obtain a third feature image.
The N+1th feature image is input into the first attention model in the third down-sampling processing layer to obtain the second feature image, which is then input into the down-sampling layer in the third down-sampling processing layer to obtain the third feature image, whose feature parameter can be represented as 8C x H/8 x W/8.
And S303, inputting the third feature image into the up-sampling layer to obtain a fourth feature image, and inputting the fourth feature image into the second attention model to obtain the target feature image.
Each up-sampling processing layer comprises a second attention model and an up-sampling layer. The process of obtaining the target feature image from the third feature image is as follows; the processes of obtaining the fourth feature image from the third, and the target feature image from the fourth, also output several other feature images, which continue the numbering scheme introduced in S302.
S3031, inputting the third characteristic image into the third up-sampling processing layer to obtain an N +2 characteristic image.
The third feature image is input into the up-sampling layer in the third up-sampling processing layer to obtain a corresponding feature image, which is then input into the second attention model in the third up-sampling processing layer to obtain the N+2th feature image, whose feature parameter can be represented as 8C x H/4 x W/4.
And S3032, inputting the (N + 2) th characteristic image into a second up-sampling processing layer to obtain an (N + 3) th characteristic image.
The N+2th feature image is input into the up-sampling layer in the second up-sampling processing layer to obtain a corresponding feature image, which is then input into the second attention model in the second up-sampling processing layer to obtain the N+3th feature image, whose feature parameter can be represented as 4C x H/2 x W/2.
S3033, inputting the (N + 3) th characteristic image into the first up-sampling processing layer to obtain an (N + 4) th characteristic image.
The N+3th feature image is input into the up-sampling layer in the first up-sampling processing layer to obtain the fourth feature image, which is then input into the second attention model in the first up-sampling processing layer to obtain the N+4th feature image, whose feature parameter can be represented as 2C x H x W.
S3034, inputting the (N + 4) th characteristic image into the convolutional layer to obtain a target characteristic image.
The convolution layer processes the feature parameters of the N+4th feature image into feature parameters of the same scale as the first feature image.
Illustratively, the feature parameter of the N+4th feature image is 2C x H x W; the N+4th feature image is input into the convolution layer to obtain the target feature image, whose feature parameter is C x H x W.
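Putting S301 through S3034 together, the feature-parameter flow stated above can be traced with simple arithmetic (a sketch; the concrete values of C, H, and W are illustrative):

```python
C, H, W = 64, 256, 256  # illustrative starting feature parameters

stages = [
    ("first feature image",  (C,     H,      W)),       # feature extraction layer
    ("Nth feature image",    (2 * C, H // 2, W // 2)),  # 1st down-sampling processing layer
    ("N+1th feature image",  (4 * C, H // 4, W // 4)),  # 2nd down-sampling processing layer
    ("third feature image",  (8 * C, H // 8, W // 8)),  # 3rd down-sampling processing layer
    ("N+2th feature image",  (8 * C, H // 4, W // 4)),  # 3rd up-sampling processing layer
    ("N+3th feature image",  (4 * C, H // 2, W // 2)),  # 2nd up-sampling processing layer
    ("N+4th feature image",  (2 * C, H,      W)),       # 1st up-sampling processing layer
    ("target feature image", (C,     H,      W)),       # final convolution layer
]
for name, (c, h, w) in stages:
    print(f"{name}: {c} x {h} x {w}")
```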
The structure of the semantic segmentation model is explained above, and the structure of the attention model in the semantic segmentation model is explained below on the basis of the embodiment of fig. 2.
Fig. 4 is a schematic structural diagram of an attention model provided in an embodiment of the present application, and as shown in fig. 4, the attention model includes a left branch and a right branch.
Wherein the left branch comprises: left 1 branch, left 2 branch, left 3 branch; the left 1 branch includes a first processing layer and a second processing layer, and the left 2 branch includes a third processing layer. The right branch comprises: right 1 branch, right 2 branch, right 3 branch; the right 1 branch includes a fourth processing layer and a fifth processing layer, and the right 2 branch includes a sixth processing layer.
The first processing layer comprises a convolution layer and a reconstruction layer (reshape); the second processing layer comprises a convolution layer, a normalization layer (LayerNorm), and an activation layer (sigmoid); the third processing layer comprises a convolution layer, a reconstruction layer, and a classification layer (softmax); the fourth processing layer comprises a convolution layer and a reconstruction layer; the fifth processing layer comprises a reconstruction layer and an activation layer; the sixth processing layer comprises a convolution layer, a global pooling layer, a reconstruction layer, and a classification layer. The convolution kernel size of each convolution layer is 1 x 1.
Correspondingly, inputting the first feature image into the first attention model, and obtaining a second feature image includes:
s401, inputting the first feature image to a left branch in the first attention model, and obtaining a left feature image output by the left branch.
In a specific implementation, the first feature image is input into the first processing layer to obtain a first dimension-reduction feature image; the first feature image is input into the third processing layer to obtain a second dimension-reduction feature image; matrix multiplication is performed on the first and second dimension-reduction feature images, and the result is input into the second processing layer to obtain a fifth feature image; and dot multiplication is performed on the fifth feature image and the first feature image carried on the left 3 branch to obtain the left feature image.
Illustratively, the feature parameter of the first feature map is C x H x W. It is input into the first processing layer, where the convolution layer changes the feature parameter to C/2 x H x W and the reconstruction layer then produces the first dimension-reduction feature image, with feature parameter C/2 x HW. The reconstruction layer reshapes the feature image.
The first feature map is also input into the third processing layer, where the convolution layer changes the feature parameter to 1 x H x W; the reconstruction layer and the classification layer then produce the second dimension-reduction feature image, with feature parameter HW x 1.
Matrix multiplication of the first and second dimension-reduction feature images yields a feature with parameter C/2 x 1, which is input into the second processing layer; after the convolution layer, the normalization layer, and the activation layer, the fifth feature image is obtained, with feature parameter C x 1 x 1. Normalization standardizes the image features, pulls the data distribution into the non-saturated region of the activation function, and is invariant to the scaling of weights and data; this alleviates vanishing/exploding gradients, accelerates training, and acts as a regularizer.
Dot multiplication of the fifth feature image and the first feature image then yields the left feature image, with feature parameter C x H x W.
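A PyTorch sketch of the left branch as walked through above (one plausible reading of the patent text; the exact placement of the LayerNorm and the batch handling are assumptions):

```python
import torch
from torch import nn

class LeftBranch(nn.Module):
    """Layer-number (channel) attention: weights each of the C layers of the input."""

    def __init__(self, c: int):
        super().__init__()
        self.conv_left1 = nn.Conv2d(c, c // 2, kernel_size=1)  # first processing layer: conv 1x1
        self.conv_left2 = nn.Conv2d(c, 1, kernel_size=1)       # third processing layer: conv 1x1
        self.softmax = nn.Softmax(dim=1)                       # classification layer
        self.conv_out = nn.Conv2d(c // 2, c, kernel_size=1)    # second processing layer: conv 1x1
        self.norm = nn.LayerNorm([c, 1, 1])                    # normalization layer
        self.act = nn.Sigmoid()                                # activation layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.conv_left1(x).reshape(b, c // 2, h * w)           # first dim-reduction: C/2 x HW
        k = self.softmax(self.conv_left2(x).reshape(b, h * w, 1))  # second dim-reduction: HW x 1
        attn = torch.bmm(q, k).reshape(b, c // 2, 1, 1)            # matrix multiplication: C/2 x 1
        attn = self.act(self.norm(self.conv_out(attn)))            # fifth feature image: C x 1 x 1
        return x * attn                                            # dot multiply with the left 3 branch

x = torch.randn(2, 64, 32, 32)
print(LeftBranch(64)(x).shape)  # -> torch.Size([2, 64, 32, 32])
```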
S402, inputting the first feature image to the right branch in the first attention model, and obtaining a right feature image output by the right branch.
In a specific implementation, the first feature image is input into the fourth processing layer to obtain a third dimension-reduction feature image; the first feature image is input into the sixth processing layer to obtain a fourth dimension-reduction feature image; matrix multiplication is performed on the third and fourth dimension-reduction feature images, and the result is input into the fifth processing layer to obtain a sixth feature image; and dot multiplication is performed on the sixth feature image and the first feature image carried on the right 3 branch to obtain the right feature image.
Illustratively, the first feature map is input into the fourth processing layer, where the convolution layer changes the feature parameter to C/2 x H x W and the reconstruction layer then produces the third dimension-reduction feature image, with feature parameter C/2 x HW.
The first feature map is also input into the sixth processing layer: after the convolution layer the feature parameter is C/2 x H x W, after the global pooling layer it becomes C/2 x 1 x 1, and the reconstruction layer and the classification layer then produce the fourth dimension-reduction feature image, with feature parameter 1 x C/2.
Matrix multiplication of the third and fourth dimension-reduction feature images changes the feature parameter to 1 x HW; the result is input into the fifth processing layer, where the reconstruction layer and the activation layer produce the sixth feature image, with feature parameter 1 x H x W.
Dot multiplication of the sixth feature image and the first feature image then yields the right feature image, with feature parameter C x H x W.
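A matching PyTorch sketch of the right branch (again one plausible reading; taking the global pooling to be average pooling is an assumption):

```python
import torch
from torch import nn

class RightBranch(nn.Module):
    """Spatial (height/width) attention: weights each of the H x W positions of the input."""

    def __init__(self, c: int):
        super().__init__()
        self.conv_right1 = nn.Conv2d(c, c // 2, kernel_size=1)  # fourth processing layer: conv 1x1
        self.conv_right2 = nn.Conv2d(c, c // 2, kernel_size=1)  # sixth processing layer: conv 1x1
        self.pool = nn.AdaptiveAvgPool2d(1)                     # global pooling: C/2 x 1 x 1
        self.softmax = nn.Softmax(dim=2)                        # classification layer
        self.act = nn.Sigmoid()                                 # fifth processing layer: activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        v = self.conv_right1(x).reshape(b, c // 2, h * w)       # third dim-reduction: C/2 x HW
        q = self.softmax(self.pool(self.conv_right2(x)).reshape(b, 1, c // 2))  # fourth: 1 x C/2
        attn = torch.bmm(q, v).reshape(b, 1, h, w)              # matmul + reshape: 1 x H x W
        return x * self.act(attn)                               # dot multiply with the right 3 branch

x = torch.randn(2, 64, 32, 32)
print(RightBranch(64)(x).shape)  # -> torch.Size([2, 64, 32, 32])
# The full attention output (S403 below) is the elementwise product of the left and
# right branch outputs, e.g. LeftBranch(64)(x) * RightBranch(64)(x).
```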
And S403, performing dot multiplication on the left characteristic image and the right characteristic image to obtain a second characteristic image.
In this method, the left branch of the attention model reduces the dimensionality of the first feature image to obtain two feature images of different dimensions, multiplies them as matrices, and finally dot-multiplies the result with the first feature image; this extracts the layer-number-dimension information of the first feature image more effectively while preserving the height- and width-dimension information. The right branch, by a similar procedure, extracts the height- and width-dimension information more effectively while preserving the layer-number-dimension information. Finally, dot multiplication of the feature images output by the left and right branches fully extracts the information of all three dimensions of the first feature image, thereby improving the model's output.
The embodiment of the present application further provides a semantic segmentation method, as shown in fig. 5, including the following steps:
s501, acquiring an image to be processed.
S502, inputting the image to be processed into the semantic segmentation model, and obtaining a semantic segmentation result output by the semantic segmentation model.
The semantic segmentation model is obtained by the semantic segmentation model training method shown in fig. 2.
In this method, attention models are introduced into the semantic segmentation model; they weight the image feature information and fully extract and process the feature information of the sample image, improving the accuracy of the semantic segmentation model's image segmentation.
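A hedged sketch of this two-step inference flow (the file name and preprocessing are illustrative, and the model is assumed to output per-pixel class scores):

```python
import numpy as np
import torch
from PIL import Image

def segment(model: torch.nn.Module, path: str) -> np.ndarray:
    """S501: acquire the image to be processed; S502: run the trained model on it."""
    image = Image.open(path).convert("RGB")
    x = torch.from_numpy(np.asarray(image)).permute(2, 0, 1).float() / 255.0  # C x H x W
    with torch.no_grad():
        logits = model(x.unsqueeze(0))              # 1 x num_classes x H x W
    return logits.argmax(dim=1).squeeze(0).numpy()  # per-pixel class labels

# mask = segment(trained_model, "face.jpg")  # hypothetical model instance and file name
```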
The embodiment of the present application further provides a training apparatus for a semantic segmentation model, as shown in fig. 6, including: an acquisition module 601, a first training module 602, a calculation module 603, and a second training module 604.
The acquiring module 601 is configured to acquire a plurality of sample images.
The first training module 602 is configured to input the sample image into the semantic segmentation model, and obtain a feature image corresponding to the sample image.
And a calculating module 603, configured to calculate a loss value for the feature image using a cross-entropy loss function.
A second training module 604, configured to train the semantic segmentation model according to the loss value.
The training device of the semantic segmentation model provided in this embodiment may execute the technical solution of the method embodiment shown in fig. 2, and the implementation principle and the technical effect are similar, which are not described herein again.
Further, on the basis of the embodiment shown in fig. 6, the training apparatus for a semantic segmentation model provided by the present application further includes a preprocessing module 605.
The obtaining module 601 is further configured to obtain a plurality of original images shot by different users in different shooting scenes.
The preprocessing module 605 is configured to perform random rotation on the original image to obtain a plurality of sample images.
The first training module 602 is further configured to input the sample image to the feature extraction layer to obtain a first feature image; input the first feature image into the first attention model to obtain a second feature image, and input the second feature image into the down-sampling layer to obtain a third feature image; and input the third feature image into the up-sampling layer to obtain a fourth feature image, and input the fourth feature image into the second attention model to obtain the target feature image.
The embodiment of the present application further provides a semantic segmentation apparatus 70, as shown in fig. 7, including: an obtaining module 701 and a processing module 702.
The acquisition module is used for acquiring an image to be processed;
and the processing module is used for inputting the image to be processed into the semantic segmentation model and obtaining a semantic segmentation result output by the semantic segmentation model.
The semantic segmentation apparatus provided in this embodiment can execute the technical solution of the method embodiment shown in fig. 5; the implementation principle and technical effect are similar and are not repeated here.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 8, the apparatus 80 provided in this embodiment may include:
a processor 801.
A memory 802 for storing executable instructions for the electronic device.
The processor is configured to execute the above-mentioned technical solution of the semantic segmentation model training method or the semantic segmentation method embodiment by executing the executable instructions, and the implementation principle and the technical effect are similar, which are not described herein again.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the technical solution of the embodiment of the semantic segmentation model training method or the semantic segmentation method is implemented, and the implementation principle and the technical effect of the embodiment are similar, which is not described herein again.
In one possible implementation, the computer-readable medium may include Random Access Memory (RAM), Read-Only Memory (ROM), compact disc Read-Only Memory (CD-ROM) or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The embodiment of the present application further provides a computer program product, which includes a computer program, and when executed by a processor, the computer program implements the technical solution of the semantic segmentation model training method or the semantic segmentation method embodiment, and the implementation principle and the technical effect of the computer program are similar, which are not described herein again.
In the specific implementations of the terminal device or the server described above, it should be understood that the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general-purpose processor may be a microprocessor, or any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present application may be implemented directly by a hardware processor, or by a combination of hardware and software modules in a processor.
Those skilled in the art will appreciate that all or a portion of the steps of any of the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium, and when executed, performs all or part of the steps of the above-described method embodiments.
The technical scheme of the application can be stored in a computer readable storage medium if the technical scheme is realized in a software form and is sold or used as a product. Based on this understanding, all or part of the technical solutions of the present application may be embodied in the form of a software product stored in a storage medium, including a computer program or several instructions. The computer software product enables a computer device (which may be a personal computer, a server, a network device, or a similar electronic device) to perform all or part of the steps of the method described in the embodiments of the present application.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and these modifications or substitutions do not depart from the scope of the technical solutions of the embodiments of the present application.

Claims (11)

1. A training method of a semantic segmentation model is characterized by comprising the following steps:
acquiring a plurality of sample images;
inputting the sample image into a semantic segmentation model to obtain a target characteristic image corresponding to the sample image; the semantic segmentation model comprises a plurality of attention models, each attention model comprises a left branch and a right branch, the left branch is used for obtaining characteristic parameters of the layer number dimension of the sample image, and the right branch is used for obtaining characteristic parameters of the width dimension and the height dimension of the sample image;
calculating a loss value of the target characteristic image by adopting a cross entropy loss function;
and training the semantic segmentation model according to the loss value.
2. The method of claim 1, wherein the acquiring a plurality of sample images comprises:
acquiring a plurality of original images shot by different users in different shooting scenes;
and randomly rotating the original images to obtain a plurality of sample images, wherein the number of the sample images is larger than that of the original images.
3. The method of claim 1, wherein the semantic segmentation model comprises: a feature extraction layer, a down-sampling processing layer, and an up-sampling processing layer; the down-sampling processing layer comprises a down-sampling layer and a first attention model; the up-sampling processing layer comprises an up-sampling layer and a second attention model, and the first attention model and the second attention model are identical in structure;
the inputting the sample image into a semantic segmentation model to obtain a target feature image corresponding to the sample image includes:
inputting the sample image to the feature extraction layer to obtain a first feature image;
inputting the first feature image into the first attention model to obtain a second feature image, and inputting the second feature image into the downsampling layer to obtain a third feature image;
and inputting the third characteristic image into the up-sampling layer to obtain a fourth characteristic image, and inputting the fourth characteristic image into the second attention model to obtain the target characteristic image.
4. The method of claim 3, wherein inputting the first feature image to the first attention model to obtain a second feature image comprises:
inputting the first feature image to a left branch in the first attention model, and obtaining a left feature image output by the left branch;
inputting the first feature image to a right branch in the first attention model, and obtaining a right feature image output by the right branch;
and performing dot multiplication on the left characteristic image and the right characteristic image to obtain the second characteristic image.
5. The method of claim 4, wherein the left branch of the attention model comprises: left 1 branch, left 2 branch, left 3 branch; the left 1 branch comprises a first processing layer and a second processing layer, and the left 2 branch comprises a third processing layer;
the inputting the first feature image to a left branch in the first attention model, and obtaining a left feature image output by the left branch, includes:
inputting the first feature image to the first processing layer to obtain a first dimension-reduced feature image;
inputting the first feature image to the third processing layer to obtain a second dimension-reduced feature image;
performing matrix multiplication on the first dimension reduction characteristic image and the second dimension reduction characteristic image, and inputting the matrix multiplication result into the second processing layer to obtain a fifth characteristic image;
and performing dot multiplication on the fifth characteristic image and the first characteristic image carried on the left 3 branch to obtain a left characteristic image.
6. The method of claim 4, wherein the right branch of the attention model comprises: right 1 branch, right 2 branch, right 3 branch; the right 1 branch comprises a fourth processing layer and a fifth processing layer, and the right 2 branch comprises a sixth processing layer;
the inputting the first feature image to the right branch in the first attention model, and obtaining the right feature image output by the right branch, includes:
inputting the first feature image to the fourth processing layer to obtain a third dimension-reduced feature image;
inputting the first feature image to the sixth processing layer to obtain a fourth dimension-reduced feature image;
performing matrix multiplication on the third dimension-reduced characteristic image and the fourth dimension-reduced characteristic image, and inputting the matrix multiplication result into the fifth processing layer to obtain a sixth characteristic image;
and performing dot multiplication on the sixth characteristic image and the first characteristic image carried on the right 3 branch to obtain a right characteristic image.
7. A method of semantic segmentation, comprising:
acquiring an image to be processed;
inputting the image to be processed into a semantic segmentation model, and obtaining a semantic segmentation result output by the semantic segmentation model, wherein the semantic segmentation model is trained according to the method of any one of claims 1 to 6.
8. An apparatus for training a semantic segmentation model, comprising:
an acquisition module for acquiring a plurality of sample images;
the first training module is used for inputting the sample image into a semantic segmentation model to obtain a characteristic image corresponding to the sample image;
the calculation module is used for calculating a loss value for the characteristic image by adopting a cross entropy loss function;
and the second training module is used for training the semantic segmentation model according to the loss value.
9. A semantic segmentation apparatus, comprising:
the acquisition module is used for acquiring an image to be processed;
and the processing module is used for inputting the image to be processed into a semantic segmentation model and obtaining a semantic segmentation result output by the semantic segmentation model, wherein the semantic segmentation model is trained according to the method of any one of claims 1 to 6.
10. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the method of any one of claims 1-6 or to implement the method of claim 7.
11. A computer-readable storage medium, on which a computer program is stored, which computer program is executable by a processor to implement the method of any one of claims 1-6 or to implement the method of claim 7.
CN202210301937.7A 2022-03-25 2022-03-25 Training method of semantic segmentation model, semantic segmentation method and semantic segmentation device Pending CN114677512A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210301937.7A CN114677512A (en) 2022-03-25 2022-03-25 Training method of semantic segmentation model, semantic segmentation method and semantic segmentation device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210301937.7A CN114677512A (en) 2022-03-25 2022-03-25 Training method of semantic segmentation model, semantic segmentation method and semantic segmentation device

Publications (1)

Publication Number Publication Date
CN114677512A true CN114677512A (en) 2022-06-28

Family

ID=82076946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210301937.7A Pending CN114677512A (en) 2022-03-25 2022-03-25 Training method of semantic segmentation model, semantic segmentation method and semantic segmentation device

Country Status (1)

Country Link
CN (1) CN114677512A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination