CN116645504A - Image semantic segmentation method, electronic device and storage medium - Google Patents

Image semantic segmentation method, electronic device and storage medium

Info

Publication number
CN116645504A
CN116645504A
Authority
CN
China
Prior art keywords
module
conv
image
semantic segmentation
modules
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310483437.4A
Other languages
Chinese (zh)
Inventor
秦宗琛
巫延江
戴亦军
苏晓杰
喻杜峰
王楷
孙少欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
China Construction Tunnel Construction Co Ltd
Original Assignee
Chongqing University
China Construction Tunnel Construction Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University, China Construction Tunnel Construction Co Ltd filed Critical Chongqing University
Priority to CN202310483437.4A priority Critical patent/CN116645504A/en
Publication of CN116645504A publication Critical patent/CN116645504A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an image semantic segmentation method, an electronic device and a storage medium. The method comprises the following steps: inputting an image to be detected into a trained semantic segmentation model; extracting features of the image to be detected through a backbone network in the semantic segmentation model to obtain a first feature set, wherein the backbone network comprises a Conv module, a C3 module and an SPPF module; inputting the first feature set into a Hamburger Head module in the semantic segmentation model to obtain a second feature set; and inputting the second feature set into a detection head to obtain a detection result. Thus, based on the lightweight backbone network combined with the attention mechanism of the Hamburger Head module, the semantic segmentation model retains a certain accuracy while realizing lightweight semantic segmentation, which helps to reduce the computational load of the model during semantic segmentation and to improve its operating efficiency.

Description

Image semantic segmentation method, electronic device and storage medium
Technical Field
The present application relates to the field of computer vision, and in particular, to an image semantic segmentation method, an electronic device, and a storage medium.
Background
Semantic segmentation is one of the important means of robot environment perception. Current research on semantic segmentation focuses on improving segmentation accuracy, but when applied in fields with limited computing power, such as mobile robotics, the currently popular semantic segmentation models (such as SegFormer and MaskFormer) generally suffer from poor timeliness caused by an excessive computational load. In the field of mobile robots, for example, the running speed and timeliness of an algorithm matter as much as its accuracy. An algorithm that is accurate but slow causes the mobile robot's motion to stall and may even endanger the robot's own safety.
Disclosure of Invention
Accordingly, an object of the embodiments of the present application is to provide an image semantic segmentation method, an electronic device, and a storage medium that can alleviate the problem of excessive computation in the image semantic segmentation process.
In order to achieve the technical purpose, the application adopts the following technical scheme:
in a first aspect, an embodiment of the present application provides an image semantic segmentation method, where the method includes:
acquiring an image to be detected;
inputting the image to be detected into a trained semantic segmentation model;
extracting features of the image to be detected through a backbone network in the semantic segmentation model to obtain a first feature set, wherein the backbone network comprises a Conv module, a C3 module and an SPPF module;
inputting the first feature set into a Hamburger Head module in the semantic segmentation model to obtain a second feature set;
and inputting the second feature set into a detection head to obtain a detection result, wherein the detection result comprises an indication of whether a target exists in the image to be detected and, when the target exists, a mark corresponding to the target.
With reference to the first aspect, in some optional embodiments, acquiring an image to be detected includes:
obtaining the image to be detected by photographing the environment with a camera.
With reference to the first aspect, in some optional embodiments, before inputting the image to be detected into the trained semantic segmentation model, the method further includes:
establishing the backbone network based on Conv modules, C3 modules and SPPF modules, wherein the backbone network comprises a first Conv module, a second Conv module, two first C3 modules, a third Conv module, four second C3 modules, a fourth Conv module, six third C3 modules, a fifth Conv module, two fourth C3 modules and an SPPF module which are sequentially connected in series; the first Conv module serves as the input end of the backbone network, and the output ends of the backbone network comprise the second C3 modules, the third C3 modules and the SPPF module.
With reference to the first aspect, in some optional embodiments, performing feature extraction on the image to be detected through the backbone network in the semantic segmentation model to obtain a first feature set includes:
extracting features of the image to be detected through the first Conv module, the second Conv module, the two first C3 modules, the third Conv module and the four second C3 modules in the backbone network to obtain first-stage features;
extracting features from the first-stage features through the fourth Conv module and the six third C3 modules in the backbone network to obtain second-stage features;
extracting features from the second-stage features through the fifth Conv module, the two fourth C3 modules and the SPPF module in the backbone network to obtain third-stage features;
taking the first-stage features, the second-stage features and the third-stage features as the first feature set.
With reference to the first aspect, in some optional embodiments, before inputting the image to be detected into the trained semantic segmentation model, the method further includes:
creating the Hamburger Head module based on a Conv module and an NMF (Non-negative Matrix Factorization) module, wherein the Hamburger Head module comprises a sixth Conv module, an NMF module and a seventh Conv module which are sequentially connected in series; the sixth Conv module serves as the input end of the Hamburger Head module, and the seventh Conv module serves as the output end of the Hamburger Head module.
With reference to the first aspect, in some optional embodiments, inputting the first feature set into the Hamburger Head module in the semantic segmentation model to obtain the second feature set includes:
inputting the first feature set into the sixth Conv module of the Hamburger Head module, and performing fusion processing on the first feature set based on an attention mechanism through the sixth Conv module, the NMF module and the seventh Conv module to obtain the second feature set.
With reference to the first aspect, in some optional embodiments, the C3 module includes an eighth Conv module, a ninth Conv module, a tenth Conv module, and a Bottleneck module;
the eighth Conv module is connected in parallel with the ninth Conv module and then connected in series with the tenth Conv module, and the Bottleneck module is arranged between the ninth Conv module and the tenth Conv module.
With reference to the first aspect, in some optional embodiments, the Bottleneck module includes: two Conv modules and a shortcut module which are sequentially connected in series.
In a second aspect, an embodiment of the present application further provides an electronic device comprising a processor and a memory coupled to each other, the memory storing a computer program which, when executed by the processor, causes the electronic device to perform the method described above.
In a third aspect, embodiments of the present application also provide a computer-readable storage medium having stored therein a computer program which, when run on a computer, causes the computer to perform the above-described method.
The technical scheme adopted by the application has the following advantages:
in the technical scheme provided by the application, the backbone network of the semantic segmentation model consists of Conv modules, C3 modules and SPPF modules, so that features of different levels of the image to be detected can be rapidly extracted to obtain a feature set; the feature set is input into the Hamburger Head module for computation through a lightweight attention mechanism, and the detection head then rapidly identifies and marks the surrounding environment and task targets in the image. Thus, based on the lightweight backbone network combined with the attention mechanism of the Hamburger Head module, the semantic segmentation model retains a certain accuracy while realizing lightweight semantic segmentation, which helps to reduce the computational load of the model during semantic segmentation and to improve its operating efficiency.
Drawings
The application may be further illustrated by means of the non-limiting examples given in the accompanying drawings. It is to be understood that the following drawings illustrate only certain embodiments of the application and are therefore not to be considered limiting of its scope; a person of ordinary skill in the art may obtain other relevant drawings from these drawings without inventive effort.
Fig. 1 is a flow chart of an image semantic segmentation method according to an embodiment of the present application.
Fig. 2 is a network structure block diagram of a semantic segmentation model according to an embodiment of the present application.
Fig. 3 is a network structure block diagram of a backbone network according to an embodiment of the present application.
Fig. 4 is a network structure block diagram of a Conv module according to an embodiment of the present application.
Fig. 5 is a network structure block diagram of a C3 module according to an embodiment of the present application.
Fig. 6 is a network structure block diagram of an SPPF module according to an embodiment of the present application.
Fig. 7 is a network structure block diagram of a Hamburger Head module according to an embodiment of the present application.
Detailed Description
The present application will be described in detail below with reference to the drawings and the specific embodiments, wherein like or similar parts are designated by the same reference numerals throughout the drawings or the description, and implementations not shown or described in the drawings are in a form well known to those of ordinary skill in the art. In the description of the present application, the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
The embodiment of the application provides an electronic device which may comprise a processing module and a storage module. The storage module stores a computer program which, when executed by the processing module, enables the electronic device to perform the respective steps of the image semantic segmentation method described below.
Understandably, the electronic device may perform semantic segmentation on the image to be detected using the image semantic segmentation method described below, so as to classify and mark the corresponding targets in the image to be detected. The target classes may be, but are not limited to, lanes, sidewalks, buildings, traffic lights, plants, sky, vehicles, pedestrians, and the like. The electronic device may be, but is not limited to, a personal computer, a robot with data processing and analysis functions, or the like.
Referring to fig. 1, the present application further provides an image semantic segmentation method which can be applied to the above-mentioned electronic device, the electronic device executing or implementing the steps of the method. The image semantic segmentation method may comprise the following steps:
step 110, obtaining an image to be detected;
step 120, inputting the image to be detected into a trained semantic segmentation model;
step 130, extracting features of the image to be detected through a backbone network in the semantic segmentation model to obtain a first feature set, wherein the backbone network comprises a Conv module, a C3 module and an SPPF module;
step 140, inputting the first feature set into a Hamburger Head module in the semantic segmentation model to obtain a second feature set;
and step 150, inputting the second feature set into a detection head to obtain a detection result, wherein the detection result comprises an indication of whether a target exists in the image to be detected and, when the target exists, a mark corresponding to the target.
The steps of the image semantic segmentation method will be described in detail as follows:
in step 110, the manner of acquiring the image to be measured may be flexibly determined according to the actual situation. For example, the electronic device may periodically acquire an image frame from the camera and take the image frame as an image to be measured. Alternatively, the electronic device may acquire a pre-stored image from the local as the image to be measured.
As an alternative embodiment, step 110 may include:
obtaining the image to be detected by photographing the environment with a camera.
For example, the electronic device is a robot provided with a camera. The processing module of the robot may acquire from the camera an image of the robot's surroundings captured by the camera and use it as the image to be detected.
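As a minimal illustration (not part of the patent text), acquiring an image to be detected from a camera might look like the following Python sketch; the OpenCV device index and the color conversion are assumptions:

    import cv2  # OpenCV, assumed available

    cap = cv2.VideoCapture(0)   # open the default camera (index 0 is an assumption)
    ok, frame = cap.read()      # grab one image frame
    if ok:
        # convert BGR to RGB before handing the frame to the model
        image_to_detect = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    cap.release()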
Referring to fig. 2 to 7 in combination, in step 120, the semantic segmentation model includes a YOLOv5-based Backbone network and a Hamburger Head module. After the image to be detected is input into the trained and tested semantic segmentation model, it can be processed by each module in the model to obtain the feature sets of the image to be detected.
Referring to fig. 3, before step 120, the method may further include:
establishing the backbone network based on Conv modules, C3 modules and SPPF modules, wherein the backbone network comprises a first Conv module, a second Conv module, two first C3 modules, a third Conv module, four second C3 modules, a fourth Conv module, six third C3 modules, a fifth Conv module, two fourth C3 modules and an SPPF module which are sequentially connected in series; the first Conv module serves as the input end of the backbone network, and the output ends of the backbone network comprise the second C3 modules, the third C3 modules and the SPPF module.
Referring to fig. 4, it can be understood that the Conv module is a standard convolution module commonly used in convolutional neural networks for extracting local spatial information from input features (such as the image to be detected). The Conv module consists mainly of a convolutional layer Conv, a BN layer and a Leaky_ReLU activation function.
The inputs and outputs of the Conv module can be expressed as:
Out=Leaky_Relu(Batch_Normalization(conv(Input)))
where Out refers to the output data, Input refers to the input data, and Batch_Normalization refers to batch normalization.
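A minimal PyTorch sketch of such a Conv module is given below; the kernel size, stride and LeakyReLU slope are illustrative assumptions rather than values taken from the patent:

    import torch
    import torch.nn as nn

    class Conv(nn.Module):
        """Conv2d + BatchNorm + Leaky_ReLU, i.e. Out = Leaky_ReLU(BN(conv(Input)))."""
        def __init__(self, c_in, c_out, k=3, s=1):
            super().__init__()
            self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
            self.bn = nn.BatchNorm2d(c_out)
            self.act = nn.LeakyReLU(0.1, inplace=True)

        def forward(self, x):
            return self.act(self.bn(self.conv(x)))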
Referring to fig. 5, the C3 module includes an eighth Conv module, a ninth Conv module, a tenth Conv module, and a Bottleneck module. The eighth Conv module is connected in parallel with the ninth Conv module and then connected in series with the tenth Conv module, and the Bottleneck module is arranged between the ninth Conv module and the tenth Conv module. The C3 module increases the depth and receptive field of the network and improves its feature extraction capability.
The Bottleneck module includes two Conv modules and a shortcut module (not shown) which are sequentially connected in series. The Bottleneck module may halve the size of the feature map, which enlarges the receptive field of the network while reducing the amount of computation. With the feature map halved in size, the network can attend more to the global information of objects, thereby improving the feature extraction effect. The shortcut module controls whether a residual connection is made.
Following the structure above, the input and output of the Bottleneck module and the C3 module may be expressed as:
Out_Bottleneck = Input + conv(conv(Input)), when the shortcut module enables the residual connection
Out_C3 = conv(Concat(conv(Input), Bottleneck(conv(Input))))
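A sketch of the Bottleneck and C3 modules consistent with the structure above follows, reusing the Conv class from the previous sketch; the channel split and the shortcut condition follow the YOLOv5 convention and are assumptions:

    class Bottleneck(nn.Module):
        def __init__(self, c_in, c_out, shortcut=True):
            super().__init__()
            self.cv1 = Conv(c_in, c_out // 2, k=1)
            self.cv2 = Conv(c_out // 2, c_out, k=3)
            # the shortcut module controls whether a residual connection is made
            self.add = shortcut and c_in == c_out

        def forward(self, x):
            y = self.cv2(self.cv1(x))
            return x + y if self.add else y

    class C3(nn.Module):
        def __init__(self, c_in, c_out, n=1):
            super().__init__()
            c_h = c_out // 2
            self.cv8 = Conv(c_in, c_h, k=1)   # parallel branch (eighth Conv module)
            self.cv9 = Conv(c_in, c_h, k=1)   # branch feeding the Bottlenecks (ninth Conv module)
            self.m = nn.Sequential(*(Bottleneck(c_h, c_h) for _ in range(n)))
            self.cv10 = Conv(2 * c_h, c_out, k=1)  # fuses both branches (tenth Conv module)

        def forward(self, x):
            return self.cv10(torch.cat((self.cv8(x), self.m(self.cv9(x))), dim=1))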
referring to fig. 6, the SPPF module refers to a fast spatial pyramid pooling module, and the SPPF module continuously adopts three maximum pooling steps, so that the receptive field is greatly enlarged by multi-scale fusion of the results and inputs of the three steps. The SPPF module can apply receptive fields with different sizes to the same image, so that characteristic information with different scales can be captured. The input and output of the SPPF module can be expressed as:
where Layer1 through Layer4 refer to output data of the corresponding network Layer.
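A sketch of the SPPF module matching Layer1 through Layer4 above; the 5×5 pooling kernel and the channel widths are assumptions carried over from YOLOv5:

    class SPPF(nn.Module):
        def __init__(self, c_in, c_out):
            super().__init__()
            c_h = c_in // 2
            self.cv1 = Conv(c_in, c_h, k=1)
            self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
            self.cv2 = Conv(4 * c_h, c_out, k=1)

        def forward(self, x):
            layer1 = self.cv1(x)        # Layer1
            layer2 = self.pool(layer1)  # Layer2: first max pooling
            layer3 = self.pool(layer2)  # Layer3: second max pooling
            layer4 = self.pool(layer3)  # Layer4: third max pooling
            # multi-scale fusion of the three pooling results with the input
            return self.cv2(torch.cat((layer1, layer2, layer3, layer4), dim=1))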
Referring to fig. 3 again, in this embodiment, step 130 may include:
extracting features of the image to be detected through the first Conv module, the second Conv module, the two first C3 modules, the third Conv module and the four second C3 modules in the backbone network to obtain first-stage features;
extracting features from the first-stage features through the fourth Conv module and the six third C3 modules in the backbone network to obtain second-stage features;
extracting features from the second-stage features through the fifth Conv module, the two fourth C3 modules and the SPPF module in the backbone network to obtain third-stage features;
taking the first-stage features, the second-stage features and the third-stage features as the first feature set.
In this embodiment, the backbone network takes the output data of the three stages (i.e., the first-stage, second-stage and third-stage features) and inputs them into the attention mechanism of the Hamburger Head module. Features of different scales are extracted from different layers of the backbone network, and each scale carries rich information: low-level features have rich spatial position information although their semantic information is limited, while high-level features have rich semantic information but lose spatial position information. The Hamburger Head module subsequently fuses the information of the different stages, combining the high-level and low-level features so that the feature maps of different scales possess rich spatial position information as well as rich semantic information.
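Putting the modules together, a sketch of the three-stage backbone follows, reusing the Conv, C3 and SPPF sketches above; the module counts match the text, while the channel widths and strides are illustrative assumptions:

    class Backbone(nn.Module):
        def __init__(self):
            super().__init__()
            self.stage1 = nn.Sequential(
                Conv(3, 32, k=6, s=2),              # first Conv module
                Conv(32, 64, s=2),                  # second Conv module
                *(C3(64, 64) for _ in range(2)),    # two first C3 modules
                Conv(64, 128, s=2),                 # third Conv module
                *(C3(128, 128) for _ in range(4)),  # four second C3 modules
            )
            self.stage2 = nn.Sequential(
                Conv(128, 256, s=2),                # fourth Conv module
                *(C3(256, 256) for _ in range(6)),  # six third C3 modules
            )
            self.stage3 = nn.Sequential(
                Conv(256, 512, s=2),                # fifth Conv module
                *(C3(512, 512) for _ in range(2)),  # two fourth C3 modules
                SPPF(512, 512),                     # SPPF module
            )

        def forward(self, x):
            f1 = self.stage1(x)   # first-stage features
            f2 = self.stage2(f1)  # second-stage features
            f3 = self.stage3(f2)  # third-stage features
            return f1, f2, f3     # the first feature set (the three output ends)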
Referring to fig. 7, before step 120, the method may further include:
creating the Hamburger Head module based on a Conv module and an NMF (Non-negative Matrix Factorization) module, wherein the Hamburger Head module comprises a sixth Conv module, an NMF module and a seventh Conv module which are sequentially connected in series; the sixth Conv module serves as the input end of the Hamburger Head module, and the seventh Conv module serves as the output end of the Hamburger Head module.
It should be noted that, after the backbone network and the Hamburger Head module of the semantic segmentation model are created, a tester may train and test the semantic segmentation model on pre-prepared training and test image sets. Both image sets contain a large number of images with classified, labeled targets. The training and testing procedure is conventional and is not described in detail here.
Step 140 may include:
inputting the first feature set into the sixth Conv module of the Hamburger Head module, and performing fusion processing on the first feature set based on an attention mechanism through the sixth Conv module, the NMF module and the seventh Conv module to obtain the second feature set.
It will be appreciated that in this embodiment the Hamburger Head module further processes each feature in the first feature set. The Hamburger Head module uses a lightweight attention mechanism realized by matrix decomposition, so that the semantic segmentation model retains a certain accuracy while the computational load of the model is greatly reduced, its operating efficiency is improved, its running time is shortened, and the computing resources occupied by the electronic device during semantic segmentation are reduced.
In this embodiment, the attention mechanism consists of one matrix decomposition and two linear transformations. First, a linear transformation W_l maps the input Z into a feature space; a low-rank signal subspace is then obtained by non-negative matrix factorization; finally, another linear transformation W_u converts the extracted signal into the output, expressed as:
H(Z) = W_u M(W_l Z)
where H(Z) refers to the output of the Hamburger Head module, Z refers to the backbone network output, W_l corresponds to the sixth Conv module, M refers to the NMF module, and W_u corresponds to the seventh Conv module. W_l ∈ ℝ^(d_z×d) and W_u ∈ ℝ^(d×d_z) are the linear transformations, with the superscripts d_z×d and d×d_z denoting their dimensions.
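A sketch of a Hamburger Head implementing H(Z) = W_u M(W_l Z) follows, reusing the Conv class from the earlier sketch. The multiplicative-update NMF, its rank and iteration count, and the channel widths are assumptions (the patent only specifies Conv → NMF → Conv), and resizing and concatenating the three stage features into a single input Z is assumed to happen beforehand:

    class NMF2D(nn.Module):
        """Non-negative matrix factorization V ≈ D·S via multiplicative updates (assumed variant)."""
        def __init__(self, rank=64, iters=6):
            super().__init__()
            self.rank, self.iters = rank, iters

        def forward(self, x):
            b, c, h, w = x.shape
            v = x.relu().flatten(2) + 1e-6                        # non-negative V: (B, C, HW)
            d = torch.rand(b, c, self.rank, device=x.device)      # dictionary D
            s = torch.rand(b, self.rank, h * w, device=x.device)  # codes S
            for _ in range(self.iters):                           # multiplicative update rules
                s = s * (d.transpose(1, 2) @ v) / (d.transpose(1, 2) @ d @ s + 1e-6)
                d = d * (v @ s.transpose(1, 2)) / (d @ s @ s.transpose(1, 2) + 1e-6)
            return (d @ s).view(b, c, h, w)                       # low-rank reconstruction

    class HamburgerHead(nn.Module):
        def __init__(self, c_in, c_mid=256):
            super().__init__()
            self.lower = Conv(c_in, c_mid, k=1)   # sixth Conv module (W_l)
            self.nmf = NMF2D()                    # NMF module (M)
            self.upper = Conv(c_mid, c_mid, k=1)  # seventh Conv module (W_u)

        def forward(self, z):
            return self.upper(self.nmf(self.lower(z)))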
The data output by the Hamburger Head module (i.e., the second feature set) is input into the detection head for Mask annotation. For example, a tester may select lanes, sidewalks, buildings, traffic lights, plants, sky, vehicles, and so on as detection targets. If a corresponding target appears in the image to be detected, the detection head performs Mask annotation on the image based on the target's class information and pixel positions. The detection result thereby reflects the class information and pixel positions of the target object, so the robot can understand the environment information and perform corresponding task operations accordingly. Targets of the same class receive the same label, and different targets receive different labels, which realizes the classification and detection of the corresponding targets. The annotation may be realized by highlighting in different colors.
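As a hedged illustration of this Mask annotation step (the palette and the blending weight are assumptions, not values specified by the patent):

    import numpy as np

    def annotate(image, mask, palette):
        # image: HxWx3 uint8, mask: HxW integer class IDs, palette: {class_id: (r, g, b)}
        overlay = image.copy()
        for class_id, color in palette.items():
            overlay[mask == class_id] = color  # same class, same highlight colour
        return (0.5 * image + 0.5 * overlay).astype(np.uint8)  # blend over the input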
In a semantic segmentation network model, the key links are usually the Backbone module and the Head module. The Backbone module extracts feature information from the picture for use by subsequent modules, and the Head module fuses the feature information output by the preceding modules and outputs the Mask result expected by the tester. The inventors found through research that, to design a lightweight semantic segmentation method suitable for electronic devices with weak computing power such as mobile robots, the Backbone module and the Head module of the network must retain a certain accuracy while the computational load is reduced, so that such devices can perform image semantic segmentation alongside other tasks and still operate normally.
Based on this design, a lightweight semantic segmentation model is constructed from the YOLOv5-based backbone network and the Hamburger Head. The semantic segmentation model greatly improves operating speed while retaining the detection accuracy for large-scale targets. This avoids the situation in which an electronic device (such as a mobile robot) stalls during task execution because a great deal of time is spent waiting for visual processing, preventing other items from running normally; the timeliness and operational safety of the electronic device are thereby improved. When the electronic device is a mobile robot, this helps the robot understand and recognize its current environment and supports the realization of downstream tasks.
In this embodiment, the processing module may be an integrated circuit chip with signal processing capability. The processing module may be a general purpose processor. For example, the processor may be a central processing unit (Central Processing Unit, CPU), digital signal processor (Digital Signal Processing, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the application.
The storage module may be, but is not limited to, a random access memory, a read-only memory, a programmable read-only memory, an erasable programmable read-only memory, an electrically erasable programmable read-only memory, and the like. In this embodiment, the storage module may be configured to store the image to be detected, the detection result, the semantic segmentation model, and the like. Of course, the storage module may also be used to store a program, which the processing module executes after receiving an execution instruction.
It should be noted that, for convenience and brevity of description, specific working processes of the electronic device described above may refer to corresponding processes of each step in the foregoing method, and will not be described in detail herein.
The embodiment of the application also provides a computer readable storage medium. The computer-readable storage medium has stored therein a computer program which, when run on a computer, causes the computer to perform the image semantic segmentation method as described in the above embodiments.
From the foregoing description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by hardware, or by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (a CD-ROM, a USB flash drive, a removable hard disk, etc.) and includes several instructions for causing a computer device (a personal computer, an electronic device, a network device, etc.) to execute the methods described in the respective implementation scenarios of the present application.
In summary, the application provides an image semantic segmentation method, an electronic device and a storage medium. In the scheme, an image to be detected is acquired and input into a trained semantic segmentation model; features of the image to be detected are extracted through a backbone network in the semantic segmentation model to obtain a first feature set, the backbone network comprising a Conv module, a C3 module and an SPPF module; the first feature set is input into a Hamburger Head module in the semantic segmentation model to obtain a second feature set; and the second feature set is input into a detection head to obtain a detection result, the detection result comprising an indication of whether a target exists in the image to be detected and, when the target exists, a mark corresponding to the target. Thus, based on the lightweight backbone network combined with the attention mechanism of the Hamburger Head module, the semantic segmentation model retains a certain accuracy while realizing lightweight semantic segmentation, which helps to reduce the computational load of the model during semantic segmentation and to improve its operating efficiency.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus, system and method may be implemented in other manners as well. The above-described apparatus, system, and method embodiments are merely illustrative, for example, flow charts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. In addition, functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method of semantic segmentation of an image, the method comprising:
acquiring an image to be detected;
inputting the image to be detected into a trained semantic segmentation model;
extracting features of the image to be detected through a backbone network in the semantic segmentation model to obtain a first feature set, wherein the backbone network comprises a Conv module, a C3 module and an SPPF module;
inputting the first feature set into a Hamburger Head module in the semantic segmentation model to obtain a second feature set;
and inputting the second feature set into a detection head to obtain a detection result, wherein the detection result comprises an indication of whether a target exists in the image to be detected and, when the target exists, a mark corresponding to the target.
2. The method of claim 1, wherein acquiring the image to be detected comprises:
obtaining the image to be detected by photographing the environment with a camera.
3. The method of claim 1, wherein before inputting the image to be detected into the trained semantic segmentation model, the method further comprises:
establishing the backbone network based on Conv modules, C3 modules and SPPF modules, wherein the backbone network comprises a first Conv module, a second Conv module, two first C3 modules, a third Conv module, four second C3 modules, a fourth Conv module, six third C3 modules, a fifth Conv module, two fourth C3 modules and an SPPF module which are sequentially connected in series; the first Conv module serves as the input end of the backbone network, and the output ends of the backbone network comprise the second C3 modules, the third C3 modules and the SPPF module.
4. The method according to claim 3, wherein performing feature extraction on the image to be detected through the backbone network in the semantic segmentation model to obtain the first feature set comprises:
extracting features of the image to be detected through the first Conv module, the second Conv module, the two first C3 modules, the third Conv module and the four second C3 modules in the backbone network to obtain first-stage features;
extracting features from the first-stage features through the fourth Conv module and the six third C3 modules in the backbone network to obtain second-stage features;
extracting features from the second-stage features through the fifth Conv module, the two fourth C3 modules and the SPPF module in the backbone network to obtain third-stage features;
taking the first-stage features, the second-stage features and the third-stage features as the first feature set.
5. The method of claim 1, wherein before inputting the image to be detected into the trained semantic segmentation model, the method further comprises:
creating the Hamburger Head module based on a Conv module and an NMF (Non-negative Matrix Factorization) module, wherein the Hamburger Head module comprises a sixth Conv module, an NMF module and a seventh Conv module which are sequentially connected in series; the sixth Conv module serves as the input end of the Hamburger Head module, and the seventh Conv module serves as the output end of the Hamburger Head module.
6. The method of claim 5, wherein inputting the first feature set into the Hamburger Head module in the semantic segmentation model to obtain the second feature set comprises:
inputting the first feature set into the sixth Conv module of the Hamburger Head module, and performing fusion processing on the first feature set based on an attention mechanism through the sixth Conv module, the NMF module and the seventh Conv module to obtain the second feature set.
7. The method of any one of claims 1-6, wherein the C3 module comprises an eighth Conv module, a ninth Conv module, a tenth Conv module, and a Bottleneck module;
the eighth Conv module is connected in parallel with the ninth Conv module and then connected in series with the tenth Conv module, and the Bottleneck module is arranged between the ninth Conv module and the tenth Conv module.
8. The method of claim 7, wherein the Bottleneck module comprises: two Conv modules and a shortcut module which are connected in series in sequence.
9. An electronic device comprising a processor and a memory coupled to each other, the memory storing a computer program that, when executed by the processor, causes the electronic device to perform the method of any of claims 1-8.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when run on a computer, causes the computer to perform the method according to any of claims 1-8.
CN202310483437.4A 2023-04-28 2023-04-28 Image semantic segmentation method, electronic device and storage medium Pending CN116645504A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310483437.4A 2023-04-28 2023-04-28 Image semantic segmentation method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310483437.4A 2023-04-28 2023-04-28 Image semantic segmentation method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN116645504A 2023-08-25

Family

ID=87623787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310483437.4A Pending CN116645504A (en) 2023-04-28 2023-04-28 Image semantic segmentation method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN116645504A (en)


Legal Events

Date Code Title Description
PB01 Publication