CN110796665B

CN110796665B - Image segmentation method and related product

Info

Publication number: CN110796665B
Application number: CN201911000291.3A
Authority: CN
Inventors: 吴佳涛
Original assignee: Shanghai Jinsheng Communication Technology Co ltd; Guangdong Oppo Mobile Telecommunications Corp Ltd
Current assignee: Shanghai Jinsheng Communication Technology Co ltd; Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date: 2019-10-21
Filing date: 2019-10-21
Publication date: 2022-04-22
Anticipated expiration: 2039-10-21
Also published as: CN110796665A

Abstract

The embodiment of the application discloses an image segmentation method and a related product, wherein the method comprises the following steps: acquiring a target image, wherein the target image comprises a preset target; and inputting the target image into a preset semantic segmentation network to obtain a target segmentation result, wherein the preset semantic segmentation network comprises a space path module, a context path module, a simplified feature fusion module, a first connection module, a second connection module and a convolution module. By adopting the embodiment of the application, the image segmentation precision can be improved.

Description

Image segmentation method and related product

Technical Field

The present application relates to the field of image processing technologies, and in particular, to an image segmentation method and a related product.

Background

With the widespread use of electronic devices (such as mobile phones, tablet computers, and the like), the electronic devices have more and more applications and more powerful functions, and the electronic devices are developed towards diversification and personalization, and become indispensable electronic products in the life of users.

At present, image processing technologies are becoming more popular, and especially, although a semantic segmentation network can implement image segmentation, segmentation accuracy has certain limitations, so that a problem of how to improve the segmentation accuracy of a voice segmentation network is urgently needed to be solved.

Disclosure of Invention

The embodiment of the application provides an image segmentation method and a related product, which can improve the image segmentation precision.

In a first aspect, an embodiment of the present application provides an image segmentation method, where the method includes:

acquiring a target image, wherein the target image comprises a preset target;

inputting the target image into a preset semantic segmentation network to obtain a target segmentation result, wherein the preset semantic segmentation network comprises a space path module, a context path module, a simplified feature fusion module, a first connection module, a second connection module and a convolution module, the space path module comprises a 2-time down-sampling convolution layer and a first 4-time down-sampling convolution layer, the context path module comprises a second 4-time down-sampling convolution layer, an 8-time down-sampling convolution layer, a 16-time down-sampling convolution layer, a 32-time down-sampling convolution layer and a third connection module, wherein the second 4-time down-sampling convolution layer, the 8-time down-sampling convolution layer, the 16-time down-sampling convolution layer and the 32-time down-sampling convolution layer are connected with the third connection module through an attention optimization module and a multiplier, and the first 4-time down-sampling convolution layer is connected with the first connection module through a multiplier, The second connection module is connected with the first connection module, the first connection module is connected with the simplified feature fusion module, the simplified feature fusion module is connected with the second connection module, the 2-time down-sampling convolution layer is connected with the second connection module through a decoder, and the second connection module is connected with the convolution module.

In a second aspect, an embodiment of the present application provides an image segmentation apparatus, including:

the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a target image which comprises a preset target;

the segmentation unit is used for inputting the target image into a preset semantic segmentation network to obtain a target segmentation result, the preset semantic segmentation network comprises a space path module, a context path module, a simplified feature fusion module, a first connection module, a second connection module and a convolution module, the space path module comprises a 2-time down-sampling convolutional layer and a first 4-time down-sampling convolutional layer, the context path module comprises a second 4-time down-sampling convolutional layer, an 8-time down-sampling convolutional layer, a 16-time down-sampling convolutional layer, a 32-time down-sampling convolutional layer and a third connection module, wherein the second 4-time down-sampling convolutional layer, the 8-time down-sampling convolutional layer, the 16-time down-sampling convolutional layer and the 32-time down-sampling convolutional layer are connected with the third connection module through an attention optimization module and a multiplier, and the first connection module is connected with the first connection module through a multiplier for the first 4-time down-sampling convolutional layer, The second connection module is connected with the first connection module, the first connection module is connected with the simplified feature fusion module, the simplified feature fusion module is connected with the second connection module, the 2-time down-sampling convolution layer is connected with the second connection module through a decoder, and the second connection module is connected with the convolution module.

In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the program includes instructions for executing the steps in the first aspect of the embodiment of the present application.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program for electronic data exchange, where the computer program enables a computer to perform some or all of the steps described in the first aspect of the embodiment of the present application.

In a fifth aspect, embodiments of the present application provide a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, where the computer program is operable to cause a computer to perform some or all of the steps as described in the first aspect of the embodiments of the present application. The computer program product may be a software installation package.

The embodiment of the application has the following beneficial effects:

it can be seen that the image segmentation method and the related product described in the embodiments of the present application obtain a target image, where the target image includes a preset target, and input the target image into a preset semantic segmentation network to obtain a target segmentation result, where the preset semantic segmentation network includes a spatial path module, a context path module, a simplified feature fusion module, a first connection module, a second connection module, and a convolution module, the spatial path module includes a 2-fold down-sampling convolutional layer and a first 4-fold down-sampling convolutional layer, the context path module includes a second 4-fold down-sampling convolutional layer, an 8-fold down-sampling convolutional layer, a 16-fold down-sampling convolutional layer, a 32-fold down-sampling convolutional layer, and a third connection module, where the second 4-fold down-sampling convolutional layer, the 8-fold down-sampling convolutional layer, the 16-fold down-sampling convolutional layer, and the 32-fold down-sampling convolutional layer are all connected to the third connection module through an attention optimization module and a multiplier, the first 4 times down-sampling convolution layer is connected with the first connection module through a multiplier, the second connection module is connected with the first connection module, the first connection module is connected with the simplified feature fusion module, the simplified feature fusion module is connected with the second connection module, and the 2 times down-sampling convolution layer is connected with a second connecting module through a decoder, the second connecting module is connected with the convolution module, the preset semantic segmentation network can reserve the spatial information through the spatial path module and enlarge the receptive field through the context path module, thus realizing the segmentation of the deep information of the image, in addition, the simplified feature fusion module increases the utilization of shallow pixel position information of the operation results of the spatial path module and the context path module, thus, deep layer and shallow layer information of the target are utilized, deep target segmentation can be achieved, and image segmentation efficiency is improved.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1A is a schematic structural diagram of a bilateral semantic segmentation network according to an embodiment of the present disclosure;

fig. 1B is a schematic structural diagram of an ARM module according to an embodiment of the present disclosure;

fig. 1C is a schematic structural diagram of an FFM module according to an embodiment of the present disclosure;

fig. 1D is a schematic structural diagram of an electronic device according to an embodiment of the present application;

fig. 1E is a schematic flowchart of an image segmentation method provided in an embodiment of the present application;

FIG. 1F is a schematic structural diagram of an improved bilateral semantic segmentation network provided by an embodiment of the present application;

FIG. 1G is a schematic diagram illustrating a segmentation effect of two bilateral semantic segmentation networks provided in an embodiment of the present application;

FIG. 2 is a schematic flowchart of another image segmentation method provided in the embodiments of the present application;

fig. 3 is a schematic structural diagram of another electronic device provided in an embodiment of the present application;

fig. 4 is a block diagram of functional units of an image segmentation apparatus according to an embodiment of the present application.

Detailed Description

The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

The electronic device related to the embodiments of the present application may include various handheld devices, vehicle-mounted devices, wearable devices (smart watches, smart bracelets, wireless headsets, augmented reality/virtual reality devices, smart glasses), computing devices or other processing devices connected to wireless modems, and various forms of User Equipment (UE), Mobile Stations (MS), terminal devices (terminal device), and the like, which have wireless communication functions. For convenience of description, the above-mentioned devices are collectively referred to as electronic devices.

In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that, in this embodiment of the present application, the preset semantic segmentation network may be a bilateral semantic segmentation network (BiSeNet), in the related art, a specific model of the BiSeNet is shown in fig. 1A, where the bilateral semantic segmentation network includes a spatial path module (spatial path), a context path module (context path), and a feature fusion module (feature fusion module, FFM), the spatial path module includes a 2-fold down-sampling layer (2x), a 4-fold down-sampling layer (4x), and an 8-fold down-sampling layer (8x), and the context path module includes: the system comprises a 4-time down-sampling layer (4x), an 8-time down-sampling layer (8x), a 16-time down-sampling layer (16x), a 32-time down-sampling layer (32x) and a connection module (concatenate, concat), wherein the 16-time down-sampling layer is connected with the connection module through an attention optimization module (ARM), the 32-time down-sampling layer is also respectively connected with a multiplier (mul) through a global average pooling layer (global average) and the attention optimization module, and the multiplier is connected with the connection module. The operation result of the 8-time down-sampling layer of the spatial path module and the operation result of the connection module are both connected with the attention optimization module, and then the operation result of the attention optimization module is up-sampled by 2 times to obtain a final operation result.

The ARM corresponding to the 16-time down-sampling layer is connected with the connection module after 2-time up-sampling, and the mul corresponding to the 32-time down-sampling layer is connected with the connection module after 4-time up-sampling.

The specific structure of the ARM module is shown in fig. 1B, and it can be seen that the ARM module mainly comprises global pool, 1 × 1 convolution, a normalization layer batch norm, an activation function sigmoid, and a multiplier mul, and the ARM module captures a global context by means of global average pooling and calculates an attention vector to guide feature learning.

The specific structure of the FFM module is shown in fig. 1C, and the output feature sizes of the two paths obtained before the FFM are different, so that simple addition cannot be performed. The SP characteristic of the spatial path rich in the position information and the CP characteristic of the context path rich in the semantic information are in different levels, so that FFM is needed to be fused, namely for given different characteristic inputs, the two characteristics are concatated firstly, then the scale of the characteristics is adjusted by Batch Normalization (BN), then the concatated result is pooled to obtain a characteristic vector and calculate a weight vector, and the weight vector can adjust the weight of the characteristics, thereby bringing the selection and combination of the characteristics.

The following describes embodiments of the present application in detail.

Referring to fig. 1D, fig. 1D is a schematic structural diagram of an electronic device disclosed in an embodiment of the present application, the electronic device 100 includes a storage and processing circuit 110, and a sensor 170 connected to the storage and processing circuit 110, where:

the electronic device 100 may include control circuitry, which may include storage and processing circuitry 110. The storage and processing circuitry 110 may be a memory, such as a hard drive memory, a non-volatile memory (e.g., flash memory or other electronically programmable read-only memory used to form a solid state drive, etc.), a volatile memory (e.g., static or dynamic random access memory, etc.), etc., and the embodiments of the present application are not limited thereto. Processing circuitry in storage and processing circuitry 110 may be used to control the operation of electronic device 100. The processing circuitry may be implemented based on one or more microprocessors, microcontrollers, digital signal processors, baseband processors, power management units, audio codec chips, application specific integrated circuits, display driver integrated circuits, and the like.

The storage and processing circuitry 110 may be used to run software in the electronic device 100, such as an Internet browsing application, a Voice Over Internet Protocol (VOIP) telephone call application, an email application, a media playing application, operating system functions, and so forth. Such software may be used to perform control operations such as, for example, camera-based image capture, ambient light measurement based on an ambient light sensor, proximity sensor measurement based on a proximity sensor, information display functionality based on status indicators such as status indicator lights of light emitting diodes, touch event detection based on a touch sensor, functionality associated with displaying information on multiple (e.g., layered) display screens, operations associated with performing wireless communication functionality, operations associated with collecting and generating audio signals, control operations associated with collecting and processing button press event data, and other functions in the electronic device 100, to name a few.

The electronic device 100 may include input-output circuitry 150. The input-output circuit 150 may be used to enable the electronic device 100 to input and output data, i.e., to allow the electronic device 100 to receive data from an external device and also to allow the electronic device 100 to output data from the electronic device 100 to the external device. The input-output circuit 150 may further include a sensor 170. Sensor 170 may include an ambient light sensor, a proximity sensor based on light and capacitance, a fingerprint recognition module, a touch sensor (e.g., based on a light touch sensor and/or a capacitive touch sensor, where the touch sensor may be part of a touch display screen, or may be used independently as a touch sensor structure), an acceleration sensor, a camera, and other sensors, etc., where the camera may be a front-facing camera or a rear-facing camera, and the fingerprint recognition module may be integrated below the display screen for collecting fingerprint images.

Input-output circuit 150 may also include one or more display screens, such as display screen 130. The display 130 may include one or a combination of liquid crystal display, organic light emitting diode display, electronic ink display, plasma display, display using other display technologies. The display screen 130 may include an array of touch sensors (i.e., the display screen 130 may be a touch display screen). The touch sensor may be a capacitive touch sensor formed by a transparent touch sensor electrode (e.g., an Indium Tin Oxide (ITO) electrode) array, or may be a touch sensor formed using other touch technologies, such as acoustic wave touch, pressure sensitive touch, resistive touch, optical touch, and the like, and the embodiments of the present application are not limited thereto.

The electronic device 100 may also include an audio component 140. The audio component 140 may be used to provide audio input and output functionality for the electronic device 100. The audio components 140 in the electronic device 100 may include a speaker, a microphone, a buzzer, a tone generator, and other components for generating and detecting sound.

The communication circuit 120 may be used to provide the electronic device 100 with the capability to communicate with external devices. The communication circuit 120 may include analog and digital input-output interface circuits, and wireless communication circuits based on radio frequency signals and/or optical signals. The wireless communication circuitry in communication circuitry 120 may include radio-frequency transceiver circuitry, power amplifier circuitry, low noise amplifiers, switches, filters, and antennas. For example, the wireless Communication circuitry in Communication circuitry 120 may include circuitry to support Near Field Communication (NFC) by transmitting and receiving Near Field coupled electromagnetic signals. For example, the communication circuit 120 may include a near field communication antenna and a near field communication transceiver. The communications circuitry 120 may also include a cellular telephone transceiver and antenna, a wireless local area network transceiver circuitry and antenna, and so forth.

The electronic device 100 may further include a battery, power management circuitry, and other input-output units 160. The input-output unit 160 may include buttons, joysticks, click wheels, scroll wheels, touch pads, keypads, keyboards, cameras, light emitting diodes and other status indicators, and the like.

A user may input commands through input-output circuitry 150 to control the operation of electronic device 100, and may use output data of input-output circuitry 150 to enable receipt of status information and other outputs from electronic device 100.

Based on the electronic device described in fig. 1D, the following functions can be implemented:

acquiring a target image, wherein the target image comprises a preset target;

It can be seen that, in the electronic device described in this embodiment of the present application, a target image is obtained, the target image includes a preset target, the target image is input to a preset semantic segmentation network to obtain a target segmentation result, the preset semantic segmentation network includes a spatial path module, a context path module, a simplified feature fusion module, a first connection module, a second connection module, and a convolution module, the spatial path module includes a 2-fold down-sampling convolutional layer and a first 4-fold down-sampling convolutional layer, the context path module includes a second 4-fold down-sampling convolutional layer, an 8-fold down-sampling convolutional layer, a 16-fold down-sampling convolutional layer, a 32-fold down-sampling convolutional layer, and a third connection module, where the second 4-fold down-sampling convolutional layer, the 8-fold down-sampling convolutional layer, the 16-fold down-sampling convolutional layer, and the 32-fold down-sampling convolutional layer are all connected to the third connection module through an attention optimization module and a multiplier, the first 4 times down-sampling convolution layer is connected with the first connection module through a multiplier, the second connection module is connected with the first connection module, the first connection module is connected with the simplified feature fusion module, the simplified feature fusion module is connected with the second connection module, and the 2 times down-sampling convolution layer is connected with a second connecting module through a decoder, the second connecting module is connected with the convolution module, the preset semantic segmentation network can reserve the spatial information through the spatial path module and enlarge the receptive field through the context path module, thus realizing the segmentation of the deep information of the image, in addition, the simplified feature fusion module increases the utilization of shallow pixel position information of the operation results of the spatial path module and the context path module, thus, deep layer and shallow layer information of the target are utilized, deep target segmentation can be achieved, and image segmentation efficiency is improved.

Referring to fig. 1E, fig. 1E is a schematic flowchart of an image segmentation method according to an embodiment of the present application, and as shown in the drawing, the image segmentation method is applied to the electronic device shown in fig. 1A, and includes:

101. and acquiring a target image, wherein the target image comprises a preset target.

The preset target may be a human, an animal (such as a cat, a dog, a panda, etc.), an object (a table, a chair, clothes), etc., and is not limited herein. The electronic device may obtain the target image by shooting with a camera, or the target image may be any image stored in advance.

In one possible example, when the preset target is a person, the step 101 of acquiring the target image may include the following steps:

11. acquiring a preview image, wherein the preview image comprises the preset target;

12. carrying out face recognition on the preview image to obtain a face area image;

13. acquiring target skin color information of the face region image;

14. determining target shooting parameters corresponding to the target skin color information according to a mapping relation between preset skin color information and the shooting parameters;

15. shooting according to the target shooting parameters to obtain the target image.

In this embodiment of the present application, the skin color information may be at least one of the following: color, average brightness value, location, etc., without limitation. The shooting parameters can be at least one of the following: sensitivity ISO, white balance parameters, focal length, object distance, exposure time, shooting mode, and the like, which are not limited herein. The electronic equipment can also pre-store the mapping relation between the preset skin color information and the shooting parameters.

In specific implementation, the electronic device may obtain a preview image, the preview image may include a preset target, the preview image may be subjected to face recognition to obtain a face region image, target skin color information may be obtained based on the face region image, further, a target shooting parameter corresponding to the target skin color information may be determined according to a mapping relationship between the preset skin color information and the shooting parameter, and shooting may be performed according to the target shooting parameter to obtain a target image, so that a clear face image may be obtained by shooting.

102. Inputting the target image into a preset semantic segmentation network to obtain a target segmentation result, wherein the preset semantic segmentation network comprises a space path module, a context path module, a simplified feature fusion module, a first connection module, a second connection module and a convolution module, the space path module comprises a 2-time down-sampling convolution layer and a first 4-time down-sampling convolution layer, the context path module comprises a second 4-time down-sampling convolution layer, an 8-time down-sampling convolution layer, a 16-time down-sampling convolution layer, a 32-time down-sampling convolution layer and a third connection module, wherein the second 4-time down-sampling convolution layer, the 8-time down-sampling convolution layer, the 16-time down-sampling convolution layer and the 32-time down-sampling convolution layer are connected with the third connection module through an attention optimization module and a multiplier, and the first 4-time down-sampling convolution layer is connected with the first connection module through a multiplier, The second connection module is connected with the first connection module, the first connection module is connected with the simplified feature fusion module, the simplified feature fusion module is connected with the second connection module, the 2-time down-sampling convolution layer is connected with the second connection module through a decoder, and the second connection module is connected with the convolution module.

Wherein, the decoder is also called decoder, and the simplified feature fusion module is composed of conv, bn (batch normalization) and rule (activation function). The multiplier connected with the first connection module can correspond to a weight value, and the value range of the weight value is 0-1.

In one possible example, the multipliers corresponding to the second 4 times down-sampling convolutional layer, the 8 times down-sampling convolutional layer, the 16 times down-sampling convolutional layer, and the 32 times down-sampling convolutional layer each correspond to one weight value, so as to obtain four weight values, and the sum of the four weight values is 1.

In one possible example, the multiplier corresponding to the first 4-fold downsampled convolutional layer corresponds to a weight value, and the value range of the weight value is 0 to 1.

In one possible example, the method further comprises:

the 32 times down-sampled convolutional layer is also connected with a multiplier corresponding to the 32 times down-sampled convolutional layer through a global average pooling layer.

In one possible example, after the 8 times downsampling convolutional layer is connected with the corresponding attention optimization module, and after a2 times upsampling operation is performed, the third connection module is connected;

after the 16 times down-sampling convolutional layer is connected with the corresponding attention optimization module, 4 times up-sampling operation is carried out, and then the third connection module is connected;

and after the 32-time down-sampling convolutional layer is connected with the corresponding attention optimization module, 8-time up-sampling operation is carried out, and then the third connection module is connected.

In one possible example, the operation result of the simplified feature fusion module is connected to the second connection module after 2 times upsampling.

In one possible example, the operation result of the convolution module obtains the target segmentation result after performing 2 times of upsampling.

In one possible example, the attention optimization module includes a global pooling layer, 1 x 1 convolutional layer, normalization layer, sigmoid function, and a multiplier.

In one possible example, the second 4-fold down-sampled convolutional layer, the 8-fold down-sampled convolutional layer, the 16-fold down-sampled convolutional layer, and the 32-fold down-sampled convolutional layer are sequentially connected in series.

In the embodiment of the present application, as shown in fig. 1F, fig. 1F is compared with fig. 1B, and the utilization of the feature maps of the 8-fold and 4-fold down-sampling convolutional layers is increased in the context path, the feature maps of the 8-fold and 4-fold down-sampling convolutional layers in the context path are connected to the ARM module, up-sampling is performed by 2 times (the 4-fold down-sampling part is not up-sampled), and a learnable weight coefficient is increased, and different weight coefficients are given to the feature maps after up-sampling, so that the model learns the influence degrees of different levels on the model result in the learning process. And then carrying out dimension connection operation on the feature graphs, respectively carrying out 4-time upsampling and 8-time upsampling on the feature graphs of the 16-time and 32-time downsampling convolutional layers to correspond to the modified spatial path, removing the 8-time downsampling convolutional layer in the spatial path module, only keeping the 2-time and 4-time downsampling convolutional layers, adding a learnable weight coefficient to the image of the 4-time downsampling convolutional layer, adding a decoder module to the spatial path module, increasing multiplexing of the 2-time downsampling feature graphs in the spatial path module, enabling the model to learn more accurate pixel position information, simplifying an FFM module, namely replacing the FFM module with a conv + bn + relu module, and finally adding a convolution operation to the output of the decoder module, further extracting fused features from the connected feature graphs and using the fused features for final portrait prediction.

In the embodiment of the application, the BiSeNet shown in fig. 1A is improved, a real-time high-precision portrait segmentation algorithm is realized, and the BiSeNet is mainly improved in three aspects: 1. enriching the high-level semantic information utilized by the context path part; 2. the utilization of the spatial path part to accurate pixel point position information is increased; 3. the learnable weight coefficient is increased, so that the model can autonomously select the characteristic information which has positive influence on the result. The method has the advantages that the increase of the operation amount is effectively controlled while the accuracy is improved, and the model can still keep the advantage of real-time segmentation after optimization.

In the embodiment of the application, the ARM module is applied to 4 different scale layers (4x, 8x, 16x and 32x), so that the feature information on the receptive field of more different scales can be contained in the high-level semantic information extracted by the context path of the improved model, and the design of the spatial path part is primarily aimed at obtaining the position information of the accurate pixel points by using a shorter extraction path, but 3 convolution modules with 2 times of downsampling can make the image downsampled by 8 times, even if the convolution path is shorter, the downsampling multiples are too many, the pixel points which easily cause details are downsampled excessively and lost, and on the contrary, the spatial path is made to lose the design aim of extracting the accurate pixel position information. On the other hand, the BiSeNet used in fig. 1A only uses the image feature map of the 8-fold down-sampling layer at the bottom, whereas in view of the effectiveness of the decoder module in the field of image semantic segmentation, the spatial path part in the embodiment of the present application is improved, only convolution operations of two 2-fold down-sampling layers are adopted, and the decoder module is added to further fuse the image features of the 2-fold down-sampling layers, perform convolution operations on the fused features, and extract the feature map finally used for prediction.

In specific implementation, although 2 ARM modules and decoder fusion parts are added to the improved model, a layer of downsampling convolution layer is correspondingly removed, the ARM module mainly comprises global pole, 1 × 1 convolution, batch norm, sigmoid and mul, too much operation cannot be added, and one downsampling convolution is composed of convolution, bn and relu.

In a possible example, when the preset target is a human face, the following steps may be further included between step 101 and step 102:

a1, extracting a target face image from the target image;

a2, matching the target face image with a preset face template;

and A3, when the target face image is successfully matched with the preset face template, performing step 102.

The preset face template can be stored in the electronic equipment in advance. The electronic device may match the target face image with a preset face template, and execute step 102 when the target face image is successfully matched with the preset face template, otherwise, not execute step 102. Therefore, on one hand, the face segmentation can be realized only aiming at the specified face, and on the other hand, the safety can be improved.

In one possible example, the step a2, matching the target face image with a preset face template, may include the following steps:

a21, carrying out image segmentation on the target face image to obtain a target face region image;

a22, analyzing the distribution of the characteristic points of the target face area image;

a23, performing circular image interception on the target face region image according to M different circle centers to obtain M circular face region images, wherein M is an integer greater than 3;

a24, selecting a target circular face region image from the M circular face region images, wherein the number of feature points contained in the target circular face region image is larger than that of other circular face region images in the M circular face region images;

a25, dividing the target circular face region image into N circular rings, wherein the widths of the N circular rings are the same;

a26, starting from the circular ring with the smallest radius in the N circular rings, sequentially matching the N circular rings with a preset face template for feature points, and accumulating the matching values of the matched circular rings;

and A27, stopping feature point matching immediately when the accumulated matching value is larger than the target face recognition threshold value, and outputting a prompt message of face recognition success.

Wherein, the electronic device can perform image segmentation on a target face image to obtain a target face region image, further analyze the distribution of feature points of the target face region image, perform circular image interception on the target face region image according to M different circle centers to obtain M circular face region images, M is an integer greater than 3, select the target circular face region image from the M circular face region images, the number of the feature points contained in the target circular face region image is greater than that of other circular face region images in the M circular face region images, divide the target circular face region image to obtain N circular rings, the ring widths of the N circular rings are the same, perform feature point matching on the N circular rings with a preset face template in sequence from the circular ring with the smallest radius among the N circular rings, and accumulate the matching values of the matched circular rings, thus, in the face recognition process, feature points of different positions or different faces can be used for matching, namely, the whole face image is sampled, and the sampling can cover the whole face area, so that corresponding representative features can be found from each area for matching, when the accumulated matching value is larger than a target face recognition threshold value, feature point matching is immediately stopped, and a prompt message of face recognition success is output, so that the face recognition can be rapidly and accurately recognized.

It should be noted that the present embodiment improves the BiSeNet shown in fig. 1A to implement a real-time high-precision portrait segmentation algorithm, and compared with the BiSeNet shown in fig. 1A, the present embodiment can implement more-precision portrait segmentation without increasing or decreasing obvious computation workload.

To further illustrate the effectiveness of the methods described herein, a segmentation effect graph and data comparison is shown in fig. 1G, as shown in fig. 1G, wherein (a) is the segmentation result of the BiSeNet shown in fig. 1A, and (b) is the optimized BiSeNet model segmentation result shown in fig. 1F.

As can be seen from FIG. 1G, the details of the legs and the body contour of the character in the figure (b) are obviously better than those in the figure (a), the legs and the posture of the character can be clearly distinguished, the body contour is clearly seen, the legs and the background of the character in the figure (a) are integrated and are difficult to distinguish, and the edge of the character is uneven. Therefore, the optimized model can obviously improve the portrait segmentation effect, the details are obviously superior to those of the model before optimization, and the figure details can be well distinguished.

The invention patent will also explain the effectiveness of the optimized solution from two aspects, except the effect diagram: time and mIOU. The data before and after optimization are shown in the following table. The test set is a whole-body portrait image and comprises various figure images and various edge details which can be contacted in daily life, for example, a figure carries a small personal object, the image comprises objects such as a dummy and the like, and the figure is partially shielded and the like. The test set was 576 x 576 pictures, which shows that the mlou was increased by 2.87% compared to the BiSeNet shown in fig. 1A, and the time consumption was almost negligible.

It can be seen that the image segmentation method described in this embodiment of the present application obtains a target image, where the target image includes a preset target, inputs the target image into a preset semantic segmentation network to obtain a target segmentation result, where the preset semantic segmentation network includes a spatial path module, a context path module, a simplified feature fusion module, a first connection module, a second connection module, and a convolution module, the spatial path module includes a 2-fold down-sampling convolutional layer and a first 4-fold down-sampling convolutional layer, the context path module includes a second 4-fold down-sampling convolutional layer, an 8-fold down-sampling convolutional layer, a 16-fold down-sampling convolutional layer, a 32-fold down-sampling convolutional layer, and a third connection module, where the second 4-fold down-sampling convolutional layer, the 8-fold down-sampling convolutional layer, the 16-fold down-sampling convolutional layer, and the 32-fold down-sampling convolutional layer are all connected to the third connection module through an attention optimization module and a multiplier, the first 4 times down-sampling convolution layer is connected with the first connection module through a multiplier, the second connection module is connected with the first connection module, the first connection module is connected with the simplified feature fusion module, the simplified feature fusion module is connected with the second connection module, and the 2 times down-sampling convolution layer is connected with a second connecting module through a decoder, the second connecting module is connected with the convolution module, the preset semantic segmentation network can reserve the spatial information through the spatial path module and enlarge the receptive field through the context path module, thus realizing the segmentation of the deep information of the image, in addition, the simplified feature fusion module increases the utilization of shallow pixel position information of the operation results of the spatial path module and the context path module, thus, deep layer and shallow layer information of the target are utilized, deep target segmentation can be achieved, and image segmentation efficiency is improved.

In summary, the BiSeNet described in the embodiments of the present application has the following differences compared with the BiSeNet shown in fig. 1A:

1. increasing the utilization of 8-fold and 4-fold down-sampling feature maps in the context path; connecting the characteristic graphs of an 8-time down-sampling layer and a 4-time down-sampling layer in the context path with an ARM module, and then performing 2-time up-sampling (the 4-time down-sampling part can not perform up-sampling);

2. and increasing learnable weight coefficients, namely endowing the feature graph after up-sampling with different weight coefficients, enabling the model to independently learn the influence degrees of different levels on the model result in the learning process, and then carrying out dimension connection operation on the model result. Respectively performing 4-time upsampling and 8-time upsampling on the feature maps of the 16-time downsampling convolutional layer and the 32-time downsampling convolutional layer so as to correspond to the modified spatial path;

3. removing the 8 times downsampling convolutional layer in the spatial path module, and only retaining the 2 times downsampling convolutional layer and the 4 times downsampling convolutional layer; adding a weight coefficient which can be learned to the image of the 4 times downsampling convolutional layer;

4. adding a decoder module to the spatial path module; the multiplexing of the feature map of the 2-time downsampling convolutional layer in the spatial path module is increased, so that the model can learn more accurate pixel position information;

5. the FFM module is removed, and a conv + bn + relu module is used for replacing, so that the FFM module structure is simplified, and the operation efficiency is improved;

6. and adding a convolution operation to the output of the decoder module, and further extracting the fused features from the connected feature graphs for final portrait prediction.

In summary, the improved BiSeNet mainly performs improvements in three aspects: 1. enriching the high-level semantic information utilized by the context path part; 2. the utilization of the spatial path part to accurate pixel point position information is increased; 3. the learnable weight coefficient is increased, so that the model can autonomously select the characteristic information which has positive influence on the result. The method and the device have the advantages that the increase of the operation amount is effectively controlled while the accuracy is improved, and the model can still keep the advantage of real-time segmentation after optimization.

Referring to fig. 2, in keeping with the embodiment shown in fig. 1E, fig. 2 is a schematic flowchart of an image segmentation method provided in the present application, and as shown in the figure, the image segmentation method is applied to the electronic device shown in fig. 1D, and the image segmentation method includes:

201. and acquiring a target image, wherein the target image comprises a human face.

202. And extracting a target face image from the target image.

203. And matching the target face image with a preset face template.

204. When the target face image is successfully matched with the preset face template, the target image is input into a preset semantic segmentation network to obtain a target segmentation result, the preset semantic segmentation network comprises a space path module, a context path module, a simplified feature fusion module, a first connection module, a second connection module and a convolution module, the space path module comprises a 2-time down-sampling convolution layer and a first 4-time down-sampling convolution layer, the context path module comprises a second 4-time down-sampling convolution layer, an 8-time down-sampling convolution layer, a 16-time down-sampling convolution layer, a 32-time down-sampling convolution layer and a third connection module, wherein the second 4-time down-sampling convolution layer, the 8-time down-sampling convolution layer, the 16-time down-sampling convolution layer and the 32-time down-sampling convolution layer are connected with the third connection module through an attention optimization module and a multiplier, the first 4 times down-sampling convolutional layer is connected with the first connecting module through a multiplier, the second connecting module is connected with the first connecting module, the first connecting module is connected with the simplified feature fusion module, the simplified feature fusion module is connected with the second connecting module, the 2 times down-sampling convolutional layer is connected with the second connecting module through a decoder, and the second connecting module is connected with the convolutional module.

For the detailed description of the steps 201 to 204, reference may be made to the corresponding steps of the image segmentation method described in the above fig. 1E, and details are not repeated here.

It can be seen that, in the image segmentation method described in the embodiment of the present application, for a face image, a preset semantic segmentation network, that is, an improved BiSeNet algorithm, can be used to implement image segmentation, where the preset semantic segmentation network can retain spatial information through a spatial path module and expand a receptive field through a context path module, so that deep information of the image can be segmented, and in addition, a simplified feature fusion module increases utilization of shallow pixel position information of operation results of the spatial path module and the context path module, so that deep and shallow information of a target are both utilized, deep target segmentation can be implemented, and image segmentation efficiency is improved.

In accordance with the foregoing embodiments, please refer to fig. 3, fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application, and as shown in the drawing, the electronic device includes a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and in an embodiment of the present application, the programs include instructions for performing the following steps:

acquiring a target image, wherein the target image comprises a preset target;

In one possible example, the 32 times downsampled convolutional layer also connects the multiplier corresponding to the 32 times downsampled convolutional layer through a global average pooling layer.

The above description has introduced the solution of the embodiment of the present application mainly from the perspective of the method-side implementation process. It is understood that the electronic device comprises corresponding hardware structures and/or software modules for performing the respective functions in order to realize the above-mentioned functions. Those of skill in the art will readily appreciate that the present application is capable of hardware or a combination of hardware and computer software implementing the various illustrative elements and algorithm steps described in connection with the embodiments provided herein. Whether a function is performed as hardware or computer software drives hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiment of the present application, the electronic device may be divided into the functional units according to the method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation.

Fig. 4 is a block diagram showing functional units of an image segmentation apparatus 400 according to an embodiment of the present application. The image segmentation apparatus 400 is applied to an electronic device, and the apparatus 400 includes an obtaining unit 401 and a segmentation unit 402, wherein,

an obtaining unit 401, configured to obtain a target image, where the target image includes a preset target;

a segmentation unit 402, configured to input the target image into a preset semantic segmentation network to obtain a target segmentation result, where the preset semantic segmentation network includes a spatial path module, a context path module, a simplified feature fusion module, a first connection module, a second connection module, and a convolution module, the spatial path module includes a 2-fold down-sampling convolution layer and a first 4-fold down-sampling convolution layer, the context path module includes a second 4-fold down-sampling convolution layer, an 8-fold down-sampling convolution layer, a 16-fold down-sampling convolution layer, a 32-fold down-sampling convolution layer, and a third connection module, where the second 4-fold down-sampling convolution layer, the 8-fold down-sampling convolution layer, the 16-fold down-sampling convolution layer, and the 32-fold down-sampling convolution layer are all connected to the third connection module through an attention optimization module and a multiplier, and the first connection module is connected to the first 4-fold down-sampling convolution layer through a multiplier, The second connection module is connected with the first connection module, the first connection module is connected with the simplified feature fusion module, the simplified feature fusion module is connected with the second connection module, the 2-time down-sampling convolution layer is connected with the second connection module through a decoder, and the second connection module is connected with the convolution module.

It can be seen that the image segmentation apparatus described in the embodiment of the present application obtains a target segmentation result by obtaining a target image, where the target image includes a preset target, and inputting the target image into a preset semantic segmentation network, where the preset semantic segmentation network includes a spatial path module, a context path module, a simplified feature fusion module, a first connection module, a second connection module, and a convolution module, the spatial path module includes a 2-fold down-sampling convolutional layer and a first 4-fold down-sampling convolutional layer, the context path module includes a second 4-fold down-sampling convolutional layer, an 8-fold down-sampling convolutional layer, a 16-fold down-sampling convolutional layer, a 32-fold down-sampling convolutional layer, and a third connection module, where the second 4-fold down-sampling convolutional layer, the 8-fold down-sampling convolutional layer, the 16-fold down-sampling convolutional layer, and the 32-fold down-sampling convolutional layer are all connected to the third connection module through an attention optimization module and a multiplier, the first 4 times down-sampling convolution layer is connected with the first connection module through a multiplier, the second connection module is connected with the first connection module, the first connection module is connected with the simplified feature fusion module, the simplified feature fusion module is connected with the second connection module, and the 2 times down-sampling convolution layer is connected with a second connecting module through a decoder, the second connecting module is connected with the convolution module, the preset semantic segmentation network can reserve the spatial information through the spatial path module and enlarge the receptive field through the context path module, thus realizing the segmentation of the deep information of the image, in addition, the simplified feature fusion module increases the utilization of shallow pixel position information of the operation results of the spatial path module and the context path module, thus, deep layer and shallow layer information of the target are utilized, deep target segmentation can be achieved, and image segmentation efficiency is improved.

It can be understood that the functions of each program module of the image segmentation apparatus of this embodiment may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the related description of the foregoing method embodiment, which is not described herein again.

Embodiments of the present application also provide a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, the computer program enabling a computer to execute part or all of the steps of any one of the methods described in the above method embodiments, and the computer includes an electronic device.

Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any of the methods as described in the above method embodiments. The computer program product may be a software installation package, the computer comprising an electronic device.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the above-described division of the units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit may be stored in a computer readable memory if it is implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present application may be substantially implemented or a part of or all or part of the technical solution contributing to the prior art may be embodied in the form of a software product stored in a memory, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the above-mentioned method of the embodiments of the present application. And the aforementioned memory comprises: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash Memory disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.

The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. A method of image segmentation, the method comprising:

acquiring a target image, wherein the target image comprises a preset target;

2. The method of claim 1, wherein the multipliers corresponding to the second 4 times downsampled convolutional layer, the 8 times downsampled convolutional layer, the 16 times downsampled convolutional layer, and the 32 times downsampled convolutional layer each correspond to one weight value, resulting in four weight values, and the sum of the four weight values is 1.

3. The method according to claim 1 or 2, wherein the multiplier corresponding to the first 4-fold downsampled convolutional layer corresponds to a weight value, and the value range of the weight value is 0 to 1.

4. The method according to claim 1 or 2, characterized in that the method further comprises:

5. The method according to claim 1 or 2, wherein the 8 times downsampling convolutional layer is connected with the corresponding attention optimization module, and then the third connection module is connected after 2 times upsampling operation is performed;

6. The method according to claim 5, wherein the operation result of the simplified feature fusion module is connected to the second connection module after 2 times upsampling.

7. The method of claim 6, wherein the operation result of the convolution module is obtained after 2 times upsampling.

8. An image segmentation apparatus, characterized in that the apparatus comprises:

9. An electronic device comprising a processor, a memory for storing one or more programs and configured for execution by the processor, the programs comprising instructions for performing the steps of the method of any of claims 1-7.

10. A computer-readable storage medium, characterized in that a computer program for electronic data exchange is stored, wherein the computer program causes a computer to perform the method according to any one of claims 1-7.