CN108765425B - Image segmentation method and device, computer equipment and storage medium

Image segmentation method and device, computer equipment and storage medium

Info

Publication number
CN108765425B
CN108765425B
Authority
CN
China
Prior art keywords
convolution
super
depth value
neural network
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810463609.0A
Other languages
Chinese (zh)
Other versions
CN108765425A (en)
Inventor
林迪 (Di Lin)
黄惠 (Hui Huang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University
Priority to CN201810463609.0A
Publication of CN108765425A
Application granted
Publication of CN108765425B
Legal status: Active

Classifications

    • GPHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis; G06T 7/10 Segmentation; Edge detection; G06T 7/11 Region-based segmentation
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement; G06T 2207/20 Special algorithmic details; G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]

Abstract

The application relates to an image segmentation method and apparatus, a computer device, and a storage medium. The method comprises the following steps: acquiring an image to be segmented; inputting the image to be segmented as an input variable of a full convolution neural network and outputting a convolution feature map; inputting the convolution feature map as an input variable of a context switchable neural network and outputting context expression information; and generating an intermediate feature map according to the convolution feature map and the context expression information, the intermediate feature map being used for image segmentation. By combining the downscaled, stacked feature maps of the convolutional neural network to compute the semantic features of the image, and using the context expression information generated by the context switchable neural network for segmentation, the accuracy of image segmentation can be improved.

Description

Image segmentation method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image segmentation method and apparatus, a computer device, and a storage medium.
Background
With the development of image processing technology, image segmentation has become an important part of the field, and machine learning plays an important role in it. Depth data can provide geometric information in an image and contains a large amount of information useful for image segmentation.
However, in current methods that implement image segmentation with convolutional neural networks, most of the information useful for image segmentation contained in the depth data is lost in the output of the convolutional neural network, and the image segmentation accuracy is therefore poor.
Disclosure of Invention
In view of the above technical problems, it is necessary to provide an image segmentation method and apparatus, a computer device, and a storage medium capable of improving image segmentation accuracy.
A method of image segmentation, the method comprising:
acquiring an image to be segmented;
inputting the image to be segmented as an input variable of a full convolution neural network, and outputting a convolution feature map;
inputting the convolution feature map as an input variable of a context switchable neural network, and outputting context expression information;
and generating an intermediate feature map according to the convolution feature map and the context expression information, wherein the intermediate feature map is used for image segmentation.
In one embodiment, the inputting the convolution feature map into an input variable of a context-switchable neural network and outputting context expression information includes:
dividing the convolution feature map into super-pixel regions, wherein the super-pixel regions are sub-regions of the convolution feature map;
and generating a local feature map according to the super-pixel region.
In one embodiment, the inputting the convolution feature map into an input variable of a context-switchable neural network and outputting context expression information further includes:
calculating an average depth value of the super pixel region;
and generating context expression information corresponding to the super pixel area according to the average depth value.
In one embodiment, the generating context expression information corresponding to the super pixel region according to the average depth value includes:
comparing the average depth value to a conditional depth value;
compressing the super-pixel region when the average depth value is less than the conditional depth value;
and when the average depth value is larger than or equal to the conditional depth value, expanding the super-pixel area.
In one embodiment, the compressing the super-pixel region when the average depth value is smaller than the conditional depth value includes:
inputting the local feature map corresponding to the super-pixel region into three preset convolutional neural networks for processing, to obtain a compressed super-pixel region;
wherein the three convolutional neural networks comprise two with convolution kernels of size 1 and one with a convolution kernel of size 3.
In one embodiment, the expanding the super-pixel region when the average depth value is greater than or equal to the conditional depth value includes:
inputting the local feature map corresponding to the super-pixel region into three preset convolutional neural networks for processing, to obtain an expanded super-pixel region;
wherein the three convolutional neural networks comprise two with convolution kernels of size 7 and one with a convolution kernel of size 1.
In one embodiment, the context switchable neural network is trained by:
obtaining an input-layer node sequence according to the convolution feature map and the category of the convolution feature map, projecting the input-layer node sequence to obtain the hidden-layer node sequence corresponding to the first hidden layer, and taking the first hidden layer as the currently processed hidden layer;
and obtaining the hidden-layer node sequence of the next hidden layer by nonlinear mapping according to the hidden-layer node sequence corresponding to the currently processed hidden layer and the weights and biases of the neuron nodes corresponding to the currently processed hidden layer; taking the next hidden layer as the currently processed hidden layer and repeating this step until the output layer is reached, obtaining the context expression information probability matrix output by the output layer and corresponding to the category of the convolution feature map.
An image segmentation apparatus, the apparatus comprising:
the image acquisition module is used for acquiring an image to be segmented;
the feature map output module is used for inputting the image to be segmented as an input variable of a full convolution neural network and outputting a convolution feature map;
the information output module is used for inputting the convolution feature map as an input variable of the context switchable neural network and outputting context expression information;
and the feature map generation module is used for generating an intermediate feature map according to the convolution feature map and the context expression information, and the intermediate feature map is used for carrying out image segmentation.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring an image to be segmented;
inputting the image to be segmented as an input variable of a full convolution neural network, and outputting a convolution feature map;
inputting the convolution feature map as an input variable of a context switchable neural network, and outputting context expression information;
and generating an intermediate feature map according to the convolution feature map and the context expression information, wherein the intermediate feature map is used for image segmentation.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring an image to be segmented;
inputting the image to be segmented as an input variable of a full convolution neural network, and outputting a convolution feature map;
inputting the convolution feature map as an input variable of a context switchable neural network, and outputting context expression information;
and generating an intermediate feature map according to the convolution feature map and the context expression information, wherein the intermediate feature map is used for image segmentation.
According to the image segmentation method and apparatus, the computer device, and the storage medium, the image to be segmented is acquired and input as an input variable of the full convolution neural network, which outputs a convolution feature map; the convolution feature map is input as an input variable of the context switchable neural network, which outputs context expression information; and an intermediate feature map is generated according to the convolution feature map and the context expression information and used for image segmentation. By combining the downscaled, stacked feature maps of the convolutional neural network to compute the semantic features of the image, and using the context expression information generated by the context switchable neural network for segmentation, the accuracy of image segmentation can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a diagram illustrating an internal structure of a computer device according to an embodiment;
FIG. 2 is a flow diagram illustrating a method for image segmentation in one embodiment;
FIG. 3 is a flow diagram that illustrates a method for generating a local feature map, according to one embodiment;
FIG. 4 is a flow diagram that illustrates a method for generating context expression information, in accordance with an embodiment;
FIG. 5 is a flow diagram illustrating a method for processing a super pixel region in one embodiment;
FIG. 6 is a schematic diagram of a compression architecture in one embodiment;
FIG. 7 is a diagram of an extended architecture in one embodiment;
FIG. 8 is an architectural diagram of a context switchable neural network in one embodiment;
FIG. 9 is a diagram of the local structure corresponding to the average depth values of super-pixel regions in one embodiment;
FIG. 10 is a block diagram showing the structure of an image segmentation apparatus according to an embodiment;
FIG. 11 is a block diagram of an information output module in one embodiment;
FIG. 12 is a schematic comparison of the segmentation results of different methods on images from the NYUDv2 dataset in the experiments;
FIG. 13 is a schematic comparison of the segmentation results of different methods on images from the SUN-RGBD dataset in the experiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to limit the present application. The technical features of the above embodiments can be combined arbitrarily; for brevity, not all possible combinations of these technical features are described, but as long as a combination of technical features is not contradictory, it should be considered to be within the scope described in this specification.
Fig. 1 is a schematic diagram of the internal structure of a computer device in one embodiment. The computer device can be a terminal or a server; the terminal can be an electronic device with a communication function, such as a smartphone, tablet computer, notebook computer, desktop computer, personal digital assistant, wearable device, or vehicle-mounted device, and the server can be an independent server or a server cluster. Referring to fig. 1, the computer device includes a processor, a non-volatile storage medium, an internal memory, and a network interface connected through a system bus. The non-volatile storage medium of the computer device can store an operating system and a computer program that, when executed, causes the processor to perform an image segmentation method. The processor of the computer device provides the computing and control capability that supports the operation of the whole computer device. The internal memory can store an operating system, a computer program, and a database; when the computer program is executed by the processor, it causes the processor to perform the image segmentation method. The network interface of the computer device is used for network communication.
Those skilled in the art will appreciate that the architecture shown in fig. 1 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, as shown in fig. 2, an image segmentation method is provided, which is described by taking the method as an example applied to the computer device in fig. 1, and comprises the following steps:
step 202, acquiring an image to be segmented.
An image is a depiction or photograph of an object and a commonly used information carrier; it contains information about the described object. For example, an image may be a photograph taken by a camera that contains different objects, or an information carrier synthesized by computer software that contains information about different objects.
Image segmentation refers to the division of an image into a number of specific regions with unique properties. The image to be segmented may be acquired in real time or may be pre-stored, for example, the image to be segmented acquired by the user may be acquired in real time through a camera, or the image to be segmented may be pre-stored in a database and then the image to be segmented is acquired from the database.
Step 204, inputting the image to be segmented as an input variable of the full convolution neural network, and outputting a convolution feature map.
The full convolution neural network (FCN, Fully Convolutional Network) is a pre-trained neural network model that can be used for image segmentation. An FCN used for image segmentation can recover the class to which each pixel belongs from abstract features, i.e. it extends classification from the image level to the pixel level. The full convolution neural network model can include convolution layers and pooling layers. A convolution layer has convolution kernels, the weight matrices used for feature extraction, and each kernel has a stride that sets how the convolution operation steps across the input, reducing the number of weights. The pooling layer, also called a down-sampling layer, can reduce the dimensionality of the matrix.
The image to be segmented is used as the input data of the pre-trained full convolution neural network. It is fed into the convolution layers, where the convolution kernels scan the input image according to their size and stride and perform the convolution operation. After convolution, the pooling layers process the result and effectively reduce its dimensionality. The full convolution neural network outputs the convolution feature map obtained after the convolution layer and pooling layer processing.
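To make the data flow concrete, the following is a minimal PyTorch-style sketch of convolution and pooling layers producing a downsampled convolution feature map. The module name, channel counts, and depth are illustrative assumptions, not the patented architecture:

```python
import torch.nn as nn

class TinyFCNBackbone(nn.Module):
    """Minimal sketch of a fully convolutional backbone.

    Channel counts and depth are illustrative assumptions; the patent
    does not fix a specific architecture.
    """
    def __init__(self, in_channels=3, feat_channels=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),  # pooling halves spatial dims
            nn.Conv2d(feat_channels, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

    def forward(self, image):
        # image: (N, 3, H, W) -> convolution feature map: (N, C, H/4, W/4)
        return self.features(image)
```

Under these assumptions, a 480 x 640 RGB image would yield a 120 x 160 convolution feature map.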
Step 206, inputting the convolution feature map as an input variable of the context switchable neural network, and outputting context expression information.
The context switchable neural network is pre-trained according to the image structure and the depth data; it is also a full convolution neural network and likewise comprises convolution layers and pooling layers. Context expression information refers to some or all of the information that can affect an object in the image.
After the convolution feature map is input as the input variable of the context switchable neural network, the convolution layers and pooling layers in the context switchable neural network process the convolution feature map, and the context expression information of the convolution feature map can be obtained.
And step 208, generating an intermediate feature map according to the convolution feature map and the context expression information, wherein the intermediate feature map is used for image segmentation.
The intermediate feature maps may be a plurality of feature maps of different resolutions, ordered from low to high resolution. The intermediate feature map is generated according to the convolution feature map and the context expression information, which can be expressed as:

F_{l+1} = M_{l+1} + D_{l→(l+1)}, l = 0, …, L−1,

where L denotes the number of intermediate feature maps, F_{l+1} denotes the generated intermediate feature map, F_1 denotes the intermediate feature map with the lowest resolution, and F_L the one with the highest resolution. M_{l+1} denotes the convolution feature map output by the full convolution neural network, and D_{l→(l+1)} denotes the context expression information corresponding to the convolution feature map. The generated intermediate feature maps are used for image segmentation.
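As a minimal sketch of this combination rule, assuming the maps are stored as lists of tensors whose shapes already match (function and variable names are assumptions):

```python
def build_intermediate_feature_maps(M, D):
    """Apply F_{l+1} = M_{l+1} + D_{l->(l+1)} for l = 0..L-1.

    M: list of convolution feature maps [M_1, ..., M_L], low to high resolution.
    D: list of context expression maps [D_{0->1}, ..., D_{(L-1)->L}];
       shapes are assumed to match the corresponding entries of M.
    """
    return [m + d for m, d in zip(M, D)]  # element-wise sum per level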
In this method, the image to be segmented is acquired and input as an input variable of the full convolution neural network, which outputs a convolution feature map; the convolution feature map is input as an input variable of the context switchable neural network, which outputs context expression information; and an intermediate feature map is generated according to the convolution feature map and the context expression information and used for image segmentation. By combining the downscaled, stacked feature maps of the convolutional network to compute the semantic features of the image, and using the context expression information generated by the context switchable neural network for segmentation, the accuracy of image segmentation can be improved.
As shown in fig. 3, in an embodiment, the provided image segmentation method may further include a process of generating a local feature map, and the specific steps include:
step 302, dividing the convolution feature map into super-pixel regions, wherein the super-pixel regions are sub-regions of the convolution feature map.
Superpixel segmentation divides an image that is originally at the pixel level into a map at the region level; a superpixel algorithm can be used to divide the image into superpixel regions. After superpixel division, a number of regions of different sizes are obtained, and these regions contain effective information such as color histograms and texture information. For example, if there is a person in the image, we can perform superpixel segmentation on the image of the person and then, by extracting features of each small region, recognize which part of the human body (head, shoulder, leg) each region belongs to, thereby establishing a joint image of the human body.
After the convolution feature map is divided into super-pixel regions, a plurality of super-pixel regions can be obtained; the obtained super-pixel regions do not overlap, and each super-pixel region is a sub-region of the convolution feature map.
Step 304, a local feature map is generated from the super-pixel region.
Each super-pixel region corresponds to a local feature map. The generation of the local feature map can be expressed as:

D_l = H({M_l(r_i) : r_i ∈ Φ(S_n)}),

where S_n denotes a super-pixel region and the region r_i denotes a receptive field in the image to be segmented. The receptive field is the size of the region of visual perception; in a convolutional neural network, the receptive field is defined as the area on the original image onto which a pixel of the feature map output by each layer is mapped. Φ(S_n) denotes the set of receptive fields whose centers lie within the super-pixel region, and H(·) denotes the local structure map. As the formula shows, for a region r_i the generated local feature map contains the features of the convolution feature map, and therefore the generated local feature map retains the contents of the original region r_i.
The convolution feature map is divided into super-pixel regions, which are sub-regions of the convolution feature map, and a local feature map is generated from each super-pixel region. Dividing the feature map into super-pixel regions before generating the local feature maps preserves the content of the original regions, making image segmentation more accurate.
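The following is a minimal sketch of this step, assuming the SLIC algorithm from scikit-image as the superpixel method (the patent does not name a specific algorithm) and a feature map aligned to the image resolution (in practice the superpixel labels would be downsampled to the feature-map resolution):

```python
import numpy as np
from skimage.segmentation import slic

def local_feature_maps(feature_map, image, n_segments=100):
    """Divide an image into superpixels and crop one local feature map
    per superpixel by zeroing the features outside that region.

    feature_map: (C, H, W) array assumed aligned with the image.
    """
    labels = slic(image, n_segments=n_segments)  # superpixel labels, (H, W)
    locals_ = {}
    for n in np.unique(labels):
        mask = (labels == n).astype(feature_map.dtype)  # sub-region S_n
        locals_[n] = feature_map * mask                 # keep contents of S_n only
    return labels, locals_
```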
In an embodiment, as shown in fig. 4, the provided image segmentation method may further include a process of generating context expression information, and the specific steps include:
in step 402, an average depth value of the superpixel region is calculated.
The gray value of each pixel point in the depth image can be used for representing the distance between a certain point in the scene and the camera, and the depth value is the distance value between the certain point in the scene and the camera. Multiple objects can coexist in a super pixel region at the same time. By obtaining the depth value of each object in the super-pixel region, the average depth value of the whole super-pixel region can be calculated according to the depth value of each object.
In step 404, context expression information corresponding to the super-pixel region is generated according to the average depth value.
Depth values are important data for generating context expression information. The context expression information corresponding to each super pixel region is generated according to the average depth value of each super pixel region.
By calculating the average depth value of the super-pixel region, the context expression information corresponding to the super-pixel region is generated according to the average depth value. Generating the context expression information from the average depth value of each super-pixel region makes the generated context expression information more accurate and thus improves image segmentation accuracy.
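A minimal sketch of the average-depth computation, assuming per-pixel depth values and the superpixel labels from the previous step:

```python
import numpy as np

def average_depth_per_superpixel(depth, labels):
    """Mean depth value of each superpixel region.

    depth:  (H, W) array of per-pixel distances to the camera.
    labels: (H, W) array of superpixel labels for the same image.
    """
    return {int(n): float(depth[labels == n].mean()) for n in np.unique(labels)}
```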
As shown in fig. 5, in an embodiment, the provided image segmentation method may further include a process of processing the super-pixel region, and the specific steps include:
step 502, compare the average depth value with the conditional depth value.
The conditional depth value can be a specific value set in advance. After calculating the average depth value, the computer device can compare it with the conditional depth value.
Step 504, when the average depth value is smaller than the conditional depth value, the super pixel area is compressed.
When the average depth value is smaller than the conditional depth value, the amount of information in the super-pixel region is large, and a compression architecture needs to be adopted to refine the information in the super-pixel region and reduce its overly diversified information. The compression architecture can learn to re-weight the corresponding super-pixel region; the compression of the super-pixel region can be expressed as:

D̃_l(r_j) = c(D_l(r_j)) ⊙ D_l(r_j),

where r_j denotes a region whose average depth value is smaller than the conditional depth value, c(·) denotes the compression architecture, D_l denotes the local feature map, ⊙ denotes element-wise multiplication of corresponding matrix entries, and D̃_l(r_j) denotes the super-pixel region after compression by the compression architecture.
In step 506, when the average depth value is greater than or equal to the conditional depth value, the super-pixel area is expanded.
When the average depth value is greater than or equal to the conditional depth value, the amount of information in the super-pixel region is small, and an extension architecture needs to be adopted to enrich the information in the super-pixel region. The extension of the super-pixel region can be expressed as:

D̂_l(r_j) = ε(D_l(r_j)),

where ε(·) denotes the extension architecture and D̂_l(r_j) denotes the super-pixel region after extension by the extension architecture.
By comparing the average depth value with the conditional depth value, the super-pixel region is compressed when the average depth value is less than the conditional depth value, and the super-pixel region is expanded when the average depth value is greater than or equal to the conditional depth value. The super pixel region is processed by selecting a compression architecture or an expansion architecture according to the size of the average depth value, so that the image segmentation accuracy can be improved.
In one embodiment, the context expression information generated by the context switchable neural network can be expressed by the formula:

D_{l→(l+1)}(r_i) = 1[d(S_n) < d(S_m)] · c(D_l(r_j)) ⊙ D_l(r_j) + 1[d(S_n) ≥ d(S_m)] · ε(D_l(r_j)),

where the super-pixel region S_n is adjacent to the super-pixel region S_m. Through this formula, the information of the receptive-field region r_j in the super-pixel region S_m is transferred top-down to the receptive-field region r_i. ε(·) denotes the extension architecture, c(·) denotes the compression architecture, and d(S_n) denotes the average depth of the super-pixel region S_n. The indicator function 1[·] is used for switching between the extension architecture and the compression architecture: when d(S_n) < d(S_m), switching to the compression architecture refines the information of the receptive-field region r_i; when d(S_n) ≥ d(S_m), switching to the extension architecture enriches the information of the receptive-field region r_i.
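A minimal sketch of this switch, with the two architectures passed in as callables (an assumption; the names are illustrative):

```python
def switchable_context(local_feat, d_Sn, d_Sm, compress, extend):
    """Select the branch per the indicator function: compress (refine
    overly diverse information) when the receiving superpixel is closer
    to the camera than its neighbour, extend (enrich context by widening
    the receptive field) otherwise.
    """
    if d_Sn < d_Sm:
        return compress(local_feat)  # compression architecture c(.)
    return extend(local_feat)        # extension architecture eps(.)
```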
In an embodiment, as shown in fig. 6, the provided image segmentation method may further include a process of compressing the super-pixel region, which specifically includes: inputting the local feature map corresponding to the super-pixel region into three preset convolutional neural networks for processing to obtain the compressed super-pixel region, wherein the three convolutional neural networks comprise two with convolution kernels of size 1 and one with a convolution kernel of size 3.
As shown in fig. 6, the local feature map 610 is input to the compression architecture, which consists of a first 1 × 1 convolutional layer 620, a 3 × 3 convolutional layer 630, and a second 1 × 1 convolutional layer 640. The first 1 × 1 convolutional layer 620 and the second 1 × 1 convolutional layer 640 are convolutional layers with kernels of size 1, and the 3 × 3 convolutional layer 630 is a convolutional layer with a kernel of size 3.
The local feature map 610 is input into the compression architecture and processed by the first 1 × 1 convolutional layer 620, which halves the dimensionality of the local feature map 610; this halving filters out useless information while retaining the useful information in the local feature map 610. After the dimensionality reduction, the 3 × 3 convolutional layer 630 restores the dimensionality, rebuilding it back to the original size, and the second 1 × 1 convolutional layer then generates a re-weighting vector c(D_l(r_j)); the compressed super-pixel region is generated based on this re-weighting vector.
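A minimal PyTorch-style sketch of this compression architecture follows; the sigmoid gate on the re-weighting vector and the ReLU activations are assumptions, since the patent only specifies the layer sizes and the element-wise re-weighting:

```python
import torch
import torch.nn as nn

class CompressionBranch(nn.Module):
    """Sketch of the compression architecture: a 1x1 conv halves the
    channels, a 3x3 conv restores them, and a second 1x1 conv emits a
    re-weighting map c(D_l(r_j)) applied element-wise to the input.
    """
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.restore = nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1)
        self.reweight = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, local_feat):
        x = torch.relu(self.reduce(local_feat))  # halve dims, filter useless info
        x = torch.relu(self.restore(x))          # rebuild original dimensionality
        c = torch.sigmoid(self.reweight(x))      # re-weighting vector (assumed gate)
        return c * local_feat                    # element-wise re-weighting
```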
As shown in fig. 7, in an embodiment, the provided image segmentation method may further include a process of extending the super-pixel region, which specifically includes: inputting the local feature map corresponding to the super-pixel region into three preset convolutional neural networks for processing to obtain the extended super-pixel region, wherein the three convolutional neural networks comprise two with convolution kernels of size 7 and one with a convolution kernel of size 1.
As shown in fig. 7, the extension architecture consists of a first 7 × 7 convolutional layer 720, a 1 × 1 convolutional layer 730, and a second 7 × 7 convolutional layer 740. After the local feature map 710 enters the extension architecture, the first 7 × 7 convolutional layer 720 uses a larger kernel to widen the receptive field and learn the relevant context expression information. The 1 × 1 convolutional layer 730 halves the dimensionality and removes the redundant information introduced by the large kernel of the first 7 × 7 convolutional layer 720. The second 7 × 7 convolutional layer 740 recovers the dimensionality so that ε(D_l(r_j)) matches the dimensions of D_l(r_j).
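A minimal PyTorch-style sketch of the extension architecture under the same assumptions (the activations are illustrative):

```python
import torch
import torch.nn as nn

class ExtensionBranch(nn.Module):
    """Sketch of the extension architecture: a first 7x7 conv widens the
    receptive field, a 1x1 conv halves the channels to drop redundancy,
    and a second 7x7 conv restores the dimensions so that eps(D_l(r_j))
    matches D_l(r_j).
    """
    def __init__(self, channels):
        super().__init__()
        self.widen = nn.Conv2d(channels, channels, kernel_size=7, padding=3)
        self.reduce = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.restore = nn.Conv2d(channels // 2, channels, kernel_size=7, padding=3)

    def forward(self, local_feat):
        x = torch.relu(self.widen(local_feat))  # large kernel widens receptive field
        x = torch.relu(self.reduce(x))          # halve channels, remove redundancy
        return self.restore(x)                  # recover matching dimensions
```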
In one embodiment, the context switchable neural network is trained by: obtaining an input-layer node sequence according to the convolution feature map and the category of the convolution feature map, projecting the input-layer node sequence to obtain the hidden-layer node sequence corresponding to the first hidden layer, and taking the first hidden layer as the currently processed hidden layer.
Then, the hidden-layer node sequence of the next hidden layer is obtained by nonlinear mapping according to the hidden-layer node sequence corresponding to the currently processed hidden layer and the weights and biases of the neuron nodes corresponding to the currently processed hidden layer; the next hidden layer is taken as the currently processed hidden layer, and this step is repeated until the output layer is reached, obtaining the context expression information probability matrix output by the output layer and corresponding to the category of the convolution feature map.
After the convolution feature map is input into the context switchable neural network, the local feature maps generated at different resolutions are sent to a pixel-wise classifier for semantic segmentation. The pixel-wise classifier outputs a set of class labels for the pixels of the local feature map; the sequence of class labels can be represented as y = F(F_l), where the function F(·) is a softmax regressor that produces pixel-wise classes, so y = F(F_l) can be used to predict the pixel-wise class labels. The objective function for training the context switchable neural network can be formulated as:

J = Σ_{r_i} L(y(r_i)),

where the function L(·) is the softmax loss, and y(r_i) denotes the predicted category label of the receptive-field region r_i. The computer device can compare the results output by the context switchable neural network with the predicted class labels, thereby training the context switchable neural network.
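A minimal sketch of this objective, assuming a 1 × 1 convolutional classifier head and PyTorch's cross-entropy as the softmax loss (the channel and class counts are placeholders):

```python
import torch.nn as nn

classifier = nn.Conv2d(64, 21, kernel_size=1)  # 64 channels, 21 classes assumed
criterion = nn.CrossEntropyLoss()              # softmax loss L(.) over pixel labels

def objective(final_feature_map, target_labels):
    # final_feature_map: (N, 64, H, W); target_labels: (N, H, W) int64
    logits = classifier(final_feature_map)     # y = F(F_l), per-pixel class scores
    return criterion(logits, target_labels)    # J: averaged softmax loss
```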
The processing procedure in the context switchable neural network is as follows: the convolution feature map generated by the full convolution neural network and the category label of the convolution feature map are used as input; the input convolution feature map is divided into super-pixel regions, and local feature maps are generated from the super-pixel regions. The average depth value of each super-pixel region is calculated, and the super-pixel region is processed according to the size of the average depth value. When the average depth value is smaller than the conditional depth value, the compression architecture processes the super-pixel region to obtain the context expression information of the compressed super-pixel region; when the average depth value is greater than or equal to the conditional depth value, the extension architecture processes the super-pixel region to obtain the context expression information of the extended super-pixel region; the obtained context expression information is then output.
In one embodiment, a gradient descent method is employed to adjust the weight parameters in the context-switchable neural network.
The gradient is calculated as:

∂J/∂D_l(r_i) = Σ_{r_k ∈ Φ(S_n)} ∂J/∂F_{l+1}(r_k) + Σ_{r_j ∈ Φ(S_m)} ∂J/∂F_{l+1}(r_j),

where S_n and S_m are two adjacent super-pixel regions, r_i, r_j, and r_k denote receptive fields in the image to be segmented, and J denotes the objective function for training the context switchable neural network model. The terms ∂J/∂F_{l+1}(·) denote the update signals. Adjusting the weight parameters of the context switchable neural network with this gradient optimizes the intermediate feature map. The receptive-field region r_i receives from its own super-pixel region S_n the update signal ∂J/∂F_{l+1}(r_k) of the receptive-field region r_k; this update signal adjusts features located in the same super-pixel region so that these regions exhibit object co-existence. The receptive-field region r_i also receives, through F_{l+1}, the update signal of the receptive-field region r_j in the neighboring super-pixel region S_m.
In the gradient formula, ∂J/∂F_{l+1}(r_j) denotes the update signal coming from the receptive-field region r_j. When the update signal is transmitted from r_j to r_i, it is weighted according to the signal produced for r_j, so that the receptive field of r_j is widened. At the same time, the parameters λ_c and λ_e are determined by comparing the average depth values of the super-pixel regions S_n and S_m: according to the comparison, the signal can be weighted by λ_c for compression or by λ_e for extension. For example, the update signal transmitted from r_j can be expanded into a compression term weighted by λ_c and an extension term weighted by λ_e.
when d (S)n)<d(Sm) I.e. super-pixel region SnIs smaller than the super pixel region SmAt average depth value of
Figure BDA0001661517590000155
For signals transmitted from receptive field region rj
Figure BDA0001661517590000156
The gradients passed back are weighted. The compression architecture C (: can be optimized by a backward pass, which refers to the objective function
Figure BDA0001661517590000157
To compression architecture C (: to). Wherein the vector C (D) is reweightedl(rk)) also participate in the update signal
Figure BDA0001661517590000158
Vector C (D) is used in training context-switchable neural networksl(rk)) to select a local feature map Dl(rk) information useful for segmentation to construct an intermediate feature map Fl+1(rj). With the re-weighting vector C (D)l(rk)) together, the information useful for segmentation in the receptive field region rj can better guide the receptive field region ri to update information.
When d (S)n)≥d(Sm) I.e. super-pixel region SnIs greater than or equal to the super pixel region SmAt average depth value of
Figure BDA0001661517590000159
Will be aligned with the signal
Figure BDA00016615175900001510
An influence is produced. The hopping connection between the receptor field region rj and the receptor field region ri can be formed by a factor of 1, and the hopping connection means that information between the receptor field region ri and rj is not propagated through any neural network structure. This information is weighted by a factor of 1 as it propagates between the different regions, without any change in the signal. The extended architecture obtains context expression information by widening the receptive field, but the large convolution kernel of the extended architecture may scatter the backward propagating signal from the receptive field region rj to the receptive field region ri during training, allowing the backward propagating signal to propagate directly from the receptive field region rj to the receptive field region ri using a hopping connection.
The weight parameters of the context switchable neural network can thus be optimized through the gradient descent algorithm, which facilitates image segmentation.
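A minimal sketch of one such gradient-descent update, reusing the objective sketched above; `model`, `images`, and `labels` are placeholders for the assembled network and a training batch, not part of the patented method:

```python
import torch

def train_step(model, images, labels, optimizer):
    """One gradient-descent update of the network weights.

    `model` stands for the assembled full convolution network plus
    context switchable network, and `objective` is the pixel-wise
    softmax loss sketched earlier (both are assumptions).
    """
    optimizer.zero_grad()
    loss = objective(model(images), labels)
    loss.backward()   # gradients flow back through the compression branch
                      # (re-weighting vector) or the factor-1 skip connection
    optimizer.step()  # adjust weight parameters by gradient descent
    return loss.item()

# e.g. optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
```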
In one embodiment, the architecture of the context switchable neural network is shown in FIG. 8. After the image to be segmented is input into the full convolution neural network, a plurality of convolution feature maps can be output: a first convolution feature map 810, a second convolution feature map 820, a third convolution feature map 830, a fourth convolution feature map 840, and so on. Taking the fourth convolution feature map 840 as an example, it is input to the context switchable neural network, which divides the fourth convolution feature map 840 into super-pixel regions and generates the local feature map 844 from the super-pixel regions. The context switchable neural network calculates the average depth value of each super-pixel region and selects the compression or extension architecture according to the average depth value to generate the context expression information 846. The context switchable neural network then generates the intermediate feature map 842 from the local feature map 844 and the context expression information 846, and the intermediate feature map 842 is used for image segmentation.
In one embodiment, the local structure corresponding to the average depth values of the super-pixel regions is shown in FIG. 9. The context switchable neural network calculates the average depth value of each super-pixel region and compares it with the conditional depth value to determine whether the super-pixel region is processed using the compression architecture or the extension architecture. For example, the first super-pixel region 910 has an average depth value of 6.8, the second super-pixel region 920 has an average depth value of 7.5, the third super-pixel region 930 has an average depth value of 7.3, the fourth super-pixel region 940 has an average depth value of 3.6, the fifth super-pixel region 950 has an average depth value of 4.3, and the sixth super-pixel region 960 has an average depth value of 3.1. With a preset conditional depth value of 5.0, the fourth super-pixel region 940, the fifth super-pixel region 950, and the sixth super-pixel region 960, whose average depth values are below 5.0, are processed using the compression architecture, and the first super-pixel region 910, the second super-pixel region 920, and the third super-pixel region 930 are processed using the extension architecture.
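A minimal sketch of this per-region decision, using the example values above:

```python
# Average depth value per super-pixel region (example values from FIG. 9)
avg_depth = {910: 6.8, 920: 7.5, 930: 7.3, 940: 3.6, 950: 4.3, 960: 3.1}
conditional_depth = 5.0  # preset conditional depth value

for region, depth in sorted(avg_depth.items()):
    # smaller average depth -> more diverse information -> compression
    branch = "compression" if depth < conditional_depth else "extension"
    print(f"super-pixel region {region}: depth {depth} -> {branch} architecture")
```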
In one embodiment, an image segmentation method is provided, which is illustrated as applied to the computer device shown in fig. 1.
First, a computer device may acquire an image to be segmented. Image segmentation refers to the division of an image into a number of specific regions with unique properties. The image to be segmented may be acquired in real time or may be pre-stored, for example, the image to be segmented acquired by the user may be acquired in real time through a camera, or the image to be segmented may be pre-stored in a database and then the image to be segmented is acquired from the database.
Then, the computer device can input the image to be segmented as an input variable of the full convolution neural network and output the convolution feature map. The image to be segmented is used as the input data of the pre-trained full convolution neural network and fed into the convolution layers, where the convolution kernels scan the input image according to their size and stride and perform the convolution operation. After convolution, the pooling layers process the result and effectively reduce its dimensionality. The full convolution neural network outputs the convolution feature map obtained after the convolution layer and pooling layer processing.
The computer device may then input the convolution feature map as an input variable of the context switchable neural network and output context expression information. The context switchable neural network is pre-trained according to the image structure and the depth data; it is a full convolution neural network and also comprises convolution layers and pooling layers. Context expression information refers to some or all of the information that can affect an object in the image. After the convolution feature map is input as the input variable of the context switchable neural network, the convolution layers and pooling layers in the context switchable neural network process the convolution feature map, and the context expression information of the convolution feature map can be obtained.
The computer device may also divide the convolution feature map into super-pixel regions, the super-pixel regions being sub-regions of the convolution feature map. After the image is divided into superpixels, a number of regions of different sizes can be obtained, and these regions contain effective information, such as color histograms and texture information. For example, if there is a person in the image, we can perform superpixel segmentation on the image of the person and then, by extracting features of each small region, recognize which part of the human body (head, shoulder, leg) each region belongs to, thereby establishing a joint image of the human body. After the convolution feature map is divided into super-pixel regions, a plurality of super-pixel regions can be obtained; the obtained super-pixel regions do not overlap, and each super-pixel region is a sub-region of the convolution feature map. The computer device may also generate a local feature map from each super-pixel region.
The computer device may also calculate an average depth value for the super-pixel region. The gray value of each pixel point in the depth image can be used for representing the distance between a certain point in the scene and the camera, and the depth value is the distance value between the certain point in the scene and the camera. Multiple objects can coexist in a super pixel region at the same time. By obtaining the depth value of each object in the super-pixel region, the average depth value of the whole super-pixel region can be calculated according to the depth value of each object. The computer device may also generate context expression information corresponding to the super-pixel region from the average depth value.
The computer device may also compare the average depth value with the conditional depth value. When the average depth value is smaller than the conditional depth value, the super-pixel region is compressed; when the average depth value is greater than or equal to the conditional depth value, the super-pixel region is expanded. For compression, the local feature map corresponding to the super-pixel region is input into three preset convolutional neural networks for processing to obtain the compressed super-pixel region, wherein the three convolutional neural networks comprise two with convolution kernels of size 1 and one with a convolution kernel of size 3. For expansion, the local feature map corresponding to the super-pixel region is input into three preset convolutional neural networks for processing to obtain the expanded super-pixel region, wherein the three convolutional neural networks comprise two with convolution kernels of size 7 and one with a convolution kernel of size 1.
Then, the context switchable neural network is trained by: obtaining an input-layer node sequence according to the convolution feature map and the category of the convolution feature map, projecting the input-layer node sequence to obtain the hidden-layer node sequence corresponding to the first hidden layer, and taking the first hidden layer as the currently processed hidden layer; and obtaining the hidden-layer node sequence of the next hidden layer by nonlinear mapping according to the hidden-layer node sequence corresponding to the currently processed hidden layer and the weights and biases of the neuron nodes corresponding to the currently processed hidden layer, taking the next hidden layer as the currently processed hidden layer, and repeating this step until the output layer is reached, obtaining the context expression information probability matrix output by the output layer and corresponding to the category of the convolution feature map.
It should be understood that, although the steps in the above-described flowcharts are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in a strict order unless explicitly stated in the application, and may be performed in other orders. Moreover, at least a portion of the steps in the above-described flowcharts may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of performing the sub-steps or the stages is not necessarily sequential, but may be performed alternately or alternatingly with other steps or at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 10, there is provided an image segmentation apparatus including: an image acquisition module 1010, a feature map output module 1020, an information output module 1030, and a feature map generation module 1040, wherein:
an image obtaining module 1010, configured to obtain an image to be segmented.
A feature map output module 1020, configured to input the image to be segmented as an input variable of the full convolution neural network and output a convolution feature map.
An information output module 1030, configured to input the convolution feature map as an input variable of the context switchable neural network and output context expression information.
And a feature map generation module 1040, configured to generate an intermediate feature map according to the convolution feature map and the context expression information, where the intermediate feature map is used to perform image segmentation.
In one embodiment, the information output module 1030 may be further configured to divide the convolution feature map into super-pixel regions, where the super-pixel regions are sub-regions of the convolution feature map, and generate the local feature map according to the super-pixel regions.
In one embodiment, the information output module 1030 may be further configured to calculate an average depth value of the super pixel region, and generate context expression information corresponding to the super pixel region according to the average depth value.
In one embodiment, as shown in fig. 11, the information output module 1030 includes a comparison module 1032, a compression module 1034, and an expansion module 1036, where:
a comparison module 1032 for comparing the average depth value with the conditional depth value.
A compression module 1034 configured to compress the super-pixel region when the average depth value is less than the conditional depth value.
An expansion module 1036 for expanding the super-pixel region when the average depth value is greater than or equal to the conditional depth value.
In an embodiment, the compression module 1034 may further be configured to input the local feature map corresponding to the super-pixel region into three preset convolutional neural networks for processing to obtain the compressed super-pixel region, wherein the three convolutional neural networks comprise two with convolution kernels of size 1 and one with a convolution kernel of size 3.
In an embodiment, the expansion module 1036 may further be configured to input the local feature map corresponding to the super-pixel region into three preset convolutional neural networks for processing to obtain the expanded super-pixel region, wherein the three convolutional neural networks comprise two with convolution kernels of size 7 and one with a convolution kernel of size 1.
In one embodiment, the context switchable neural network in the image segmentation apparatus is trained by: obtaining an input-layer node sequence according to the convolution feature map and the category of the convolution feature map, projecting the input-layer node sequence to obtain the hidden-layer node sequence corresponding to the first hidden layer, and taking the first hidden layer as the currently processed hidden layer; and obtaining the hidden-layer node sequence of the next hidden layer by nonlinear mapping according to the hidden-layer node sequence corresponding to the currently processed hidden layer and the weights and biases of the neuron nodes corresponding to the currently processed hidden layer, taking the next hidden layer as the currently processed hidden layer, and repeating this step until the output layer is reached, obtaining the context expression information probability matrix output by the output layer and corresponding to the category of the convolution feature map.
For specific limitations of the image segmentation apparatus, reference may be made to the above limitations of the image segmentation method, which are not repeated here. The respective modules in the image segmentation apparatus can be implemented wholly or partially by software, hardware, or a combination thereof. The modules can be embedded in hardware form in, or be independent of, a processor in the computer device, or can be stored in software form in a memory in the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring an image to be segmented; inputting the image to be segmented as an input variable of a full convolution neural network, and outputting a convolution feature map; inputting the convolution feature map as an input variable of a context switchable neural network, and outputting context expression information; and generating an intermediate feature map according to the convolution feature map and the context expression information, wherein the intermediate feature map is used for image segmentation.
In one embodiment, the processor, when executing the computer program, further performs the steps of: dividing the convolution feature map into super-pixel regions, wherein the super-pixel regions are sub-regions of the convolution feature map; and generating a local feature map from the super-pixel regions.
In one embodiment, the processor, when executing the computer program, further performs the steps of: calculating the average depth value of the super pixel area; and generating context expression information corresponding to the super-pixel area according to the average depth value.
In one embodiment, the processor, when executing the computer program, further performs the steps of: comparing the average depth value with the conditional depth value; when the average depth value is smaller than the conditional depth value, compressing the super-pixel area; and when the average depth value is larger than or equal to the condition depth value, expanding the super pixel area.
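A hedged sketch of this switch follows: the average depth of each superpixel is compared with the conditional depth value, and the local feature map is routed through a compression branch for nearer regions and an expansion branch for farther ones. Realizing the per-region choice as a pixel mask that mixes the two branch outputs, and the requirement that both branches return same-shaped maps, are our assumptions.

```python
import torch

def switch_by_depth(local_feat, depth, labels, d_cond, compress, expand):
    """Route each superpixel through compression or expansion by its average depth.

    local_feat: (1, C, H, W) local feature map; depth: (H, W) depth map;
    labels: (H, W) long tensor of superpixel ids; d_cond: conditional depth value;
    compress / expand: callables returning same-shaped feature maps.
    """
    n = int(labels.max()) + 1
    idx = labels.view(-1)
    sums = torch.zeros(n).scatter_add_(0, idx, depth.view(-1))
    counts = torch.zeros(n).scatter_add_(0, idx, torch.ones(idx.numel()))
    avg_depth = sums / counts.clamp(min=1)       # average depth per superpixel
    near = (avg_depth < d_cond)[labels]          # (H, W) mask: True -> compress
    return torch.where(near[None, None], compress(local_feat), expand(local_feat))
```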
In one embodiment, the processor, when executing the computer program, further performs the steps of: inputting the local feature map corresponding to the super-pixel area into three preset convolutional neural networks for processing to obtain a compressed super-pixel area, where the three convolutional neural networks comprise two networks with 1×1 convolution kernels and one network with a 3×3 convolution kernel.
In one embodiment, the processor, when executing the computer program, further performs the steps of: inputting the local feature map corresponding to the super-pixel area into three preset convolutional neural networks for processing to obtain an expanded super-pixel area, where the three convolutional neural networks comprise two networks with 7×7 convolution kernels and one network with a 1×1 convolution kernel.
In one embodiment, the context switchable neural network is trained by: obtaining an input layer node sequence according to the convolution feature map and the category of the convolution feature map, projecting the input layer node sequence to obtain a hidden layer node sequence corresponding to a first hidden layer, and taking the first hidden layer as the currently processed hidden layer; then obtaining the hidden layer node sequence of the next hidden layer by nonlinear mapping according to the hidden layer node sequence of the currently processed hidden layer and the weights and biases of its neuron nodes, taking the next hidden layer as the currently processed hidden layer, and repeating this step until the output layer is reached, to obtain a context expression information probability matrix, output by the output layer, corresponding to the category of the convolution feature map.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; the computer program, when executed by a processor, implements the following steps:
acquiring an image to be segmented; inputting an image to be segmented into an input variable of a full convolution neural network, and outputting a convolution characteristic diagram; inputting the convolution characteristic diagram into an input variable of a context switchable neural network, and outputting context expression information; and generating an intermediate feature map according to the convolution feature map and the context expression information, wherein the intermediate feature map is used for carrying out image segmentation.
In one embodiment, the computer program, when executed by the processor, further implements the steps of: dividing the convolution feature map into super pixel areas, wherein the super pixel areas are sub-areas of the convolution feature map; and generating a local feature map from the super pixel areas.
In one embodiment, the computer program, when executed by the processor, further implements the steps of: calculating the average depth value of the super pixel area; and generating context expression information corresponding to the super-pixel area according to the average depth value.
In one embodiment, the computer program, when executed by the processor, further implements the steps of: comparing the average depth value with the conditional depth value; when the average depth value is smaller than the conditional depth value, compressing the super-pixel area; and when the average depth value is larger than or equal to the conditional depth value, expanding the super pixel area.
In one embodiment, the computer program, when executed by the processor, further implements the steps of: inputting the local feature map corresponding to the super-pixel area into three preset convolutional neural networks for processing to obtain a compressed super-pixel area, where the three convolutional neural networks comprise two networks with 1×1 convolution kernels and one network with a 3×3 convolution kernel.
In one embodiment, the computer program, when executed by the processor, further implements the steps of: inputting the local feature map corresponding to the super-pixel area into three preset convolutional neural networks for processing to obtain an expanded super-pixel area, where the three convolutional neural networks comprise two networks with 7×7 convolution kernels and one network with a 1×1 convolution kernel.
In one embodiment, the context switchable neural network is trained by: obtaining an input layer node sequence according to the convolution feature map and the category of the convolution feature map, projecting the input layer node sequence to obtain a hidden layer node sequence corresponding to a first hidden layer, and taking the first hidden layer as the currently processed hidden layer; then obtaining the hidden layer node sequence of the next hidden layer by nonlinear mapping according to the hidden layer node sequence of the currently processed hidden layer and the weights and biases of its neuron nodes, taking the next hidden layer as the currently processed hidden layer, and repeating this step until the output layer is reached, to obtain a context expression information probability matrix, output by the output layer, corresponding to the category of the convolution feature map.
The technical solution of the present application has been verified as feasible through experiments; the specific experimental procedure is as follows:
In the experiments, the context switchable neural network of the present invention was tested on two common benchmarks for semantic segmentation of RGB-D (Red, Green, Blue, Depth Map) images, namely the NYUDv2 dataset and the SUN-RGBD dataset. The NYUDv2 dataset has been widely used to evaluate segmentation performance and contains 1449 RGB-D images, of which 795 are used for training and 654 for testing. A validation set of 414 images was selected from the original training set. Images are annotated pixel by pixel with 40 categories. The above method was evaluated on the NYUDv2 dataset, and the SUN-RGBD dataset was further used for comparison with state-of-the-art methods.
Next, a multi-scale test is used to compute the segmentation results: the test image is resized at four scales (0.6, 0.8, 1.0, and 1.1) before being provided to the network, and the output segmentation scores of the rescaled images are averaged before post-processing with a conditional random field (CRF).
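A minimal sketch of this multi-scale test, assuming bilinear resizing and a model that returns per-class score maps at input resolution:

```python
import torch
import torch.nn.functional as F

def multiscale_scores(model, image, scales=(0.6, 0.8, 1.0, 1.1)):
    """Average segmentation scores over rescaled copies of a test image.

    image: (1, 3, H, W) tensor. Each rescaled copy is segmented, its score map
    is resized back to the original resolution, and the scores are averaged;
    the averaged map would then go to CRF post-processing.
    """
    _, _, H, W = image.shape
    acc = 0.0
    with torch.no_grad():
        for s in scales:
            resized = F.interpolate(image, scale_factor=s, mode="bilinear",
                                    align_corners=False)
            scores = model(resized)
            acc = acc + F.interpolate(scores, size=(H, W), mode="bilinear",
                                      align_corners=False)
    return acc / len(scales)
```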
For the experiments on the NYUDv2 dataset, the sensitivity to the number of superpixels was evaluated first. In a context switchable neural network, the control of the context expression information depends in part on the size of the superpixels. Superpixels were generated at empirically selected scales of 500, 1000, 2000, 4000, 8000 and 12000. For each scale, a context switchable neural network was trained based on the ResNet-101 model; the depth image input to the context switchable neural network is used to switch features, and the RGB image is used to segment the image. The segmentation accuracy on the NYUDv2 validation set is shown in Table 1:
Superpixel scale          500    1000   2000   4000   8000   12000
Segmentation accuracy     42.7   43.5   45.6   43.6   44.2   42.9

TABLE 1
As shown in Table 1, the segmentation accuracy for each scale is expressed as mean intersection-over-union (%). The segmentation accuracy is lowest when the scale is set to 500, because the superpixels are too small and therefore contain too little context information. As the scale increases, segmentation performance improves, and the accuracy of the context switchable neural network is best when the scale is set to 2000. Too large a superpixel may degrade performance, since it may include additional objects and thus limit the preservation of consistent superpixel properties. The scale of 2000 is therefore used to construct context switchable neural networks in the subsequent experiments.
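For reference, the metric behind Table 1 can be sketched as follows; this per-image form over the 40 NYUDv2 classes is an illustration (benchmarks usually accumulate intersections and unions over the whole test set before dividing):

```python
import numpy as np

def mean_iou(pred, gt, num_classes=40):
    """Mean intersection-over-union (%) between predicted and ground-truth label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return 100.0 * float(np.mean(ious))
```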
Next, the strategy for local structural information transfer was examined. Local structural information transfer yields features that have a stronger relationship to their regions. To analyze it, local structural information transfer was replaced by other strategies that use structural information, as shown in Table 2. The full version of the context switchable neural network achieves a segmentation score of 45.6 on the NYUDv2 validation set. The network was then retrained without passing the local structural information of the superpixels, processing all intermediate features with a global identity map, which lowers the accuracy to 40.3. In addition, interpolation and deconvolution were applied to generate new features in which each region contains information from a broader but regular receptive field; these methods generate structure-insensitive features, with lower scores than the context switchable neural network.
TABLE 2 (table provided as an image in the original document)
As shown in Table 2, there are several ways to convey the local structural information of a superpixel. One is to average the features of the regions within the same superpixel, i.e., to implement the local structure mapping with a uniform averaging kernel; this achieves a segmentation score of 43.8. Since the averaging kernel contains no learnable parameters, it misses the flexibility to select useful information. Using larger convolution kernels, e.g., of size 3×3 and 5×5, yields poorer results than the 1×1 kernel, which captures the finer structure of the superpixel.
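The parameter-free averaging baseline from Table 2, in which every pixel's feature is replaced by the mean feature of its superpixel, can be sketched as follows; the tensor layout is an assumption.

```python
import torch

def superpixel_mean_features(feat, labels):
    """Replace each pixel's feature with the mean feature of its superpixel.

    feat: (C, H, W) feature map; labels: (H, W) long tensor of superpixel ids.
    This is the learnable-parameter-free averaging-kernel baseline.
    """
    C, H, W = feat.shape
    n = int(labels.max()) + 1
    flat = feat.view(C, -1)                                  # (C, H*W)
    idx = labels.view(-1)                                    # (H*W,)
    sums = torch.zeros(C, n).scatter_add_(1, idx.expand(C, -1), flat)
    counts = torch.zeros(n).scatter_add_(0, idx, torch.ones(idx.numel()))
    means = sums / counts.clamp(min=1)                       # (C, n) per-superpixel means
    return means[:, idx].view(C, H, W)                       # broadcast back to pixels
```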
The experiments then evaluate the top-down switchable transfer. Given local structural features, top-down switchable information transfer can be applied to generate context expressions guided by superpixels and depth, as follows:
As shown in Table 3, the top-down transfer can be measured under different settings. Without using superpixels and depth, only deconvolution and interpolation are applied to construct the context expression, and the resulting segmentation accuracy is lower than that of the context switchable neural network.
TABLE 3 (table provided as an image in the original document)
In the following tests, only the guidance by superpixels was disabled in the top-down information transfer. Without superpixels, switchable transfers are performed on the compressed and expanded feature maps, with information transfer defined by conventional convolution kernels. Compared with this setup, the full context switchable neural network performs better. Besides superpixels providing a more natural path for information transfer, the average depth computed over each superpixel achieves more stable feature switching by avoiding the noisy depth of isolated regions.
Furthermore, the experiments also investigated the case where depth is not used in the top-down switchable information transfer. In this case, the compressed and expanded feature maps are each used directly as the context expression. As shown in Table 3, the independent compression/expansion feature maps lack the flexibility to identify appropriate segmentation features, and their performance is lower than that of the switchable architecture with depth-driven context expressions.
Next, the experiments examine the compact features used for the context information. The top-down switchable information transfer consists of a compression structure and an expansion structure, which provide different context information; both architectures use compact features to generate context expressions. In the experiments, the compression and expansion structures are varied to show that they achieve effective compact features.
TABLE 4 (table provided as an image in the original document)
Table 4 compares different designs of the compression structure. A simple way to perform information compression is to learn compact features with a 1×1 convolution, followed by another 1×1 convolution to recover the feature dimension; this yields lower accuracy than the compression structure. In contrast to this simple alternative of two consecutive 1×1 convolutions, the compression structure inserts a 3×3 convolution between the two 1×1 convolutions. To some extent, the 3×3 convolution captures a wider context, compensating for the information loss caused by the dimension reduction, while the features it produces remain compact. When the last 1×1 convolution that recovers the feature dimension is removed and the 3×3 convolution is used directly to generate relatively high-dimensional features, performance is again lower than the compression structure. This indicates the importance of the compact features generated by the 3×3 convolution.
Table 5 studies the expansion structure and compares different designs for information expansion. Using only one convolution layer with a single 7×7 kernel to expand the receptive field results in a segmentation score of 43.8. Performance improves when additional kernels are added to the large convolution layer; two 7×7 convolution layers obtain a higher score of 44.2. These scores are still lower than that of the expansion structure, which additionally uses a 1×1 convolution layer to compute compact features.
TABLE 5 (table provided as an image in the original document)
Next, the context switchable neural network is compared with state-of-the-art methods, which are divided into two groups; all methods were evaluated on the NYUDv2 test set. The first group includes methods that use only RGB images for segmentation, whose performance is listed in the "RGB input" column. Deep networks with top-down information transfer produce high-quality segmentation features; in this group, the multipath refinement network achieves the highest accuracy, as shown in Table 6:
TABLE 6 (table provided as an image in the original document)
The context switchable neural network is then compared with the second group of methods, which take RGB-D images as input. Each depth image is encoded into a 3-channel HHA image to retain richer geometric information. The HHA images, instead of RGB images, are used to train a separate segmentation network; the trained network is tested on HHA images to obtain a segmentation score map, which is then combined with the score map computed by the network trained on RGB images. With this combination strategy, the best method is the cascaded feature network, with a result of 47.7. Compared with RGB-only networks, using RGB and HHA images together improves segmentation accuracy.
Here, RGB and HHA images are likewise used as training and testing data. Based on ResNet-101, the context switchable neural network reaches 48.3 points; with a deeper ResNet-152 structure, the segmentation score improves to 49.6, about 2% better than the most advanced method.
As shown in fig. 12, images segmented by the context switchable neural network were compared with images segmented by the state-of-the-art methods, the pictures being taken from the NYUDv2 dataset. The context switchable neural network improves image segmentation accuracy.
Next, experiments were also performed on the SUN-RGBD dataset, which contains 10335 images labeled with 37 classes and has more complex scene and depth conditions than the NYUDv2 dataset. From this dataset, 5285 images were selected for training and the remainder for testing. In this experiment, the context switchable neural network was again compared with methods that take RGB and HHA together as input images. The best previous performance on the SUN-RGBD dataset was produced by the cascaded feature network, whose model is based on the ResNet-152 structure; owing to its more reasonable modeling of information transfer, the context switchable neural network obtains better results with the simpler ResNet-101 structure. With the deeper ResNet-152, a segmentation accuracy of 50.7 is obtained, better than all compared methods.
As shown in fig. 13, images segmented by the context switchable neural network were compared with images segmented by the state-of-the-art methods, the pictures being collected from the SUN-RGBD dataset. The context switchable neural network improves image segmentation accuracy.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered within the scope of this specification.
The above-mentioned embodiments express only several implementations of the present application, and their description is specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (13)

1. A method of image segmentation, the method comprising:
acquiring an image to be segmented;
inputting the image to be segmented into an input variable of a full convolution neural network, and outputting a convolution characteristic diagram;
inputting the convolution characteristic diagram into an input variable of a context switchable neural network, and dividing the convolution characteristic diagram into super pixel regions, wherein the super pixel regions are sub-regions of the convolution characteristic diagram; the context switchable neural network is a full convolution neural network which is trained in advance according to an image structure and depth data;
calculating an average depth value of the super pixel region;
generating context expression information corresponding to the super pixel area according to the average depth value;
and generating an intermediate feature map according to the convolution feature map and the context expression information, wherein the intermediate feature map is used for image segmentation.
2. The method of claim 1, further comprising:
and generating a local feature map according to the super-pixel region.
3. The method of claim 2, wherein generating context expression information corresponding to the super-pixel region according to the average depth value comprises:
comparing the average depth value to a conditional depth value;
compressing the super-pixel region when the average depth value is less than the conditional depth value;
and when the average depth value is larger than or equal to the conditional depth value, expanding the super-pixel area.
4. The method of claim 3, wherein compressing the super-pixel region when the average depth value is less than the conditional depth value comprises:
inputting the local feature map corresponding to the super-pixel region into three preset convolutional neural networks for processing, to obtain a compressed super-pixel region;
wherein the three convolutional neural networks comprise two neural networks with 1×1 convolution kernels and one neural network with a 3×3 convolution kernel.
5. The method of claim 3, wherein expanding the super-pixel region when the average depth value is greater than or equal to the conditional depth value comprises:
inputting the local feature map corresponding to the super-pixel region into three preset convolutional neural networks for processing, to obtain an expanded super-pixel region;
wherein the three convolutional neural networks comprise two neural networks with 7×7 convolution kernels and one neural network with a 1×1 convolution kernel.
6. The method of any of claims 1 to 5, wherein the context-switchable neural network is trained by:
obtaining an input layer node sequence according to the convolution feature map and the category of the convolution feature map, projecting the input layer node sequence to obtain a hidden layer node sequence corresponding to a first hidden layer, and taking the first hidden layer as a current processing hidden layer;
and obtaining a hidden layer node sequence of a next hidden layer by adopting nonlinear mapping according to the hidden layer node sequence corresponding to the hidden layer processed currently and the weight and deviation of each neuron node corresponding to the hidden layer processed currently, taking the next hidden layer as the hidden layer processed currently, repeatedly entering the step of obtaining the hidden layer node sequence of the next hidden layer by adopting nonlinear mapping according to the hidden layer node sequence corresponding to the hidden layer processed currently and the weight and deviation of each neuron node corresponding to the hidden layer processed currently until an output layer, and obtaining a context expression information probability matrix which is output by the output layer and corresponds to the category of the convolution feature map.
7. An image segmentation apparatus, characterized in that the apparatus comprises:
the image acquisition module is used for acquiring an image to be segmented;
the characteristic diagram output module is used for inputting the image to be segmented into an input variable of a full convolution neural network and outputting a convolution characteristic diagram;
the information output module is used for inputting the convolution characteristic diagram into an input variable of a context switchable neural network, and dividing the convolution characteristic diagram into super pixel areas, wherein the super pixel areas are sub-areas of the convolution characteristic diagram; the context switchable neural network is a full convolution neural network which is trained in advance according to an image structure and depth data; calculating an average depth value of the super pixel region; generating context expression information corresponding to the super pixel area according to the average depth value;
and the feature map generation module is used for generating an intermediate feature map according to the convolution feature map and the context expression information, and the intermediate feature map is used for carrying out image segmentation.
8. The apparatus of claim 7, wherein the information output module is further configured to generate a local feature map from the super-pixel region.
9. The apparatus of claim 7, wherein the information output module is further configured to compare the average depth value with a conditional depth value; compressing the super-pixel region when the average depth value is less than the conditional depth value; and when the average depth value is larger than or equal to the conditional depth value, expanding the super-pixel area.
10. The apparatus according to claim 9, wherein the information output module is further configured to, when the average depth value is smaller than the conditional depth value, input the local feature map corresponding to the super-pixel region into three preset convolutional neural networks for processing, so as to obtain a compressed super-pixel region; wherein the three convolutional neural networks comprise two neural networks with 1×1 convolution kernels and one neural network with a 3×3 convolution kernel.
11. The apparatus according to claim 9, wherein the information output module is further configured to, when the average depth value is greater than or equal to the conditional depth value, input the local feature map corresponding to the super-pixel region into three preset convolutional neural networks for processing, so as to obtain an expanded super-pixel region; wherein the three convolutional neural networks comprise two neural networks with 7×7 convolution kernels and one neural network with a 1×1 convolution kernel.
12. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN201810463609.0A 2018-05-15 2018-05-15 Image segmentation method and device, computer equipment and storage medium Active CN108765425B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810463609.0A CN108765425B (en) 2018-05-15 2018-05-15 Image segmentation method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108765425A CN108765425A (en) 2018-11-06
CN108765425B true CN108765425B (en) 2022-04-22

Family

ID=64007824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810463609.0A Active CN108765425B (en) 2018-05-15 2018-05-15 Image segmentation method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108765425B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11409994B2 (en) 2018-05-15 2022-08-09 Shenzhen University Methods for image segmentation, computer devices, and storage mediums
CN109785336B (en) * 2018-12-18 2020-11-27 深圳先进技术研究院 Image segmentation method and device based on multipath convolutional neural network model
CN109886990B (en) * 2019-01-29 2021-04-13 理光软件研究所(北京)有限公司 Image segmentation system based on deep learning
CN110490876B (en) * 2019-03-12 2022-09-16 珠海全一科技有限公司 Image segmentation method based on lightweight neural network
CN110047047B (en) * 2019-04-17 2023-02-10 广东工业大学 Method for interpreting three-dimensional morphology image information device, apparatus and storage medium
CN110689020A (en) * 2019-10-10 2020-01-14 湖南师范大学 Segmentation method of mineral flotation froth image and electronic equipment
CN110689514B (en) * 2019-10-11 2022-11-11 深圳大学 Training method and computer equipment for new visual angle synthetic model of transparent object
CN110852394B (en) * 2019-11-13 2022-03-25 联想(北京)有限公司 Data processing method and device, computer system and readable storage medium
CN111739025B (en) * 2020-05-08 2024-03-19 北京迈格威科技有限公司 Image processing method, device, terminal and storage medium
CN112215243A (en) * 2020-10-30 2021-01-12 百度(中国)有限公司 Image feature extraction method, device, equipment and storage medium
CN113421276B (en) * 2021-07-02 2023-07-21 深圳大学 Image processing method, device and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10095950B2 (en) * 2015-06-03 2018-10-09 Hyperverge Inc. Systems and methods for image processing

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106127725A (en) * 2016-05-16 2016-11-16 北京工业大学 A kind of millimetre-wave radar cloud atlas dividing method based on multiresolution CNN
CN107784654A (en) * 2016-08-26 2018-03-09 杭州海康威视数字技术股份有限公司 Image partition method, device and full convolutional network system
CN106530320A (en) * 2016-09-30 2017-03-22 深圳大学 End-to-end image segmentation processing method and system
CN107169974A (en) * 2017-05-26 2017-09-15 中国科学技术大学 It is a kind of based on the image partition method for supervising full convolutional neural networks more
CN107403430A (en) * 2017-06-15 2017-11-28 中山大学 A kind of RGBD image, semantics dividing method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Cascaded_Feature_Network_for_Semantic_Segmentation_of_RGB-D_Images";Di Lin 等;《2017 IEEE International Conference on Computer Vision (ICCV)》;20171231;第1320-1328页 *

Also Published As

Publication number Publication date
CN108765425A (en) 2018-11-06

Similar Documents

Publication Publication Date Title
CN108765425B (en) Image segmentation method and device, computer equipment and storage medium
WO2019218136A1 (en) Image segmentation method, computer device, and storage medium
Fu et al. Deep multiscale detail networks for multiband spectral image sharpening
CN109949255B (en) Image reconstruction method and device
CN109978756B (en) Target detection method, system, device, storage medium and computer equipment
Wang et al. Deep networks for image super-resolution with sparse prior
CN109063742B (en) Butterfly identification network construction method and device, computer equipment and storage medium
CN111047516A (en) Image processing method, image processing device, computer equipment and storage medium
CN110648334A (en) Multi-feature cyclic convolution saliency target detection method based on attention mechanism
CN110136062B (en) Super-resolution reconstruction method combining semantic segmentation
CN111192292A (en) Target tracking method based on attention mechanism and twin network and related equipment
CN111080628A (en) Image tampering detection method and device, computer equipment and storage medium
CN110163197A (en) Object detection method, device, computer readable storage medium and computer equipment
Li et al. FilterNet: Adaptive information filtering network for accurate and fast image super-resolution
CN112819910A (en) Hyperspectral image reconstruction method based on double-ghost attention machine mechanism network
CN111028153A (en) Image processing and neural network training method and device and computer equipment
CN112183295A (en) Pedestrian re-identification method and device, computer equipment and storage medium
CN110796162A (en) Image recognition method, image recognition model training method, image recognition device, image recognition training device and storage medium
CN113159143A (en) Infrared and visible light image fusion method and device based on jump connection convolution layer
CN113674191B (en) Weak light image enhancement method and device based on conditional countermeasure network
CN113034358A (en) Super-resolution image processing method and related device
Garg et al. LiCENt: Low-light image enhancement using the light channel of HSL
CN115760814A (en) Remote sensing image fusion method and system based on double-coupling deep neural network
CN114612709A (en) Multi-scale target detection method guided by image pyramid characteristics
Xiao et al. Physics-based GAN with iterative refinement unit for hyperspectral and multispectral image fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant