CN112749707A - Method, apparatus, and medium for object segmentation using neural networks


Info

Publication number: CN112749707A
Application number: CN202110097767.0A
Authority: CN (China)
Prior art keywords: network, feature map, sub, image, space
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 伍天意, 郭国栋, 朱欤
Current Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110097767.0A
Publication of CN112749707A

Classifications

    • G06V 10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06F 18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • G06F 18/23: Pattern recognition; clustering techniques
    • G06N 3/045: Neural networks; combinations of networks
    • G06N 3/08: Neural networks; learning methods
    • G06V 10/751: Image or video pattern matching; comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching


Abstract

The present disclosure provides a method, an apparatus, and a medium for object segmentation using a neural network, which relate to the technical field of artificial intelligence, and in particular, to the technical field of deep learning and computer vision. The neural network includes: a first sub-network configured to receive a prior image to generate a prior feature map of the prior image; a second sub-network subsequent to the first sub-network configured to receive a prior feature map of the prior image and a target segmentation result of the prior image to generate at least one set of template features of the prior image; a third sub-network juxtaposed to the first sub-network and configured to receive the current image to generate a current feature map of the current image; a fourth sub-network subsequent to the second and third sub-networks, configured to receive the current feature map and at least one set of template features of a prior image to generate a space-time information feature map; and a fifth sub-network following the fourth sub-network, configured to receive the space-time information feature map to generate a predicted target segmentation result for the current image.

Description

Method, apparatus, and medium for object segmentation using neural networks
Technical Field
The present disclosure relates to the field of artificial intelligence technology, in particular to the field of deep learning and computer vision technology, and more particularly to a method, apparatus, and medium for object segmentation using a neural network.
Background
Artificial intelligence is the discipline that studies how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it spans both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
Target segmentation is an active direction in computer vision and is widely applied in fields such as autonomous driving, intelligent video surveillance, and industrial inspection. Reducing labor costs by applying computer vision is of great practical significance, so target segmentation has become a research hotspot in both theory and application in recent years. Owing to the wide adoption of deep learning, target segmentation methods have developed rapidly, but the accuracy of existing methods still needs improvement.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The present disclosure provides a method, apparatus, and medium for object segmentation using neural networks.
According to an aspect of the present disclosure, there is provided a neural network configured to receive a current image, a previous image, and a target segmentation result of the previous image to predict a target segmentation result of the current image, the target segmentation result indicating a class of each pixel in a corresponding image, the neural network comprising: a first sub-network configured to receive the prior image to generate a prior feature map of the prior image; a second sub-network subsequent to the first sub-network, the second sub-network configured to receive a prior feature map of the prior image and a target segmentation result of the prior image to generate at least one set of template features of the prior image; a third sub-network juxtaposed with the first sub-network, the third sub-network configured to receive the current image to generate a current feature map of the current image; a fourth sub-network subsequent to the second and third sub-networks, the fourth sub-network configured to receive the current feature map and at least one set of template features of the prior image to generate a space-time information feature map; and a fifth sub-network following the fourth sub-network, the fifth sub-network configured to receive the space-time information feature map to generate a predicted target segmentation result for the current image.
According to another aspect of the present disclosure, there is provided a method of object segmentation using a neural network, the neural network including a first sub-network, a second sub-network, a fourth sub-network, and a fifth sub-network connected in sequence, and a third sub-network preceding the fourth sub-network, the method including: processing a prior image with the first sub-network, wherein the first sub-network is configured to receive the prior image to generate a prior feature map of the prior image; processing the prior feature map and the target segmentation result of the prior image with the second sub-network, wherein the second sub-network is configured to receive the prior feature map and the target segmentation result of the prior image to generate at least one set of template features of the prior image; processing a current image with the third sub-network, wherein the third sub-network is configured to receive the current image to generate a current feature map of the current image; processing the current feature map and at least one set of template features of the prior image with the fourth sub-network, wherein the fourth sub-network is configured to receive the current feature map and the at least one set of template features of the prior image to generate a space-time information feature map; and processing the space-time information feature map with the fifth sub-network, wherein the fifth sub-network is configured to receive the space-time information feature map to generate a predicted target segmentation result of the current image, wherein the target segmentation result indicates the category of each pixel in the corresponding image.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described target segmentation method.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the above-described target segmentation method.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program realizes the above object segmentation method when executed by a processor.
According to one or more embodiments of the present disclosure, the feature map and the target segmentation result of the previous image are input into the second sub-network to obtain the template features of the previous image, and at least one set of template features of the previous image, together with the feature map of the current image, is input into the fourth sub-network to obtain the space-time information feature map. The neural network can thus obtain the target segmentation result of the current image based on the similarity information, characterized by the space-time information feature map, between the feature map of the current image and each set of template features of the previous image. Meanwhile, because the template features of the previous image, rather than its feature map, are used as the input of the fourth sub-network, the neural network's demands on computing resources and video memory during prediction are greatly reduced, which significantly improves its processing speed; and because fewer features are used, the generalization capability of the neural network is improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 shows a block diagram of an application architecture of a neural network according to an exemplary embodiment of the present disclosure;
FIG. 2 shows a block diagram of an application architecture of a fourth sub-network according to an exemplary embodiment of the present disclosure;
FIG. 3 shows a block diagram of an application architecture of a neural network according to an exemplary embodiment of the present disclosure;
FIG. 4 shows a block diagram of an application structure of a sixth sub-network according to an exemplary embodiment of the present disclosure;
FIG. 5 shows a flow diagram of a method of target segmentation using a neural network, according to an example embodiment of the present disclosure;
FIG. 6 shows a flowchart for processing at least one set of template features of a current feature map and a prior image using a fourth sub-network, according to an example embodiment of the present disclosure;
FIG. 7 shows a flowchart of a method of target segmentation using a neural network, according to an example embodiment of the present disclosure;
FIG. 8 shows a flowchart for processing a space-time information feature map and a target segmentation result of a prior image using a sixth sub-network according to an example embodiment of the present disclosure; and
FIG. 9 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
In the related art, predicting the target segmentation result of the current image from a plurality of previous images and their target segmentation results requires computing feature maps of the current image and of each previous image, calculating the similarity between each pixel in the feature map of the current image and each pixel in the feature map of each previous image, and obtaining, from these similarities, a space-time information feature map that includes both previous-image and current-image information. However, this method must hold and process a large amount of feature information in video memory and consumes substantial computing resources, so when a corresponding neural network executes a target segmentation task, the prediction process is slow and the performance is poor.
In order to solve these problems, the present disclosure inputs the feature map and the target segmentation result of the previous image into the second sub-network to obtain the template features of the previous image, and inputs at least one set of template features of the previous image, together with the feature map of the current image, into the fourth sub-network to obtain the space-time information feature map. The neural network can thus obtain the target segmentation result of the current image based on the similarity information, characterized by the space-time information feature map, between the feature map of the current image and each set of template features of the previous image. Meanwhile, because the template features of the previous image, rather than its feature map, are used as the input of the fourth sub-network, the neural network's demands on computing resources and video memory during prediction are greatly reduced, which markedly improves its processing speed; and because fewer features are used, the generalization capability of the neural network is improved.
In this disclosure, the term "representation" refers to a user-interactive graphical user interface object displayed on an electronic device, such as a pattern, icon, image, shape, button, specific area, text, or any combination thereof. Illustratively, the representation of an application may be, for example, an icon of the application, the name text of the application, or a combination of an application icon and a text passage introducing the application, and so on. Illustratively, an authentication information representation may be, for example, an authentication information input area, a displayed authentication image, or a pattern characterizing a fingerprint identification request, or the like.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
According to an aspect of the present disclosure, a neural network is provided. The neural network is configured to receive a current image, a previous image, and a target segmentation result of the previous image to predict a target segmentation result of the current image. As shown in fig. 1, the neural network may include: a first sub-network 101 configured to receive a prior image 107 to generate a prior feature map of the prior image; a second sub-network 102, following the first sub-network 101, configured to receive the prior feature map of the prior image and a target segmentation result 106 of the prior image to generate at least one set of template features of the prior image; a third sub-network 103, juxtaposed to the first sub-network 101, configured to receive the current image 108 to generate a current feature map of the current image; a fourth sub-network 104, following the second sub-network 102 and the third sub-network 103, configured to receive the current feature map and at least one set of template features of a previous image to generate a space-time information feature map; and a fifth sub-network 105, following the fourth sub-network 104, configured to receive the space-time information feature map to generate a predicted target segmentation result for the current image. In this way, the feature map and the target segmentation result of the prior image are input into the second sub-network to obtain the template features of the prior image, and at least one set of template features of the prior image, together with the feature map of the current image, is input into the fourth sub-network to obtain the space-time information feature map. The neural network can thus obtain the target segmentation result of the current image based on the similarity information between the feature map of the current image and each set of template features of the prior image. Meanwhile, because the template features of the prior image, rather than its feature map, are used as the input of the fourth sub-network, the neural network's demands on computing resources and video memory during prediction are greatly reduced, which significantly improves its processing speed; and because fewer features are used, the generalization capability of the neural network is improved.
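Schematically, the data flow of fig. 1 can be written as follows. This is a sketch only; the attribute names `sub1` through `sub5` are placeholders for the sub-networks 101-105 and are not named in the patent:

```python
def predict_segmentation(net, prior_img, prior_mask, cur_img):
    """Schematic forward pass through the five sub-networks of fig. 1."""
    prior_feat = net.sub1(prior_img)              # first sub-network 101
    templates = net.sub2(prior_feat, prior_mask)  # second sub-network 102
    cur_feat = net.sub3(cur_img)                  # third sub-network 103 (parallel to 101)
    st_feat = net.sub4(cur_feat, templates)       # fourth sub-network 104
    return net.sub5(st_feat)                      # fifth sub-network 105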
According to some embodiments, the previous image and the current image may be different video frames in the same video, and the previous image may be taken earlier in time than the current image. Therefore, the target segmentation result of the current image is predicted by combining the characteristic information of the current image with the characteristic information of the previous image with the shooting time being earlier than that of the current image, so that the neural network can estimate the information such as the position and the size of the target in the current image based on the time sequence relation between the previous image and the current image, and the prediction capability of the neural network is improved. It is understood that a plurality of previous images may be used, and the feature information of each of the plurality of previous images and the respective target segmentation result may be used in combination with the feature information of the current image to predict the target segmentation result of the current image.
According to some embodiments, the target segmentation result may be used to indicate a class of each pixel in the corresponding image. The target segmentation result may be, for example, a mask, the size of which is the same as the size of the image, and the value of each pixel of the mask can indicate the class of this pixel in the image. For example, in a mask of an image in which a bicycle and a person appear simultaneously, a value of a pixel corresponding to the bicycle may be 1, a value of a pixel corresponding to the person may be 2, and values of other pixels may be 0. It is to be understood that the above is only one way of representing the mask, for example, different channels of the mask may be used to represent different classes of pixels in the image, and other ways than the mask may also be used to characterize the target segmentation result. The target segmentation result of the previous image may be generated by a neural network, or may be pre-labeled, and is not limited herein.
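For instance, a toy mask for the bicycle-and-person example above might look as follows; the size and layout are illustrative only, while the class values are the ones given in the text:

```python
import numpy as np

# 4x4 mask: 1 marks bicycle pixels, 2 marks person pixels, 0 is background.
mask = np.array([[0, 1, 1, 0],
                 [0, 1, 1, 0],
                 [0, 0, 2, 2],
                 [0, 0, 2, 2]])
```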
According to some embodiments, the class of each pixel in the previous image and the current image can characterize whether the pixel belongs to one of the at least one object included in the previous image. When there are a plurality of previous images, the plurality of previous images may include a reference image; when there is only one previous image, that previous image may serve as the reference image. In this case, the class of each pixel in the previous image, the reference image, or the current image can characterize whether the pixel belongs to one of the at least one object included in the reference image. Thus, by determining at least one object in the reference image, the neural network is enabled to predict the segmentation result of those objects in the current image and to focus more on the feature information related to those objects during prediction. In a preferred exemplary embodiment, the target segmentation result of the reference image is manually annotated.
It is understood that the current image does not necessarily include all of the objects included in the segmentation result of the reference image. Illustratively, if the reference image includes a bicycle and a person, then only the bicycle, only the person, or neither may appear in the current image, and the present invention is not limited thereto. When there are a plurality of previous images, the previous images other than the reference image do not necessarily include all the targets included in the segmentation result of the reference image.
According to some embodiments, the reference image may be a previous image whose shooting time is earliest. For example, the neural network may predict a target segmentation result of a second frame based on a first frame, i.e., a reference image, of the video, and then predict a target segmentation result of a third frame based on the first frame and the second frame, and so on until target segmentation results of all video frames in the video are obtained. Illustratively, the previous image except the reference image is a video frame which has obtained the target segmentation result by using a neural network, and the current image is a video frame which needs to execute the target segmentation task.
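As a sketch of this frame-by-frame scheme (the `predict` interface is hypothetical; the patent does not name one):

```python
def segment_video(frames, reference_mask, net):
    """frames[0] is the reference image; its mask is given (e.g. manually
    annotated). Each later frame is predicted from all earlier frames."""
    masks = [reference_mask]
    for t in range(1, len(frames)):
        # Predict frame t using all prior frames and their segmentation results.
        masks.append(net.predict(frames[t], frames[:t], masks[:t]))
    return masks
```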
For example, the first sub-network 101 may be a convolutional neural network, a residual neural network, or a self-constructed network that outputs a feature map of an input image; this is not limited herein. Preferably, the first sub-network 101 employs a ResNet50 network. It will be appreciated that when there are multiple prior images, a group of first sub-networks (not shown), equal in number to the prior images, may be used for parallel processing.
The second sub-network 102 may be configured to receive the feature map of a prior image and its target segmentation result to generate at least one set of template features of the prior image. According to some embodiments, the at least one set of template features of the previous image may have a one-to-one correspondence with the at least one object included in the reference image, and each set of the at least one set of template features of the previous image comprises one or more template features. Thus, the neural network is able to compute a set of template features corresponding to each target in the previous image. It will be appreciated that when there are multiple prior images, a group of second sub-networks (not shown), equal in number to the prior images, may be used for parallel processing.
According to some embodiments, each set of template features of the at least one set of template features of the previous image may be obtained by performing a clustering algorithm on the regions of the previous feature map of the previous image to which the object corresponding to that set of template features is mapped. By using the clustering algorithm, each set of template features can capture the most typical features of the corresponding target, so that the information in the original feature map is retained as much as possible while the video memory consumption and the computing resource consumption of the neural network are greatly reduced.
The specific principle of the clustering algorithm is as follows.

In the second sub-network 102, a set of template features for one of the targets can be represented as a probabilistic mixture model, i.e., a linear combination of a family of distributions, formally defined as:

$$p(s_i \mid \theta) = \sum_{k=1}^{K} w_k \, p_k(s_i \mid \theta)$$

where $s_i$ denotes the $i$-th feature sample, the feature samples being the set of pixels in the prior feature map whose class is the target in question. If the prior feature map has size H × W × C, the index $i$ satisfies

$$1 \le i \le N \le H \times W,$$

where $N$ is the number of pixels corresponding to the target, and each $s_i$ has size 1 × 1 × C. $\theta$ denotes the model parameters, and $p_k(s_i \mid \theta)$ denotes the $k$-th probability model based on vector distance, i.e., the $k$-th template feature, with $1 \le k \le K$, where $K$ is the number of probability models, i.e., the number of template features in each set of template features. The value of $K$ may be, for example, 3, 5, or 10, or another value, and the values of $K$ corresponding to different targets may be the same or different; this is not limited herein. The mixture weights satisfy $0 \le w_k \le 1$ and are normalized:

$$\sum_{k=1}^{K} w_k = 1.$$

$p_k(s_i \mid \theta)$ is specifically defined as:

$$p_k(s_i \mid \theta) = \beta_c(\theta) \, e^{\kappa \, \mu_k^{\top} s_i}$$

where $\beta_c(\theta)$ is a normalization parameter and $\mu_k \in \theta$ is the mean vector of the $k$-th model; here $\theta$ is decomposed into $\mu$ and $\kappa$. The normalization parameter $\beta_c(\kappa)$ is expressed as:

$$\beta_c(\kappa) = \frac{\kappa^{c/2-1}}{(2\pi)^{c/2} \, I_{c/2-1}(\kappa)}$$

where $I_v(\cdot)$ is a Bessel function. Through the above expressions, the probability model corresponding to each template feature can be obtained. By learning the model parameters with the EM algorithm, each template feature $\mu_k$ in each set of template features can be derived. The E-step computes the expected assignment of each feature sample given the current model parameters:

$$\gamma_{ik} = \frac{w_k \, p_k(s_i \mid \theta)}{\sum_{k'=1}^{K} w_{k'} \, p_{k'}(s_i \mid \theta)}$$

and the M-step uses these expectations to update the model parameters, i.e., the template features:

$$\mu_k = \frac{\sum_{i} \gamma_{ik} \, s_i}{\left\lVert \sum_{i} \gamma_{ik} \, s_i \right\rVert_2}$$

Thus, a set of template features for one of the objects is obtained. Executing the above process for each target yields at least one set of template features, one set corresponding to each target.
Like the first sub-network, the third sub-network 103 may be a convolutional neural network, a residual neural network, or a self-constructed network that outputs a feature map of an input image; this is not limited herein. The third sub-network may be structurally identical to the first sub-network 101. Preferably, the third sub-network 103 employs a ResNet50 network.
The fourth sub-network 104 may be configured to receive the current feature map and each set of template features of all prior images to generate a space-time information feature map. According to some embodiments, as shown in fig. 2, the fourth sub-network 104 may comprise: a similarity calculation layer 1041 configured to receive the current feature map 1045 and the at least one set of template features 1044 of the previous image to generate a similarity calculation result between the current feature map and the at least one set of template features of the previous image; a matrix multiplication layer 1042, following the similarity calculation layer 1041, configured to receive the similarity calculation result and the at least one set of template features 1044 of the previous image to generate a space-time matching feature map; and a second concatenation layer 1043, following the matrix multiplication layer 1042, configured to receive the current feature map 1045 and the space-time matching feature map and concatenate them into a space-time information feature map. By calculating the similarity between the features of the current image and all template features of all previous images, and using the similarity information as weights to combine those template features into a space-time matching feature map, an attention mechanism over both time and space is realized, which can improve the accuracy of the neural network's prediction. Concatenating the space-time matching feature map with the current feature map then allows the neural network to predict based on both the matching result and the features of the current image.
According to some embodiments, the similarity calculation may be implemented using, for example, matrix multiplication. The operation of the fourth sub-network 104 may proceed as follows. Let the current feature map have size H × W × C, and let the template features received by the similarity calculation layer 1041, after concatenation, have size T × K × C, where T is the number of previous images and K is the number of template features per set. The current feature map is reshaped to HW × C and the concatenated template features to TK × C; matrix multiplication then yields the similarity matrix output by the similarity calculation layer 1041, of size HW × TK. The similarity matrix and the concatenated template features are input into the matrix multiplication layer 1042 and multiplied to obtain the space-time matching feature map output by that layer, of size HW × C. The space-time matching feature map and the current feature map are input into the second concatenation layer 1043; the space-time matching feature map is reshaped to H × W × C and then concatenated with the current feature map along the channel direction to obtain the space-time information feature map, of size H × W × 2C.
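For concreteness, a sketch of this tensor flow follows. The softmax normalization of the similarity matrix is our assumption; the text above specifies only the matrix multiplications and the concatenation:

```python
import torch

def space_time_fusion(cur_feat, templates):
    # cur_feat: (H, W, C) current feature map
    # templates: (T, K, C) template features of T prior images, K per set
    H, W, C = cur_feat.shape
    q = cur_feat.reshape(H * W, C)               # current feature map, HW x C
    t = templates.reshape(-1, C)                 # concatenated templates, TK x C
    sim = q @ t.t()                              # similarity matrix, HW x TK
    attn = sim.softmax(dim=-1)                   # similarities used as weights
    match = (attn @ t).reshape(H, W, C)          # space-time matching feature map
    return torch.cat([cur_feat, match], dim=-1)  # space-time info map, H x W x 2C
```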
According to some embodiments, as shown in fig. 3, the neural network may further include a sixth sub-network 306, located between the fourth sub-network 304 and the fifth sub-network 305, configured to receive the space-time information feature map and the target segmentation result 307 of the previous image to generate a feature map to be processed. In this case, the fifth sub-network 305 may be configured to receive the feature map to be processed to output the predicted target segmentation result of the current image. The other sub-networks 301-305 may be configured similarly to the sub-networks 101-105 of fig. 1, respectively. By constraining the space-time information feature map based on the target segmentation result of the prior image, spatial consistency between the prior image and the current image can be ensured, so that appearance confusion between different instances of the same class of object can be eliminated.
According to some embodiments, as shown in fig. 4, the sixth sub-network 306 may comprise: a first concatenation layer 3061 configured to receive the space-time information feature map 3065 and a target segmentation result 3064 of the previous image, and concatenate the space-time information feature map and the target segmentation result into a first spatial constraint feature map; at least one convolutional layer 3062 after the first stitching layer 3061, configured to receive the first spatial constraint feature map to generate a second spatial constraint feature map; and a point multiplication layer 3063 after the at least one convolution layer 3062, configured to receive the space-time information feature map 3065 and the second spatial constraint feature map, and perform point multiplication on the space-time information feature map and the second spatial constraint feature map to generate a feature map to be processed. Therefore, the space-time information characteristic graph is restrained by using the target segmentation result of the previous image, so that the target segmentation result of the previous image can provide guidance for estimating the target of the current image during prediction of the neural network, and the prediction accuracy of the neural network is improved.
In one exemplary embodiment, when a plurality of previous images are used for prediction, a previous image having the shortest photographing time interval from the current image may be used as an input of the first mosaic layer 3061.
According to some embodiments, when the width and height of the mask representing the segmentation result of the previous image are different from those of the space-time information feature map, the mask can be scaled and then spliced with the space-time information feature map.
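The following is a minimal sketch of the sixth sub-network as described above, including the mask scaling; the sigmoid gate and the choice of two 3×3 convolutions are our assumptions, since the patent specifies only "at least one convolutional layer" followed by a point multiplication:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SixthSubNetwork(nn.Module):
    def __init__(self, channels, mask_channels=1):
        super().__init__()
        self.convs = nn.Sequential(  # "at least one convolutional layer" 3062
            nn.Conv2d(channels + mask_channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, st_feat, prev_mask):
        # st_feat: (B, C, H, W) space-time information feature map
        # prev_mask: (B, mask_channels, h, w) segmentation result of the prior image
        if prev_mask.shape[-2:] != st_feat.shape[-2:]:
            # Scale the mask to the feature map size before concatenation.
            prev_mask = F.interpolate(prev_mask, size=st_feat.shape[-2:],
                                      mode='bilinear', align_corners=False)
        x = torch.cat([st_feat, prev_mask], dim=1)  # first spatial constraint map
        gate = torch.sigmoid(self.convs(x))         # second spatial constraint map
        return st_feat * gate                       # feature map to be processed
```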
For example, the fifth sub-network 105 may be a convolutional neural network, a residual neural network, or a self-constructed network that outputs the predicted target segmentation result of the current image from the input space-time information feature map; this is not limited herein.
According to another aspect of the present disclosure, a method of target segmentation using a neural network is provided, the neural network including a first sub-network, a second sub-network, a fourth sub-network, and a fifth sub-network connected in sequence, and a third sub-network parallel to the first sub-network. As shown in fig. 5, the method of object segmentation may include: step S501, processing a prior image with the first sub-network, wherein the first sub-network is configured to receive the prior image to generate a prior feature map of the prior image; step S502, processing the prior feature map and a target segmentation result of the prior image with the second sub-network, wherein the second sub-network is configured to receive the prior feature map and the target segmentation result of the prior image to generate at least one set of template features of the prior image; step S503, processing the current image with the third sub-network, wherein the third sub-network is configured to receive the current image to generate a current feature map of the current image; step S504, processing the current feature map and at least one set of template features of the previous image with the fourth sub-network, wherein the fourth sub-network is configured to receive the current feature map and the at least one set of template features of the previous image to generate a space-time information feature map; and step S505, processing the space-time information feature map with the fifth sub-network, wherein the fifth sub-network is configured to receive the space-time information feature map to generate a predicted target segmentation result of the current image. In this way, the feature map and the target segmentation result of the prior image are input into the second sub-network to obtain the template features of the prior image, and at least one set of template features of the prior image, together with the feature map of the current image, is input into the fourth sub-network to obtain the space-time information feature map. The neural network can thus obtain the target segmentation result of the current image based on the similarity information between the feature map of the current image and each set of template features of the prior image. Meanwhile, because the template features of the prior image, rather than its feature map, are used as the input of the fourth sub-network, the neural network's demands on computing resources and video memory during prediction are greatly reduced, which significantly improves its processing speed; and because fewer features are used, the generalization capability of the neural network is improved.
According to some embodiments, the previous image and the current image may be different video frames in the same video, and the previous image may be taken earlier in time than the current image. Therefore, the target segmentation result of the current image is predicted by combining the characteristic information of the current image with the characteristic information of the previous image with the shooting time being earlier than that of the current image, so that the neural network can estimate the information such as the position and the size of the target in the current image based on the time sequence relation between the previous image and the current image, and the prediction capability of the neural network is improved. It is understood that a plurality of previous images may be used, and the feature information of each of the plurality of previous images and the respective target segmentation result may be used in combination with the feature information of the current image to predict the target segmentation result of the current image.
According to some embodiments, the class of each pixel in the previous image and the current image can characterize whether the pixel belongs to one of the at least one object included in the previous image. When there are a plurality of previous images, the plurality of previous images may include a reference image; when there is only one previous image, that previous image may serve as the reference image. In this case, the class of each pixel in the previous image, the reference image, or the current image can characterize whether the pixel belongs to one of the at least one object included in the reference image. Thus, by determining at least one object in the reference image, the neural network is enabled to predict the segmentation result of those objects in the current image and to focus more on the feature information related to those objects during prediction. In a preferred exemplary embodiment, the target segmentation result of the reference image is manually annotated.
It is understood that the current image does not necessarily include all of the objects included in the segmentation result of the reference image. Illustratively, if the reference image includes a bicycle and a person, then only the bicycle, only the person, or neither may appear in the current image, and the present invention is not limited thereto. When there are a plurality of previous images, the previous images other than the reference image do not necessarily include all the targets included in the segmentation result of the reference image.
According to some embodiments, the reference image may be a previous image whose shooting time is earliest. For example, the neural network may predict a target segmentation result of a second frame based on a first frame, i.e., a reference image, of the video, and then predict a target segmentation result of a third frame based on the first frame and the second frame, and so on until target segmentation results of all video frames in the video are obtained. Illustratively, the previous image except the reference image is a video frame which has obtained the target segmentation result by using a neural network, and the current image is a video frame which needs to execute the target segmentation task.
According to some embodiments, the at least one set of template features of the previous image may have a one-to-one correspondence with the at least one object comprised by the reference image, and wherein each set of template features of the at least one set of template features of the previous image comprises one or more template features. Thus, the neural network is able to compute a set of template features corresponding to each target in the previous image.
According to some embodiments, each set of template features of the at least one set of template features of the previous image may be obtained by performing a clustering algorithm on the regions of the previous feature map of the previous image to which the object corresponding to that set of template features is mapped. By using the clustering algorithm, each set of template features can capture the most typical features of the corresponding target, so that the information in the original feature map is retained as much as possible while the video memory consumption and the computing resource consumption of the neural network are greatly reduced.
According to some embodiments, the fourth sub-network comprises a similarity calculation layer, a matrix multiplication layer, and a second concatenation layer connected in sequence. As shown in fig. 6, the processing of the current feature map and the at least one set of template features of the previous image by the fourth sub-network at step S504 may include: step S5041, processing the current feature map and at least one set of template features of the previous image with the similarity calculation layer, wherein the similarity calculation layer is configured to receive the current feature map and the at least one set of template features of the previous image to generate a similarity calculation result between the current feature map and the at least one set of template features of the previous image; step S5042, processing the similarity calculation result and at least one set of template features of the previous image with the matrix multiplication layer, wherein the matrix multiplication layer is configured to receive the similarity calculation result and the at least one set of template features of the previous image to generate a space-time matching feature map; and step S5043, processing the current feature map and the space-time matching feature map with the second concatenation layer, wherein the second concatenation layer is configured to receive the current feature map and the space-time matching feature map and concatenate them into the space-time information feature map. By calculating the similarity between the features of the current image and all template features of all previous images, and using the similarity information as weights to combine those template features into a space-time matching feature map, an attention mechanism over both time and space is realized, which can improve the accuracy of the neural network's prediction. Concatenating the space-time matching feature map with the current feature map then allows the neural network to predict based on both the matching result and the features of the current image.
According to some embodiments, the neural network further comprises a sixth sub-network located between the fourth sub-network and the fifth sub-network. As shown in fig. 7, the method of object segmentation may further include: step S705, processing the space-time information feature map and the target segmentation result of the previous image with the sixth sub-network, wherein the sixth sub-network is configured to receive the space-time information feature map and the target segmentation result of the previous image to generate a feature map to be processed; and step S706, processing the feature map to be processed with the fifth sub-network, wherein the fifth sub-network is configured to receive the feature map to be processed to output the predicted target segmentation result of the current image. Steps S701 to S704 in fig. 7 are similar to steps S501 to S504 in fig. 5 and are not repeated here. By constraining the space-time information feature map based on the target segmentation result of the prior image, spatial consistency between the prior image and the current image can be ensured, so that appearance confusion between different instances of the same class of object can be eliminated.
According to some embodiments, the sixth sub-network comprises a first splice layer, at least one convolutional layer, and a dot-multiplied layer, which are sequentially connected. As shown in fig. 8, the step S705 of processing the space-time information feature map and the target segmentation result of the previous image by using the sixth sub-network may further include: step S7051, processing a space-time information characteristic diagram and a target segmentation result of a previous image by using a first splicing layer, wherein the first splicing layer is configured to receive the space-time information characteristic diagram and the target segmentation result of the previous image, and splice the space-time information characteristic diagram and the target segmentation result into a first space constraint characteristic diagram; step S7052, processing the first spatially-constrained feature map by using at least one convolutional layer, wherein the at least one convolutional layer is configured to receive the first spatially-constrained feature map to generate a second spatially-constrained feature map; and step S7053, processing the space-time information characteristic diagram and the second spatial constraint characteristic diagram by using a point multiplication layer, wherein the point multiplication layer is configured to receive the space-time information characteristic diagram and the second spatial constraint characteristic diagram, and perform point multiplication on the space-time information characteristic diagram and the second spatial constraint characteristic diagram to generate the feature diagram to be processed. Therefore, the space-time information characteristic graph is restrained by using the target segmentation result of the previous image, so that the target segmentation result of the previous image can provide guidance for estimating the target of the current image during prediction of the neural network, and the prediction accuracy of the neural network is improved.
In one exemplary embodiment, when prediction is performed using a plurality of previous images, step S7051 may use a previous image having the shortest capturing time interval from the current image.
According to an embodiment of the present disclosure, there is also provided an electronic device, a readable storage medium, and a computer program product.
A block diagram of an electronic device 900, which may be a server or a client of the present disclosure and which is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described with reference to fig. 9. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a read-only memory (ROM) 902 or a computer program loaded from a storage unit 908 into a random access memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906, an output unit 907, a storage unit 908, and a communication unit 909. The input unit 906 may be any type of device capable of inputting information to the device 900; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote control. The output unit 907 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 908 may include, but is not limited to, a magnetic disk or an optical disk. The communication unit 909 allows the device 900 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers, and/or chipsets, such as Bluetooth(TM) devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 901 performs the respective methods and processes described above, such as the method of object segmentation. For example, in some embodiments, the method of object segmentation may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 900 via ROM 902 and/or communications unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the method of object segmentation described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured by any other suitable means (e.g., by means of firmware) to perform the method of object segmentation.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order; no limitation is imposed herein, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems, and devices are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but only by the claims as issued and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, the various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced with equivalent elements that appear after the present disclosure.

Claims (19)

1. A neural network configured to receive a current image, a previous image, and a target segmentation result of the previous image to predict a target segmentation result of the current image, the target segmentation result indicating a class of each pixel in a corresponding image, the neural network comprising:
a first sub-network configured to receive the previous image to generate a previous feature map of the previous image;
a second sub-network subsequent to the first sub-network, the second sub-network configured to receive the previous feature map of the previous image and the target segmentation result of the previous image to generate at least one set of template features of the previous image;
a third sub-network juxtaposed with the first sub-network, the third sub-network configured to receive the current image to generate a current feature map of the current image;
a fourth sub-network subsequent to the second and third sub-networks, the fourth sub-network configured to receive the current feature map and the at least one set of template features of the previous image to generate a space-time information feature map; and
a fifth sub-network subsequent to the fourth sub-network, the fifth sub-network configured to receive the space-time information feature map to generate a predicted target segmentation result for the current image.
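By way of illustration only, the five-sub-network topology recited in claim 1 can be sketched in PyTorch as follows. The backbone depth, channel counts, two-class head, and the pooling stand-in for template extraction are assumptions of this sketch, not features taken from the claims:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SegmentationNetwork(nn.Module):
        """Minimal sketch of the claim-1 topology (shapes and modules assumed)."""

        def __init__(self, channels=64, num_templates=4):
            super().__init__()
            # First and third sub-networks: feature extractors for the previous
            # and current frames (single conv blocks stand in for real backbones).
            self.first = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
            self.third = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
            self.num_templates = num_templates
            # Fifth sub-network: per-pixel class head over the space-time map.
            self.fifth = nn.Conv2d(2 * channels, 2, 1)  # 2 classes: object/background

        def second(self, prev_feat, prev_mask):
            # Second sub-network: pool masked features of the previous frame into
            # a small set of template vectors (H*W assumed divisible by K).
            b, c, h, w = prev_feat.shape
            mask = F.interpolate(prev_mask, size=(h, w), mode="nearest")
            feats = (prev_feat * mask).flatten(2)                        # (B, C, H*W)
            return feats.reshape(b, c, self.num_templates, -1).mean(-1)  # (B, C, K)

        def fourth(self, cur_feat, templates):
            # Fourth sub-network: similarity -> matrix multiply -> concatenation.
            b, c, h, w = cur_feat.shape
            q = cur_feat.flatten(2).transpose(1, 2)                # (B, H*W, C)
            sim = torch.softmax(q @ templates, dim=-1)             # (B, H*W, K)
            matched = (sim @ templates.transpose(1, 2)).transpose(1, 2)
            return torch.cat([cur_feat, matched.reshape(b, c, h, w)], dim=1)

        def forward(self, cur_img, prev_img, prev_mask):
            prev_feat = self.first(prev_img)                # previous feature map
            templates = self.second(prev_feat, prev_mask)   # template features
            cur_feat = self.third(cur_img)                  # current feature map
            st_map = self.fourth(cur_feat, templates)       # space-time info map
            return self.fifth(st_map)                       # per-pixel class logits

Here the first and third sub-networks are deliberately parallel branches over the previous and current frames, matching the "juxtaposed" relationship recited in claim 1.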
2. The neural network of claim 1, wherein the class of each pixel is capable of characterizing whether the pixel belongs to one of at least one object included in the previous image.
3. The neural network of claim 1, further comprising:
a sixth sub-network located between the fourth sub-network and the fifth sub-network, the sixth sub-network configured to receive the space-time information feature map and the target segmentation result of the previous image to generate a feature map to be processed; and
wherein the fifth sub-network is configured to receive the feature map to be processed to output the predicted target segmentation result for the current image.
4. The neural network of claim 2, wherein the at least one set of template features of the previous image has a one-to-one correspondence with the at least one object included in the previous image, and wherein each set of template features of the at least one set of template features of the previous image includes one or more template features.
5. The neural network of claim 4, wherein each set of the at least one set of template features of the previous image is obtained by performing a clustering algorithm on the region of the previous feature map onto which the object corresponding to that set of template features is mapped.
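The clustering algorithm of claim 5 is left unspecified. One plausible reading, sketched below, applies plain k-means to the feature vectors falling inside the region onto which the object is mapped; the function name, the choice of k-means, and the defaults k=4 and iters=10 are assumptions of this sketch:

    import torch

    def extract_template_features(feat, mask, k=4, iters=10):
        # feat: (C, H, W) previous feature map; mask: (H, W) binary object region.
        vecs = feat.flatten(1).t()[mask.flatten().bool()]   # (N, C) in-region vectors
        centers = vecs[torch.randperm(len(vecs))[:k]]       # random initialization
        for _ in range(iters):                              # standard k-means updates
            assign = torch.cdist(vecs, centers).argmin(dim=1)
            centers = torch.stack([
                vecs[assign == j].mean(dim=0) if (assign == j).any() else centers[j]
                for j in range(k)
            ])
        return centers                                      # k template features, (k, C)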
6. The neural network of claim 3, wherein the sixth sub-network comprises:
the first splicing layer is configured to receive the space-time information characteristic diagram and a target segmentation result of the previous image, and splice the space-time information characteristic diagram and the target segmentation result into a first space constraint characteristic diagram;
at least one convolutional layer subsequent to the first concatenation layer, configured to receive the first spatial constraint feature map to generate a second spatial constraint feature map; and
a dot-multiplication layer subsequent to the at least one convolutional layer, configured to receive the space-time information feature map and the second spatial constraint feature map, and to perform dot multiplication on them to generate the feature map to be processed.
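Read as a module, claim 6 describes a concatenate-convolve-multiply pipeline. A hypothetical PyTorch rendering follows; the two-convolution stack and the sigmoid gate applied to the second spatial constraint feature map are assumptions of this sketch, and prev_mask is assumed to be pre-resized to the feature map's spatial size:

    import torch
    import torch.nn as nn

    class SpatialConstraintModule(nn.Module):
        # Sixth sub-network sketch: first concatenation layer -> convolutions ->
        # dot-multiplication layer. Channel counts and the sigmoid are assumed.
        def __init__(self, channels):
            super().__init__()
            self.convs = nn.Sequential(
                nn.Conv2d(channels + 1, channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid(),
            )

        def forward(self, st_map, prev_mask):
            # st_map: (B, C, H, W); prev_mask: (B, 1, H, W).
            first = torch.cat([st_map, prev_mask], dim=1)   # first spatial constraint map
            second = self.convs(first)                      # second spatial constraint map
            return st_map * second                          # feature map to be processed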
7. The neural network of claim 1, wherein the fourth sub-network comprises:
a similarity calculation layer configured to receive the current feature map and the at least one set of template features of the previous image to generate a similarity calculation result between the current feature map and the at least one set of template features of the previous image;
a matrix multiplication layer subsequent to the similarity calculation layer, configured to receive the similarity calculation result and the at least one set of template features of the previous image to generate a space-time matching feature map; and
a second concatenation layer subsequent to the matrix multiplication layer, configured to receive the current feature map and the space-time matching feature map and concatenate them into the space-time information feature map.
8. The neural network of claim 1, wherein the previous image and the current image are different video frames of the same video, and wherein the previous image was captured earlier than the current image.
9. A method of object segmentation using a neural network, the neural network comprising a first sub-network, a second sub-network, a fourth sub-network, and a fifth sub-network connected in sequence, and a third sub-network juxtaposed with the first sub-network, the method comprising:
processing a previous image with the first sub-network, wherein the first sub-network is configured to receive the previous image to generate a previous feature map of the previous image;
processing the previous feature map and a target segmentation result of the previous image with the second sub-network, wherein the second sub-network is configured to receive the previous feature map and the target segmentation result of the previous image to generate at least one set of template features of the previous image;
processing a current image with the third sub-network, wherein the third sub-network is configured to receive the current image to generate a current feature map of the current image;
processing the current feature map and the at least one set of template features of the previous image with the fourth sub-network, wherein the fourth sub-network is configured to receive the current feature map and the at least one set of template features of the previous image to generate a space-time information feature map; and
processing the space-time information feature map with the fifth sub-network, wherein the fifth sub-network is configured to receive the space-time information feature map to generate a predicted target segmentation result for the current image,
wherein a target segmentation result indicates the class of each pixel in the corresponding image.
10. The method of claim 9, wherein the class of each pixel is capable of characterizing whether the pixel belongs to one of at least one object included in the previous image.
11. The method of claim 9, wherein the neural network further comprises a sixth sub-network located between the fourth sub-network and the fifth sub-network, and wherein the method further comprises:
processing the space-time information feature map and the target segmentation result of the previous image with the sixth sub-network, wherein the sixth sub-network is configured to receive the space-time information feature map and the target segmentation result of the previous image to generate a feature map to be processed; and
processing the feature map to be processed with the fifth sub-network, wherein the fifth sub-network is configured to receive the feature map to be processed to output a predicted target segmentation result for the current image.
12. The method of claim 10, wherein the at least one set of template features of the previous image has a one-to-one correspondence with the at least one object included in the previous image, and wherein each set of template features of the at least one set of template features of the previous image includes one or more template features.
13. The method of claim 12, wherein each set of the at least one set of template features of the previous image is obtained by performing a clustering algorithm on the region of the previous feature map onto which the object corresponding to that set of template features is mapped.
14. The method of claim 11, wherein the sixth sub-network comprises a first concatenation layer, at least one convolutional layer, and a dot-multiplication layer connected in sequence,
wherein processing the space-time information feature map and the target segmentation result of the previous image with the sixth sub-network comprises:
processing the space-time information feature map and the target segmentation result of the previous image with the first concatenation layer, wherein the first concatenation layer is configured to receive the space-time information feature map and the target segmentation result of the previous image and concatenate them into a first spatial constraint feature map;
processing the first spatial constraint feature map with the at least one convolutional layer, wherein the at least one convolutional layer is configured to receive the first spatial constraint feature map to generate a second spatial constraint feature map; and
processing the space-time information feature map and the second spatial constraint feature map with the dot-multiplication layer, wherein the dot-multiplication layer is configured to receive the space-time information feature map and the second spatial constraint feature map and perform dot multiplication on them to generate the feature map to be processed.
15. The method of claim 9, wherein the fourth sub-network comprises a similarity calculation layer, a matrix multiplication layer, and a second concatenation layer connected in sequence,
wherein processing the current feature map and the at least one set of template features of the previous image with the fourth sub-network comprises:
processing the current feature map and the at least one set of template features of the previous image with the similarity calculation layer, wherein the similarity calculation layer is configured to receive the current feature map and the at least one set of template features of the previous image to generate a similarity calculation result between the current feature map and the at least one set of template features of the previous image;
processing the similarity calculation result and the at least one set of template features of the previous image with the matrix multiplication layer, wherein the matrix multiplication layer is configured to receive the similarity calculation result and the at least one set of template features of the previous image to generate a space-time matching feature map; and
processing the current feature map and the space-time matching feature map with the second concatenation layer, wherein the second concatenation layer is configured to receive the current feature map and the space-time matching feature map and concatenate them into the space-time information feature map.
16. The method of claim 9, wherein the previous image and the current image are different video frames of the same video, and wherein the previous image was captured earlier than the current image.
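Claims 9 and 16 together imply frame-by-frame propagation through a video: each predicted segmentation becomes the previous-image segmentation for the next frame. A hypothetical driver loop, reusing the SegmentationNetwork sketch given after claim 1:

    import torch

    def propagate_masks(net, frames, first_mask):
        # frames: list of (1, 3, H, W) tensors ordered by capture time;
        # first_mask: (1, 1, H, W) segmentation of frames[0].
        masks = [first_mask]
        with torch.no_grad():
            for prev_img, cur_img in zip(frames, frames[1:]):
                logits = net(cur_img, prev_img, masks[-1])           # per-pixel logits
                masks.append(logits.argmax(1, keepdim=True).float()) # next "previous" mask
        return masks  # one mask per frame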
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 9-16.
18. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 9-16.
19. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 9-16.
CN202110097767.0A 2021-01-25 2021-01-25 Method, apparatus, and medium for object segmentation using neural networks Pending CN112749707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110097767.0A CN112749707A (en) 2021-01-25 2021-01-25 Method, apparatus, and medium for object segmentation using neural networks


Publications (1)

Publication Number Publication Date
CN112749707A 2021-05-04

Family ID: 75653118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110097767.0A Pending CN112749707A (en) 2021-01-25 2021-01-25 Method, apparatus, and medium for object segmentation using neural networks

Country Status (1)

Country Link
CN (1) CN112749707A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160063359A1 (en) * 2014-08-29 2016-03-03 Google Inc. Processing images using deep neural networks
US20170220854A1 (en) * 2016-01-29 2017-08-03 Conduent Business Services, Llc Temporal fusion of multimodal data from multiple data acquisition systems to automatically recognize and classify an action
CN111553362A (en) * 2019-04-01 2020-08-18 上海卫莎网络科技有限公司 Video processing method, electronic equipment and computer readable storage medium
CN111598844A (en) * 2020-04-24 2020-08-28 理光软件研究所(北京)有限公司 Image segmentation method and device, electronic equipment and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI Changming; ZHANG Yong; LIU Zhiyong: "Disease Image Segmentation Algorithm Based on the Fusion of Local Linearity and Convolutional Networks", Computer Simulation (计算机仿真), No. 07 *
KE Yan; WANG Xilong; ZHENG Yuhui: "Image Processing with Convolutional Neural Networks", Electronic Technology & Software Engineering (电子技术与软件工程), No. 22 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113256499A (en) * 2021-07-01 2021-08-13 北京世纪好未来教育科技有限公司 Image splicing method, device and system
CN113256499B (en) * 2021-07-01 2021-10-08 北京世纪好未来教育科技有限公司 Image splicing method, device and system

Similar Documents

Publication Title
CN108230346B (en) Method and device for segmenting semantic features of image and electronic equipment
CN112784778B (en) Method, apparatus, device and medium for generating model and identifying age and sex
CN112561060B (en) Neural network training method and device, image recognition method and device and equipment
CN114494784A (en) Deep learning model training method, image processing method and object recognition method
CN113705628B (en) Determination method and device of pre-training model, electronic equipment and storage medium
EP4113376A1 (en) Image classification model training method and apparatus, computer device, and storage medium
CN114187459A (en) Training method and device of target detection model, electronic equipment and storage medium
CN113378712A (en) Training method of object detection model, image detection method and device thereof
EP4343616A1 (en) Image classification method, model training method, device, storage medium, and computer program
CN113379627A (en) Training method of image enhancement model and method for enhancing image
CN113705362A (en) Training method and device of image detection model, electronic equipment and storage medium
CN112650885A (en) Video classification method, device, equipment and medium
CN115631381A (en) Classification model training method, image classification device and electronic equipment
CN112966592A (en) Hand key point detection method, device, equipment and medium
CN112749707A (en) Method, apparatus, and medium for object segmentation using neural networks
CN113033774A (en) Method and device for training graph processing network model, electronic equipment and storage medium
CN114494782B (en) Image processing method, model training method, related device and electronic equipment
CN113642654B (en) Image feature fusion method and device, electronic equipment and storage medium
CN115273148A (en) Pedestrian re-recognition model training method and device, electronic equipment and storage medium
CN114330576A (en) Model processing method and device, and image recognition method and device
CN114120423A (en) Face image detection method and device, electronic equipment and computer readable medium
CN113887535A (en) Model training method, text recognition method, device, equipment and medium
CN113361575B (en) Model training method and device and electronic equipment
CN115861684B (en) Training method of image classification model, image classification method and device
CN114844889B (en) Video processing model updating method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination