CN114549535A - Image segmentation method, device, equipment, storage medium and product - Google Patents

Image segmentation method, device, equipment, storage medium and product

Info

Publication number
CN114549535A
Authority
CN
China
Prior art keywords
image
encoder
feature map
fused
frame image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210109064.XA
Other languages
Chinese (zh)
Inventor
Han Wenhua (韩文华)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210109064.XA priority Critical patent/CN114549535A/en
Publication of CN114549535A publication Critical patent/CN114549535A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Abstract

The present disclosure provides an image segmentation method, apparatus, device, storage medium and product, which relate to the field of artificial intelligence, and in particular to computer vision, image recognition and deep learning technologies. The specific implementation scheme is as follows: acquiring a current frame image, and acquiring a previous frame image of the current frame image and a mask image of the previous frame image; fusing the mask image of the previous frame image with the current frame image, and inputting the fused image into a first encoder; determining a difference image between the previous frame image and the current frame image, and inputting the difference image into a second encoder; determining a first feature map from the first encoder and a second feature map from the second encoder, and fusing the first feature map and the second feature map; and inputting the fused feature map into a decoder to obtain a mask image of the current frame image. The method introduces the difference between the previous frame image and the current frame image into the network, thereby further enriching the semantic information and improving the accuracy of the mask image of the current frame image.

Description

Image segmentation method, device, equipment, storage medium and product
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to computer vision, image recognition and deep learning techniques.
Background
With the development and application of artificial intelligence technologies, a strong demand for intelligent and automated techniques has emerged in more and more fields, one of which is short video. In the short video field, removing a specified target from a video, blurring the background and the like all depend on a video object segmentation algorithm. It can be understood that the development of video object segmentation methods is of great significance to the intelligent processing of short videos, special effects and the like. Portrait video segmentation is an important branch of video object segmentation, and many applications (Apps) have a great demand for video portrait segmentation algorithms. These applications also impose high requirements, especially on processing accuracy and speed.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, storage medium and product for image segmentation.
According to an aspect of the present disclosure, there is provided an image segmentation method including:
acquiring a current frame image, and acquiring a previous frame image of the current frame image and a mask image of the previous frame image; fusing the mask image of the previous frame image with the current frame image, and inputting the fused image into a first encoder; determining a difference image between the previous frame image and the current frame image, and inputting the difference image into a second encoder; performing feature extraction on the fused image in the first encoder to obtain a first feature map, performing feature extraction on the difference image in the second encoder to obtain a second feature map, and fusing the first feature map and the second feature map; and inputting the fused feature map into a decoder to obtain a mask image of the current frame image.
According to another aspect of the present disclosure, there is provided an image segmentation apparatus including:
an acquisition unit, configured to acquire a current frame image, and to acquire a previous frame image of the current frame image and a mask image of the previous frame image; a processing unit, configured to fuse the mask image of the previous frame image with the current frame image and input the fused image into a first encoder, and further configured to determine a difference image between the previous frame image and the current frame image and input the difference image into a second encoder; an encoding unit, configured to perform feature extraction on the fused image in the first encoder to obtain a first feature map, perform feature extraction on the difference image in the second encoder to obtain a second feature map, and fuse the first feature map and the second feature map; and a decoding unit, configured to input the fused feature map into a decoder to obtain a mask image of the current frame image.
According to still another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method.
According to yet another aspect of the disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flow diagram of an image segmentation method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart for determining a first feature map and a second feature map according to an embodiment of the present disclosure;
FIG. 3 is another schematic flowchart for determining a first feature map and a second feature map according to an embodiment of the present disclosure;
FIG. 4 is a schematic flow chart illustrating a process for determining a mask image for a current frame image according to an embodiment of the present disclosure;
FIG. 5 is a detailed flow diagram of image segmentation according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a current frame image according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a mask image of a current frame image according to an embodiment of the present disclosure;
FIG. 8 is a block diagram illustrating an image segmentation apparatus according to an exemplary embodiment;
FIG. 9 is a block diagram of an electronic device for implementing an image segmentation method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The present disclosure applies to scenes in which an object is segmented in a short video. In the related art, in order to meet the speed requirement, a lightweight encoder-decoder network (such as a DeepLabv3+ variant with a mobile backbone) is generally selected as the target segmentation network; its inputs are the current frame image and the mask image (denoted mask) of the previous frame image. The encoder downsamples the input to 1/16 of its size; the decoder directly restores the result to 1/4 size, fuses it with the 1/4-sized feature map from the encoder, and then restores it to the input size to obtain the mask image of the current frame image. However, with this related-art video target segmentation method, if the target object undergoes a large displacement between two adjacent frame images acquired at the preset interval, the mask image of the current frame image obtained from the mask image of the previous frame image may be inaccurate. It can be understood that, because only the current frame image and the mask image of the previous frame image are input, an inaccurate mask image of the previous frame image misleads the segmentation of the current frame image to some extent, resulting in erroneous segmentation.
In order to solve the above problem, the present disclosure provides an image segmentation method. On the basis of inputting the mask images of the current frame image and the previous frame image in the encoder, a path parallel to the encoder is added, and the difference between the previous frame image and the current frame image is introduced into a network, so that semantic information is further enriched, and the accuracy of the mask image of the current frame image is improved.
It should be noted that the image segmentation method provided by the present disclosure is suitable not only for segmenting a human figure in an image, but also for segmenting other specified objects, such as animals and vehicles. The image segmentation process comprises determining an object to be segmented, and segmenting the object to be segmented from the current frame image to obtain a mask image of the current frame image. The mask image of the current frame image is also referred to as the prediction result.
For convenience of description, the following embodiments refer to one of the two encoders as a first encoder and the other of the two encoders as a second encoder. The first encoder is used for processing the image formed by fusing the mask image of the previous frame image and the current frame image. The second encoder is used for processing a difference image between the previous frame image and the current frame image. The feature map obtained by down-sampling in the first encoder is referred to as a first intermediate feature map. The feature map obtained by down-sampling in the second encoder is referred to as a second intermediate feature map.
The following embodiments of the present disclosure will explain an image segmentation method provided by the present disclosure with reference to the accompanying drawings.
FIG. 1 is a schematic flow diagram of an image segmentation method according to an embodiment of the present disclosure; as shown in fig. 1, an image segmentation method in the embodiment of the present disclosure includes the following steps.
In step S101, a current frame image is acquired, and a previous frame image of the current frame image and a mask image of the previous frame image are acquired.
When the present method is used to process frame images of a short video, the current frame image is acquired at a preset time interval. If the current frame image is the first frame acquired, its mask image is predicted directly from the current frame image alone. Otherwise, the current frame image is acquired together with the previous frame image and the mask image of the previous frame image, as sketched below.
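An illustrative sketch of this per-frame flow in Python (here `model` and `first_frame_predictor` are hypothetical callables standing in for the network of this disclosure and for the first-frame prediction, respectively):

```python
def segment_video(frames, model, first_frame_predictor):
    """Predict a mask image for each frame of a video, frame by frame."""
    masks = []
    prev_frame, prev_mask = None, None
    for frame in frames:
        if prev_mask is None:
            # First acquisition: predict the mask directly from the frame alone.
            mask = first_frame_predictor(frame)
        else:
            # Later frames: also use the previous frame and its mask image.
            mask = model(frame, prev_frame, prev_mask)
        masks.append(mask)
        prev_frame, prev_mask = frame, mask
    return masks
```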
In step S102, the mask image of the previous frame image and the current frame image are fused, and the fused image is input to the first encoder.
In step S103, a difference image between the previous frame image and the current frame image is determined, and the difference image is input to the second encoder.
The difference image is obtained by subtracting the current frame image from the previous frame image. By determining the difference image between the previous frame image and the current frame image, it can be determined whether the target object has undergone a large displacement between the two adjacent frame images: the difference image represents the displacement of the target object between the two adjacent frames. Taking the difference image into account in the subsequent segmentation improves the accuracy of the predicted mask image.
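A minimal sketch of preparing the two encoder inputs, assuming PyTorch tensors in NCHW layout and channel-wise concatenation as the fusion mode (the disclosure also allows addition); the function and tensor names are illustrative:

```python
import torch

def prepare_encoder_inputs(prev_frame, cur_frame, prev_mask):
    # prev_frame, cur_frame: (N, 3, H, W) frames; prev_mask: (N, 1, H, W).
    # First encoder input: the mask image of the previous frame fused
    # (concatenated along the channel axis) with the current frame image.
    fused_image = torch.cat([cur_frame, prev_mask], dim=1)  # (N, 4, H, W)
    # Second encoder input: the difference image between adjacent frames,
    # which reflects how far the target object has moved.
    difference_image = prev_frame - cur_frame               # (N, 3, H, W)
    return fused_image, difference_image
```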
In step S104, feature extraction is performed on the fused image in the first encoder to obtain a first feature map, feature extraction is performed on the difference image in the second encoder to obtain a second feature map, and the first feature map and the second feature map are fused.
The present disclosure extracts features of the object to be segmented by downsampling. Downsampling takes every few samples of a sequence, so that the resulting new sequence is a downsampled version of the original; it is also known as decimation and is one of the basic operations of multi-rate signal processing.
In the first encoder, in order to acquire features of different sizes of the fused image, the present disclosure performs downsampling on the fused image a plurality of times to obtain a plurality of first intermediate feature maps of different sizes. The first intermediate feature map having the smallest size in the first encoder is set as the first feature map.
In the second encoder, the difference image is downsampled multiple times to obtain a plurality of second intermediate feature maps of different sizes. The second intermediate feature map with the smallest size in the second encoder is taken as the second feature map. The first feature map and the second feature map are then fused.
Note that the sizes in the embodiments of the present disclosure ignore the batch size (batch_size) and the channel dimension.
In step S105, the fused feature map is input to a decoder, and a mask image of the current frame image is obtained.
The fused feature map is upsampled in the decoder so that the resulting mask image of the current frame image has the same size as the input current frame image. Upsampling, also referred to as image interpolation, is essentially an enlargement of the fused feature map to the size of the input current frame image. For example, the fused feature map may be upsampled by interpolation, i.e., a suitable interpolation algorithm inserts new elements between the existing pixels of the fused feature map.
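A sketch of this interpolation-based restoration, assuming PyTorch's F.interpolate with bilinear interpolation (the disclosure only requires some suitable interpolation algorithm):

```python
import torch.nn.functional as F

def restore_to_input_size(fused_feature_map, current_frame):
    # Enlarge the fused feature map so that the resulting mask image has
    # the same spatial size (H, W) as the input current frame image.
    return F.interpolate(fused_feature_map,
                         size=current_frame.shape[-2:],
                         mode="bilinear", align_corners=False)
```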
According to the image segmentation method, the mask image of the previous frame image is considered, the difference between the previous frame image and the current frame image is also considered, and the accuracy of predicting the mask image of the current frame image can be improved for a scene with large movement between adjacent frame images.
In the embodiment of the present disclosure, by fusing the first intermediate feature map in the first encoder with the second intermediate feature map having the same size in the second encoder, more semantic information is retained in the case of reducing the amount of computation.
FIG. 2 is a schematic flow chart for determining a first feature map and a second feature map according to an embodiment of the present disclosure; as shown in fig. 2, the embodiment of the present disclosure performs feature extraction on the fused image in the first encoder to obtain a first feature map, and performs feature extraction on the difference image in the second encoder to obtain a second feature map, including the following steps.
In step S201, the first encoder down-samples the fused image to obtain a plurality of first intermediate feature maps of different sizes.
In order to acquire features of the fused image at different sizes, the fused image is downsampled to generate a first intermediate feature map A, whose size is 1/2 that of the fused image. The first intermediate feature map A is downsampled to generate a first intermediate feature map B, whose size is 1/4 that of the fused image. The first intermediate feature map B is then downsampled to generate a first intermediate feature map C, whose size is 1/8 that of the fused image. This continues in turn until the downsampling of the fused image is complete, yielding a plurality of first intermediate feature maps of different sizes: first intermediate feature map A, first intermediate feature map B, first intermediate feature map C, and so on.
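A sketch of this repeated 2x downsampling, assuming strided convolutions as the downsampling operator (the disclosure does not fix the operator; strided pooling would serve equally well) and illustrative channel widths:

```python
import torch.nn as nn

class FirstEncoderSketch(nn.Module):
    """Yields first intermediate feature maps A (1/2), B (1/4), C (1/8), ..."""

    def __init__(self, in_channels=4, width=32, num_stages=4):
        super().__init__()
        channels = [in_channels] + [width * 2 ** i for i in range(num_stages)]
        # Each stage halves the spatial size of its input (stride 2).
        self.stages = nn.ModuleList(
            nn.Conv2d(channels[i], channels[i + 1], 3, stride=2, padding=1)
            for i in range(num_stages))

    def forward(self, fused_image):
        feature_maps = []
        x = fused_image
        for stage in self.stages:
            x = stage(x)            # 1/2, then 1/4, then 1/8, ...
            feature_maps.append(x)
        return feature_maps
```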
In step S202, the first intermediate feature map having the smallest size in the first encoder is set as the first feature map.
In the above example, if the first intermediate feature map having the smallest size in the first encoder is 1/16 of the fused image, the first intermediate feature map having the size of 1/16 of the fused image is used as the first feature map.
In step S203, the difference image is down-sampled in the second encoder to obtain a plurality of second intermediate feature maps of different sizes.
In order to acquire features of the difference image at different sizes, the difference image is first downsampled to generate a second intermediate feature map A, whose size is 1/4 that of the difference image. The second intermediate feature map A is then downsampled to generate a second intermediate feature map B, whose size is 1/16 that of the difference image. In this way, the downsampling of the difference image is completed, yielding a plurality of second intermediate feature maps of different sizes: second intermediate feature map A, second intermediate feature map B, and so on.
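The corresponding sketch for the second encoder, assuming a stride of 4 per stage so that each downsampling shrinks the difference image by a factor of 4 (again, the operator and channel widths are illustrative assumptions):

```python
import torch.nn as nn

class SecondEncoderSketch(nn.Module):
    """Yields second intermediate feature maps A (1/4) and B (1/16)."""

    def __init__(self, in_channels=3, width=64):
        super().__init__()
        self.stage_a = nn.Conv2d(in_channels, width, 3, stride=4, padding=1)
        self.stage_b = nn.Conv2d(width, width * 2, 3, stride=4, padding=1)

    def forward(self, difference_image):
        a = self.stage_a(difference_image)  # 1/4 the size of the difference image
        b = self.stage_b(a)                 # 1/16 the size of the difference image
        return a, b
```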
In step S204, the second intermediate feature map having the smallest size in the second encoder is set as the second feature map.
In the above example, if the second intermediate feature map having the smallest size in the second encoder is 1/16 the size of the difference image, that feature map is used as the second feature map.
In an embodiment of the disclosure, the downsampling factor of the first encoder differs from that of the second encoder. For example, for a fused image or difference image of size M × N, s-fold downsampling yields an image of resolution (M/s) × (N/s), where s is a common divisor of M and N; 4-fold downsampling of a 512 × 512 difference image, say, yields a 128 × 128 feature map. In one embodiment, the downsampling factor of the first encoder may be 2 and that of the second encoder may be 4.
In the embodiment of the disclosure, the fused image and the difference image are respectively downsampled to obtain the first feature map and the second feature map, the first feature map and the second feature map are fused, and the second feature map representing the features of the difference image is comprehensively considered when the mask image of the current frame image is predicted, so that the accuracy of predicting the mask image of the current frame image is improved, and the error segmentation is avoided.
FIG. 3 is another schematic flowchart for determining a first feature map and a second feature map according to an embodiment of the present disclosure; as shown in fig. 3, the embodiment of the present disclosure performs feature extraction on the fused image in the first encoder to obtain a first feature map, and performs feature extraction on the difference image in the second encoder to obtain a second feature map, including the following steps.
In step S301, the first encoder down-samples the fused image m times to obtain a plurality of first intermediate feature maps of different sizes.
In an embodiment of the disclosure, the downsampling factor of the first encoder differs from that of the second encoder. In one embodiment, the two factors are in an integer-multiple relationship. For example, suppose a first intermediate feature map of 1/4 the size of the fused image is needed: with the first encoder's downsampling factor of 2, two downsampling operations are required to obtain it, whereas with the second encoder's factor of 4, a single downsampling suffices.
Suppose the first encoder needs to perform downsampling n times in total to obtain all the first intermediate feature maps. In the embodiment of the present disclosure, in order to retain more of the edge and position features of the difference image, after m downsampling operations in the first encoder, the first intermediate feature map is fused with the second intermediate feature map of the same size from the second encoder to obtain a fused feature map. The fused feature map is then downsampled further to obtain the remaining first intermediate feature maps.
In step S302, the difference image is down-sampled in the second encoder to obtain a plurality of second intermediate feature maps of different sizes.
In step S303, the first intermediate feature map and the second intermediate feature map are fused.
The first intermediate feature map and the second intermediate feature map being fused have the same size. For example, the second intermediate feature map at 1/4 the size of the difference image is fused with the first intermediate feature map at 1/4 the size of the fused image. Note that since the current frame image, the previous frame image and the mask image of the previous frame image are all the same size, these two 1/4-scale maps are also the same size. The fusion may use concatenation (concat) or element-wise addition.
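A sketch of the two fusion modes just named; note that element-wise addition additionally assumes matching channel counts, a detail the disclosure leaves to the implementation:

```python
import torch

def fuse(a, b, mode="concat"):
    # a, b: feature maps of identical spatial size, e.g. both at 1/4 scale.
    if mode == "concat":
        return torch.cat([a, b], dim=1)  # splice along the channel axis
    if mode == "add":
        return a + b                     # requires identical channel counts
    raise ValueError(f"unknown fusion mode: {mode}")
```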
In step S304, the first encoder continues downsampling the fused intermediate feature map n-m times to obtain a plurality of first intermediate feature maps of different sizes.
m and n are positive integers, and m < n.
In step S305, a first intermediate feature map having the smallest size in the first encoder is set as a first feature map, and a second intermediate feature map having the smallest size in the second encoder is set as a second feature map.
In the embodiment of the disclosure, during the downsampling performed by the first encoder, the second intermediate feature map of the same size is fused with the first intermediate feature map before downsampling continues, so that more of the edge and position features of the difference image are retained, further improving segmentation accuracy.
On the basis of any one of the above embodiments, fig. 4 is a schematic flowchart of determining a mask image of a current frame image according to an embodiment of the present disclosure; as shown in fig. 4, the method for obtaining a mask image of a current frame image by inputting a fused feature map into a decoder in the embodiment of the present disclosure includes the following steps.
In step S401, first intermediate feature maps other than the first intermediate feature map having the smallest size in the first encoder are acquired.
In one embodiment, the first intermediate feature maps in the first encoder include first intermediate feature maps at 1/2, 1/4, 1/8 and 1/16 scale, the 1/16-scale map being the first intermediate feature map with the smallest size.
In one embodiment, the remaining first intermediate feature maps include: the first intermediate feature map obtained by fusing a first intermediate feature map with the second intermediate feature map of the same size; the smallest-size first intermediate feature map and the smallest-size second intermediate feature map are excluded. For example, the 1/4-scale second intermediate feature map is fused with the 1/4-scale first intermediate feature map, and the fused 1/4-scale first intermediate feature map is used as one of the remaining first intermediate feature maps. Feature maps obtained by fusing same-size first and second intermediate feature maps improve segmentation accuracy.
In step S402, the fused feature map is fused with the remaining first intermediate feature map to obtain a third feature map.
The fused feature map is fused with the 1/8-scale first intermediate feature map, then with the 1/4-scale first intermediate feature map, and then with the 1/2-scale first intermediate feature map, yielding the third feature map. The fusion here may use concatenation (concat). The 1/4-scale first intermediate feature map here may be the feature map obtained by fusing the 1/4-scale second intermediate feature map with the 1/4-scale first intermediate feature map.
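A hedged sketch of this progressive decoder-side fusion, assuming bilinear upsampling between steps and concatenation as the fusion mode (the projection convolutions a real decoder would interleave between steps are omitted for brevity):

```python
import torch
import torch.nn.functional as F

def decode_to_third_feature_map(fused_1_16, skip_1_8, skip_1_4, skip_1_2):
    # skip_1_4 may itself be the 1/4-scale map already fused with the
    # second encoder's 1/4-scale map, as noted above.
    x = fused_1_16
    for skip in (skip_1_8, skip_1_4, skip_1_2):
        # Double the resolution, then fuse with the same-size encoder map.
        x = F.interpolate(x, scale_factor=2, mode="bilinear",
                          align_corners=False)
        x = torch.cat([x, skip], dim=1)
    # x is now the third feature map at 1/2 scale; one further upsampling
    # (step S403) yields the mask image at the input resolution.
    return x
```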
In step S403, the third feature map is up-sampled to obtain a mask image of the current frame image.
The third feature map is upsampled so that the mask image of the current frame image has the same size as the current frame image.
In the embodiment of the disclosure, the fused feature map and the first intermediate feature map are fused to obtain a third feature map, and the mask image of the current frame image is obtained through the third feature map, so that semantic information of the current frame image and difference information between adjacent frame images are retained, and the prediction accuracy is improved.
The network in the embodiments of the present disclosure may be a convolutional neural network model, a feed-forward (FF) neural network model, a recurrent neural network (RNN) model, a long short-term memory (LSTM) network model, or the like. The present disclosure realizes training and prediction of the network through the above embodiments.
FIG. 5 is a detailed flow diagram of image segmentation according to an embodiment of the present disclosure; as shown in fig. 5, the mask image of the previous frame image is fused with the current frame image, and the fused image is input into the first encoder. The previous frame image and the current frame image are subtracted to obtain a difference image, which is input into the second encoder. In the first encoder, the fused image is downsampled to obtain a 1/2-scale first intermediate feature map, which is downsampled again to obtain a 1/4-scale first intermediate feature map. The 1/4-scale second intermediate feature map is obtained, and feature fusion is performed between the 1/4-scale first intermediate feature map and the 1/4-scale second intermediate feature map. The fused feature map is downsampled to obtain a 1/8-scale first intermediate feature map, which is downsampled to obtain a 1/16-scale first intermediate feature map. Similarly, in the second encoder, the difference image is downsampled to obtain the 1/4-scale second intermediate feature map, which is downsampled again to obtain a 1/16-scale second intermediate feature map. The 1/16-scale first intermediate feature map and the 1/16-scale second intermediate feature map are fused. The resulting feature map is input into the decoder and fused in turn with the 1/8-scale, 1/4-scale and 1/2-scale first intermediate feature maps. Finally, the fused feature map is upsampled to obtain the mask image of the current frame image. In this process, the difference between the previous frame image and the current frame image is introduced into the network, further enriching semantic information and improving the accuracy of the mask image of the current frame image. The current frame image shown in fig. 6 is predicted by the method shown in fig. 5, yielding the mask image of the current frame image shown in fig. 7.
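Putting the pieces together, a compact end-to-end sketch of the Fig. 5 flow; the channel widths, the strided-conv downsampling, the sigmoid output and the assumption that H and W are divisible by 16 are all illustrative choices, not details fixed by the disclosure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def down(in_ch, out_ch, stride):
    # One downsampling stage; a strided conv + ReLU is assumed here.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
        nn.ReLU(inplace=True))

class DualEncoderSegmenterSketch(nn.Module):
    def __init__(self, c=32):
        super().__init__()
        # First encoder: mask + frame input (4 channels), x2 per stage.
        self.e1_half = down(4, c, 2)               # -> 1/2
        self.e1_quarter = down(c, 2 * c, 2)        # -> 1/4
        self.e1_eighth = down(4 * c, 4 * c, 2)     # -> 1/8 (input: the 1/4 fusion)
        self.e1_sixteenth = down(4 * c, 8 * c, 2)  # -> 1/16
        # Second encoder: difference-image input (3 channels), x4 per stage.
        self.e2_quarter = down(3, 2 * c, 4)        # -> 1/4
        self.e2_sixteenth = down(2 * c, 8 * c, 4)  # -> 1/16
        # 1x1 head projecting the decoded features to a one-channel mask.
        self.head = nn.Conv2d(25 * c, 1, kernel_size=1)

    def forward(self, cur_frame, prev_frame, prev_mask):
        fused_in = torch.cat([cur_frame, prev_mask], dim=1)
        diff = prev_frame - cur_frame
        f2 = self.e1_half(fused_in)                # 1/2
        f4 = self.e1_quarter(f2)                   # 1/4
        d4 = self.e2_quarter(diff)                 # 1/4
        f4 = torch.cat([f4, d4], dim=1)            # fuse at 1/4 scale
        f8 = self.e1_eighth(f4)                    # 1/8
        f16 = self.e1_sixteenth(f8)                # 1/16
        d16 = self.e2_sixteenth(d4)                # 1/16
        x = torch.cat([f16, d16], dim=1)           # fuse at 1/16 scale
        for skip in (f8, f4, f2):                  # decoder fusions, in turn
            x = F.interpolate(x, scale_factor=2, mode="bilinear",
                              align_corners=False)
            x = torch.cat([x, skip], dim=1)
        x = F.interpolate(x, scale_factor=2, mode="bilinear",
                          align_corners=False)     # back to the input size
        return torch.sigmoid(self.head(x))         # mask image of current frame
```

For a 512 x 512 input, for example, the fused 1/16-scale map is 32 x 32, and the four interpolation steps restore the full 512 x 512 resolution.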
Based on the same conception, the embodiment of the disclosure also provides an image segmentation device.
It is understood that the image segmentation apparatus provided by the embodiments of the present disclosure includes hardware structures and/or software modules for performing the respective functions described above. The embodiments of the present disclosure can be implemented in hardware or a combination of hardware and computer software, in conjunction with the exemplary units and algorithm steps disclosed herein. Whether a function is performed by hardware, or by computer software driving hardware, depends upon the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
Fig. 8 is a block diagram illustrating an image segmentation apparatus according to an exemplary embodiment. Referring to fig. 8, the apparatus 500 includes an acquisition unit 501, a processing unit 502, an encoding unit 503, and a decoding unit 504.
An obtaining unit 501, configured to obtain a current frame image, and to obtain a previous frame image of the current frame image and a mask image of the previous frame image; a processing unit 502, configured to fuse the mask image of the previous frame image with the current frame image and input the fused image into a first encoder, and further configured to determine a difference image between the previous frame image and the current frame image and input the difference image into a second encoder; an encoding unit 503, configured to perform feature extraction on the fused image in the first encoder to obtain a first feature map, perform feature extraction on the difference image in the second encoder to obtain a second feature map, and fuse the first feature map and the second feature map; and a decoding unit 504, configured to input the fused feature map into a decoder to obtain a mask image of the current frame image.
In one embodiment, the encoding unit 503 is further configured to: in the first encoder, downsample the fused image to obtain a plurality of first intermediate feature maps of different sizes; take the first intermediate feature map with the smallest size in the first encoder as the first feature map; in the second encoder, downsample the difference image to obtain a plurality of second intermediate feature maps of different sizes; and take the second intermediate feature map with the smallest size in the second encoder as the second feature map; wherein the downsampling factor of the first encoder is different from that of the second encoder.
In one embodiment, the encoding unit 503 is further configured to: in the first encoder, downsample the fused image m times to obtain a plurality of first intermediate feature maps of different sizes; in the second encoder, downsample the difference image to obtain a plurality of second intermediate feature maps of different sizes; fuse a first intermediate feature map with a second intermediate feature map of the same size; in the first encoder, continue downsampling the fused intermediate feature map n-m times to obtain a plurality of first intermediate feature maps of different sizes; and take the first intermediate feature map with the smallest size in the first encoder as the first feature map and the second intermediate feature map with the smallest size in the second encoder as the second feature map; wherein the downsampling factor of the first encoder differs from that of the second encoder, and m and n are positive integers.
In one embodiment, the decoding unit 504 is further configured to: acquire the remaining first intermediate feature maps other than the first intermediate feature map with the smallest size in the first encoder; fuse the fused feature map with the remaining first intermediate feature maps to obtain a third feature map; and upsample the third feature map to obtain a mask image of the current frame image.
In one embodiment, the remaining first intermediate feature maps include the first intermediate feature map obtained by fusing a first intermediate feature map with a second intermediate feature map of the same size, excluding the smallest-size first intermediate feature map and the smallest-size second intermediate feature map.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
In the technical solution of the present disclosure, the acquisition, storage and application of the personal information of the users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be any of various general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the methods and processes described above, such as the image segmentation method. For example, in some embodiments, the image segmentation method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the image segmentation method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the image segmentation method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (13)

1. An image segmentation method comprising:
acquiring a current frame image, and acquiring a previous frame image of the current frame image and a mask image of the previous frame image;
fusing the mask image of the previous frame image with the current frame image, and inputting the fused image into a first encoder;
determining a difference image between the previous frame image and the current frame image, and inputting the difference image into a second encoder;
performing feature extraction on the fused image in the first encoder to obtain a first feature map, performing feature extraction on the difference image in the second encoder to obtain a second feature map, and fusing the first feature map and the second feature map;
and inputting the fused feature map into a decoder to obtain a mask image of the current frame image.
2. The method according to claim 1, wherein the performing feature extraction on the fused image in the first encoder to obtain a first feature map, and performing feature extraction on the difference image in the second encoder to obtain a second feature map comprises:
in the first encoder, down-sampling the fused image to obtain a plurality of first intermediate feature maps with different sizes;
taking the first intermediate feature map with the smallest size in the first encoder as the first feature map;
in the second encoder, downsampling the difference image to obtain a plurality of second intermediate feature maps with different sizes;
taking the second intermediate feature map with the smallest size in the second encoder as the second feature map;
wherein a multiple of down-sampling in the first encoder is different from a multiple of down-sampling in the second encoder.
3. The method according to claim 1, wherein the performing feature extraction on the fused image in the first encoder to obtain a first feature map, and performing feature extraction on the difference image in the second encoder to obtain a second feature map comprises:
in the first encoder, performing downsampling on the fused image for m times to obtain a plurality of first intermediate feature maps with different sizes;
in the second encoder, downsampling the difference image to obtain a plurality of second intermediate feature maps with different sizes;
fusing the first intermediate feature map and the second intermediate feature map, wherein the fused first intermediate feature map and the fused second intermediate feature map have the same size;
in the first encoder, continuing to perform downsampling on the fused intermediate feature map for n-m times to obtain a plurality of first intermediate feature maps with different sizes;
taking the first intermediate feature map with the smallest size in the first encoder as the first feature map, and taking the second intermediate feature map with the smallest size in the second encoder as the second feature map;
wherein the multiple of down-sampling in the first encoder is different from the multiple of down-sampling in the second encoder, and m and n are positive integers, and m is smaller than n.
4. The method according to any one of claims 1-3, wherein the inputting the fused feature map into a decoder to obtain a mask image of the current frame image comprises:
acquiring the remaining first intermediate feature maps other than the first intermediate feature map with the smallest size in the first encoder;
fusing the fused feature map with the remaining first intermediate feature maps to obtain a third feature map;
and performing up-sampling on the third feature map to obtain a mask image of the current frame image.
5. The method of claim 4, wherein the remaining first intermediate feature maps comprise:
the first intermediate feature map obtained by fusing a first intermediate feature map with a second intermediate feature map, wherein the fused first intermediate feature map and the fused second intermediate feature map have the same size.
6. An image segmentation apparatus comprising:
an acquisition unit, configured to acquire a current frame image, and to acquire a previous frame image of the current frame image and a mask image of the previous frame image;
a processing unit, configured to fuse the mask image of the previous frame image with the current frame image and input the fused image into a first encoder, and further configured to determine a difference image between the previous frame image and the current frame image and input the difference image into a second encoder;
an encoding unit, configured to perform feature extraction on the fused image in the first encoder to obtain a first feature map, perform feature extraction on the difference image in the second encoder to obtain a second feature map, and fuse the first feature map and the second feature map;
and a decoding unit, configured to input the fused feature map into a decoder to obtain a mask image of the current frame image.
7. The apparatus of claim 6, wherein the encoding unit is further configured to:
in the first encoder, down-sampling the fused image to obtain a plurality of first intermediate feature maps with different sizes;
taking the first intermediate feature map with the smallest size in the first encoder as the first feature map;
in the second encoder, downsampling the difference image to obtain a plurality of second intermediate feature maps with different sizes;
taking the second intermediate feature map with the smallest size in the second encoder as the second feature map;
wherein a multiple of down-sampling in the first encoder is different from a multiple of down-sampling in the second encoder.
8. The apparatus of claim 6, wherein the encoding unit is further configured to:
in the first encoder, performing downsampling on the fused image for m times to obtain a plurality of first intermediate feature maps with different sizes;
in the second encoder, downsampling the difference image to obtain a plurality of second intermediate feature maps with different sizes;
fusing the first intermediate feature map and the second intermediate feature map, wherein the fused first intermediate feature map and the fused second intermediate feature map have the same size;
in the first encoder, continuing to perform downsampling on the fused intermediate feature map for n-m times to obtain a plurality of first intermediate feature maps with different sizes;
taking the first intermediate feature map with the smallest size in the first encoder as the first feature map, and taking the second intermediate feature map with the smallest size in the second encoder as the second feature map;
wherein the multiple of down-sampling in the first encoder is different from the multiple of down-sampling in the second encoder, and m and n are positive integers, and m is smaller than n.
9. The apparatus of any of claims 6-8, wherein the decoding unit is further configured to:
acquiring the remaining first intermediate feature maps other than the first intermediate feature map with the smallest size in the first encoder;
fusing the fused feature map with the remaining first intermediate feature maps to obtain a third feature map;
and performing up-sampling on the third feature map to obtain a mask image of the current frame image.
10. The apparatus of claim 9, wherein the remaining first intermediate feature maps comprise: the first intermediate feature map obtained by fusing a first intermediate feature map with a second intermediate feature map; wherein the fused first intermediate feature map and second intermediate feature map have the same size.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-5.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-5.
CN202210109064.XA 2022-01-28 2022-01-28 Image segmentation method, device, equipment, storage medium and product Pending CN114549535A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210109064.XA CN114549535A (en) 2022-01-28 2022-01-28 Image segmentation method, device, equipment, storage medium and product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210109064.XA CN114549535A (en) 2022-01-28 2022-01-28 Image segmentation method, device, equipment, storage medium and product

Publications (1)

Publication Number Publication Date
CN114549535A (en) 2022-05-27

Family

ID=81674274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210109064.XA Pending CN114549535A (en) 2022-01-28 2022-01-28 Image segmentation method, device, equipment, storage medium and product

Country Status (1)

Country Link
CN (1) CN114549535A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095019A (en) * 2023-10-18 2023-11-21 Tencent Technology (Shenzhen) Co., Ltd. Image segmentation method and related device
CN117095019B (en) * 2023-10-18 2024-05-10 Tencent Technology (Shenzhen) Co., Ltd. Image segmentation method and related device


Similar Documents

Publication Publication Date Title
US11321593B2 (en) Method and apparatus for detecting object, method and apparatus for training neural network, and electronic device
JP2023541532A (en) Text detection model training method and apparatus, text detection method and apparatus, electronic equipment, storage medium, and computer program
CN113570610B (en) Method and device for performing target segmentation on video by adopting semantic segmentation model
CN114187459A (en) Training method and device of target detection model, electronic equipment and storage medium
CN113570608B (en) Target segmentation method and device and electronic equipment
CN113570606A (en) Target segmentation method and device and electronic equipment
CN113538235A (en) Training method and device of image processing model, electronic equipment and storage medium
CN114187317A (en) Image matting method and device, electronic equipment and storage medium
CN114494815A (en) Neural network training method, target detection method, device, equipment and medium
CN115861462A (en) Training method and device for image generation model, electronic equipment and storage medium
CN115409855A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114202648A (en) Text image correction method, training method, device, electronic device and medium
CN112330602A (en) Intelligent trapping direction judgment method and device
CN114549535A (en) Image segmentation method, device, equipment, storage medium and product
CN114078097A (en) Method and device for acquiring image defogging model and electronic equipment
CN115019321A (en) Text recognition method, text model training method, text recognition device, text model training equipment and storage medium
CN114882313A (en) Method and device for generating image annotation information, electronic equipment and storage medium
CN114463361A (en) Network model training method, device, equipment, medium and program product
CN114842066A (en) Image depth recognition model training method, image depth recognition method and device
CN113887414A (en) Target detection method, target detection device, electronic equipment and storage medium
CN113610856A (en) Method and device for training image segmentation model and image segmentation
CN114187318A (en) Image segmentation method and device, electronic equipment and storage medium
CN113591718A (en) Target object identification method and device, electronic equipment and storage medium
CN113361535A (en) Image segmentation model training method, image segmentation method and related device
CN114383600B (en) Processing method and device for map, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination