WO2022089115A1 - Image segmentation method, apparatus, device, and storage medium - Google Patents

Image segmentation method, apparatus, device, and storage medium Download PDF

Info

Publication number
WO2022089115A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
image
sample
visual
multimodal
Prior art date
Application number
PCT/CN2021/120815
Other languages
English (en)
French (fr)
Inventor
孔涛
荆雅
李磊
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Priority to JP2023525962A priority Critical patent/JP2023547917A/ja
Priority to US18/251,228 priority patent/US20230394671A1/en
Publication of WO2022089115A1 publication Critical patent/WO2022089115A1/zh


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30108Industrial image inspection
    • G06T2207/30128Food products

Definitions

  • the present disclosure relates to the technical field of image processing, for example, to an image segmentation method, apparatus, device, and storage medium.
  • Image segmentation under language instruction is a very important technology in cross-modal learning, and is also known as referring image segmentation.
  • the goal of image segmentation under language instruction is to segment out the object specified by the language in an image. Language-directed image segmentation is more challenging because the semantic gap between the image and the language description needs to be eliminated.
  • the present disclosure provides an image segmentation method, apparatus, device, and storage medium, which can effectively segment a specified object in an image under the instruction of a description language.
  • the present disclosure provides an image segmentation method, including:
  • fusing visual features corresponding to an original image and text features corresponding to a description language to obtain multimodal features, where the description language is used to specify a target object to be segmented in the original image;
  • determining a visual area of the target object according to an image corresponding to the multimodal features, and recording an image corresponding to the visual area as a response heat map; and
  • determining a segmentation result of the target object according to the image corresponding to the multimodal features and the response heat map.
  • the present disclosure also provides an image segmentation device, including:
  • a fusion module configured to fuse visual features corresponding to the original image and text features corresponding to a description language to obtain multimodal features, wherein the description language is used to specify the target object to be segmented in the original image;
  • a visual area determination module configured to determine the visual area of the target object according to the image corresponding to the multimodal feature, and record the image corresponding to the visual area as a response heat map;
  • a segmentation result determination module configured to determine the segmentation result of the target object according to the image corresponding to the multimodal features and the response heat map.
  • the present disclosure also provides an electronic device, comprising:
  • one or more processors; and
  • a memory configured to store one or more programs,
  • the above-described image segmentation method is implemented when the one or more programs are executed by the one or more processors.
  • the present disclosure also provides a computer-readable storage medium on which a computer program is stored, which implements the above-mentioned image segmentation method when the program is executed by a processor.
  • FIG. 1 is a flowchart of an image segmentation method according to Embodiment 1 of the present disclosure;
  • FIG. 2 is a flowchart of an image segmentation method according to Embodiment 2 of the present disclosure
  • FIG. 3 is a schematic structural diagram of an image segmentation model according to Embodiment 2 of the present disclosure.
  • FIG. 4 is an implementation flowchart of an image segmentation method provided in Embodiment 2 of the present disclosure.
  • FIG. 5 is a schematic diagram of an original image according to Embodiment 2 of the present disclosure.
  • FIG. 6 is a schematic diagram of a segmentation result according to Embodiment 2 of the present disclosure.
  • FIG. 7 is a schematic diagram of a segmentation result obtained by a traditional method;
  • FIG. 8 is a schematic diagram comparing results of segmenting the same image by the image segmentation method provided by an embodiment of the present disclosure and by related technologies, according to Embodiment 2 of the present disclosure;
  • FIG. 9 is a structural diagram of an image segmentation apparatus according to Embodiment 3 of the present disclosure.
  • FIG. 10 is a structural diagram of an electronic device according to Embodiment 4 of the present disclosure.
  • method embodiments of the present disclosure may be performed in different orders and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this regard.
  • the term “including” and variations thereof are open-ended inclusions, i.e., “including but not limited to”.
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • FIG. 1 is a flowchart of an image segmentation method provided in Embodiment 1 of the present disclosure.
  • This embodiment is applicable to object segmentation in an image under language instruction, and can be applied to fields such as image editing or human-computer interaction, as well as language-driven image object detection or language-driven image understanding.
  • the method may be performed by an image segmentation apparatus, which may be implemented in software and/or hardware and configured in an electronic device. The electronic device may be a terminal with image data processing functions, for example a mobile terminal such as a mobile phone, tablet, or notebook computer, or a fixed terminal such as a desktop computer or a server. As shown in FIG. 1, the method includes the following steps:
  • the original image can be an image containing at least one object, which can be acquired by means of a camera or a scanner, or can be selected from an existing image library.
  • the visual feature may be an image feature corresponding to the original image at a set resolution.
  • the visual feature is actually an image, and this embodiment does not limit the resolution.
  • the image features of the original image at a set resolution can be extracted through a visual feature extraction network to obtain a corresponding visual feature image.
  • the visual feature extraction network can adopt the Darknet network structure or another network structure that can be used to extract visual features. Darknet is an open-source deep learning framework with a simple structure and no dependencies, which makes it more flexible for extracting visual features.
  • the description language may be the text corresponding to a referring-language description, and is used to specify the target object to be segmented in the original image.
  • the description language may include image information, position information of the target object, and appearance information of the target object.
  • the image information is used to determine the information of the image to be segmented, and may include, for example, information that uniquely identifies the image, such as the name or number of the image.
  • the position information is used to preliminarily determine the position of the target object in the original image.
  • the appearance information is used to determine the final target object.
  • for example, the description language may be “the user in image A who is holding a badminton racket and wearing red clothes”.
  • Text features can be features that reflect the meaning of the description language, and can generally be represented in the form of vectors.
  • the text features of the description language can be extracted through a text feature extraction network; this embodiment does not limit the structure of the text feature extraction network.
  • for example, a gated recurrent unit (GRU) network, which is a type of recurrent neural network, may be used.
  • the multi-modal feature is the fusion feature obtained by fusing the features of multiple modalities.
  • this embodiment fuses the visual features and the text features to obtain the multimodal features, which realizes a cross-modal feature representation and eliminates the semantic gap between the image and the description language.
  • the data of the corresponding positions of the visual feature and the text feature may be cross-multiplied to obtain a fusion feature of the visual feature and the text feature, that is, a multimodal feature.
  • a multimodal feature is actually an image, and a multimodal feature can also be referred to as a multimodal feature image or an image corresponding to a multimodal feature, that is, an image containing multimodal features.
  • the visual area is the area where the target object is located.
  • for example, when the target object is a pizza, the visual area is the area where the pizza is located.
  • the visual area of the target object is first determined, and the target object is segmented on the basis of the visual area, which can effectively reduce the complexity of image segmentation.
  • the image corresponding to the multi-modal feature may be filtered to eliminate the interference of the non-target object on the target object, so as to obtain the visual area of the target object.
  • the filtered image, that is, the image corresponding to the visual area, is recorded as the response heat map, which represents the position information of the target object.
  • each area corresponds to a response value, and a larger response value indicates a higher probability that the target object exists in that area.
  • an area whose response value is greater than a set threshold may be used as the visual area of the target object and highlighted; this embodiment does not limit the size of the set threshold.
  • the segmentation result may be determined through an image segmentation model combined with images corresponding to multimodal features and a response heat map.
  • the image segmentation model is used to determine the segmentation result of the target object.
  • the structure of the model can be set as required; for example, it can include a convolution layer and an upsampling layer.
  • the convolution layer is used to perform convolution operations on the input image, and the upsampling layer is used to upsample the convolution result to obtain the segmentation result.
  • the size of the image corresponding to the segmentation result is the same as the size of the real segmentation result in the original image.
  • the image segmentation model in this embodiment takes the image corresponding to the multimodal feature and the response heatmap as input.
  • before application, the image corresponding to the multimodal features and the response heat map can be input into the image segmentation model to train the model, so as to adjust the parameters of the convolution and upsampling layers.
  • optionally, the loss value of the segmentation result output by the image segmentation model relative to the real segmentation result corresponding to the original image can be determined.
  • when the loss value is less than a set threshold, the training ends, and the model corresponding to the loss value being less than the set threshold is used as the image segmentation model for segmenting the target object in this embodiment.
  • Embodiment 1 of the present disclosure provides an image segmentation method, which obtains multimodal features by fusing visual features corresponding to the original image and text features corresponding to a description language, where the description language is used to specify the target object to be segmented in the original image; determines the visual area of the target object according to the image corresponding to the multimodal features, and records the image corresponding to the visual area as a response heat map; and determines the segmentation result of the target object according to the image corresponding to the multimodal features and the response heat map.
  • the method decomposes the image segmentation process.
  • first, the visual area of the target object is determined based on the image corresponding to the multimodal features obtained by fusion, yielding the response heat map; then the segmentation result is determined according to the image corresponding to the multimodal features and the response heat map. This effectively eliminates the semantic gap between the image and the description language and segments out the target object specified by the description language.
  • FIG. 2 is a flowchart of an image segmentation method according to Embodiment 2 of the present disclosure. This embodiment is described on the basis of the foregoing embodiments. Referring to FIG. 2 , the method may include the following steps:
  • the number of visual features may be one or more; to improve the accuracy of the segmentation result, the visual features may include features extracted from the original image at at least two resolutions. Too few visual features will affect the accuracy of the segmentation result, and too many will increase the amount of computation. This embodiment takes three visual features as an example, which reduces the amount of computation while improving the accuracy of the segmentation result.
  • the visual features in this embodiment may include a first visual feature extracted at a first resolution from the original image (an element of R^(H×W×3)), a second visual feature extracted from the original image at a second resolution, and a third visual feature extracted from the original image at a third resolution, where first resolution < second resolution < third resolution.
  • the values of the first resolution, the second resolution, and the third resolution can be set as required; this embodiment takes each of the three resolutions as a set fraction of the resolution of the original image, where H and W are the height and width of the original image, respectively, and d_i is the channel dimension of the image corresponding to the i-th visual feature (i = 1, 2, 3).
  • multimodal features can be obtained as follows:
  • arrange the at least two visual features in order of resolution to obtain a sorting result; map the text features, through a mapping matrix, to the feature space of the first visual feature corresponding to the first resolution in the sorting result, where the first resolution has the smallest value; splice the first visual feature and the mapped text features to obtain a first splicing feature; upsample the first splicing feature and splice the upsampled first splicing feature with the second visual feature corresponding to the second resolution in the sorting result to obtain a second splicing feature; and cyclically perform the upsampling and splicing operations until the upsampled splicing feature is spliced with the visual feature with the largest resolution in the sorting result, obtaining the multimodal features. The second resolution is greater than the first resolution and is the smallest resolution other than the first resolution.
  • the text features and visual features can be mapped to the same feature space first.
  • visual features can be mapped to the space where text features are located
  • text features can also be mapped to the space where visual features are located
  • text features and visual features can be mapped to other feature spaces.
  • this embodiment takes mapping the text features to the feature space where the visual features are located as an example, which can simplify the mapping process and reduce the amount of computation.
  • when there are multiple visual features, their resolutions differ and so do the corresponding image sizes; to ensure the validity of splicing, this embodiment sorts the multiple visual features in ascending order of resolution to obtain the sorting result.
  • when splicing features, the operation proceeds in ascending order of the resolutions in the sorting result: the visual feature with the smallest resolution is first spliced with the text features, the splicing result is then upsampled and spliced with the visual feature corresponding to the next resolution in the sorting result, and so on, until splicing with the visual feature with the largest resolution is finished.
  • taking three visual features as an example, the first visual feature, the second visual feature, and the third visual feature correspond to the first resolution, the second resolution, and the third resolution, respectively, where first resolution < second resolution < third resolution.
  • the value of the corresponding position of the first visual feature and the mapped text feature can be calculated by means of cross product, so as to realize the splicing of the first visual feature and the mapped text feature, and obtain the first splicing feature.
  • as described above, the resolution of the first visual feature < the resolution of the second visual feature < the resolution of the third visual feature; that is, the resolution of the first splicing feature is smaller than the resolution of the second visual feature.
  • to splice features at the same resolution, this embodiment upsamples the first splicing feature so that the resolution of the upsampled splicing feature equals the resolution of the second visual feature, and then performs a splicing operation similar to that for the first splicing feature: the second visual feature and the upsampled first splicing feature are spliced to obtain the second splicing feature; the second splicing feature is then upsampled, and the upsampled second splicing feature is spliced with the third visual feature to obtain the third splicing feature, that is, the multimodal features.
  • the text feature and the visual feature are fused to realize the cross-modal representation of the feature, and the accuracy of the segmentation result can be improved when the target object is segmented subsequently.
  • in one example, a convolution kernel can be generated according to the text features, and the image corresponding to the multimodal features can be convolved with this kernel, realizing correlation filtering of the multimodal feature image and yielding a response value for each area; a larger response value indicates a greater possibility that the target object exists there, and the corresponding visual area can be obtained from the response values.
  • FIG. 3 is a schematic structural diagram of an initial image segmentation model provided in Embodiment 2 of the present disclosure.
  • the image segmentation model includes an input layer, parallel first convolution layers, a splicing layer, a second convolution layer, an upsampling layer, and an output layer; the number of first convolution layers can be set as required.
  • Figure 3 takes 5 first convolutional layers as an example, so that the content of the image at different scales can be better captured.
  • Each first convolution layer corresponds to a sampling rate, that is, the first convolution operations with five different sampling rates are performed on the input image respectively, and five convolution results are obtained.
  • the concatenation layer is used to concatenate these 5 convolution results.
  • the second convolutional layer is used to perform the convolution operation again on the concatenated result.
  • the upsampling layer is used to ensure that the resolution of the segmentation result output by the image segmentation model is consistent with the resolution of the real segmentation result of the original image.
  • the parameters of the first convolution layer, the stitching layer, the second convolution layer and the upsampling layer in the initial image segmentation model can be trained to obtain the target image segmentation model.
  • the training process is as follows:
  • obtain a sample image and a sample description language, and extract the sample visual features of the sample image and the sample text features of the sample description language; fuse the sample visual features and the sample text features to obtain sample multimodal features; determine the sample visual area of the sample target object according to the image corresponding to the sample multimodal features, and record the image corresponding to the sample visual area as a sample response heat map; and train an initial image segmentation model according to the image corresponding to the sample multimodal features and the sample response heat map to obtain the target image segmentation model.
  • This embodiment does not limit the number of sample images and sample description languages.
  • to improve the accuracy of the image segmentation model, multiple sets of sample images and sample description languages can be selected; the sample visual features of the sample images and the sample text features of the sample description languages are then extracted and spliced to obtain the sample multimodal features; correlation filtering is performed on the sample multimodal features to obtain the sample response heat maps; and the initial image segmentation model can thus be trained according to the images corresponding to the sample multimodal features and the sample response heat maps to obtain the target image segmentation model.
  • following the structure shown in FIG. 3, the image corresponding to the sample multimodal features and the sample response heat map can be input into the initial image segmentation model to obtain multiple first convolution results of that image and the sample response heat map; the multiple first convolution results are spliced to obtain a splicing result; a second convolution operation is performed on the splicing result to obtain a second convolution result; the second convolution result is upsampled to obtain a sample segmentation result; and the loss value of the sample segmentation result relative to the real segmentation result of the sample image is determined. When the loss value is less than a set threshold, training of the initial image segmentation model is stopped, and the image segmentation model whose loss value is less than the set threshold is used as the target image segmentation model; when the loss value is not less than the set threshold, the initial image segmentation model continues to be trained until the loss value is less than the set threshold.
  • optionally, a loss function of the following form (for example, a binary cross-entropy) can be used to determine the loss value of the sample segmentation result relative to the real segmentation result of the sample image: L = -Σ_l [ y_l·log(p_l) + (1 - y_l)·log(1 - p_l) ]
  • where L is the loss value of the sample segmentation result relative to the real segmentation result of the sample image, y_l is the element value of each area in the real segmentation result after the original image is downsampled, and p_l is the element value of each area in the sample segmentation result.
  • the size of the set threshold can be set according to the situation, for example, it can be 0.5, that is, when L ⁇ 0.5, the training ends.
  • FIG. 4 is a flowchart of an implementation of an image segmentation method according to Embodiment 2 of the present disclosure.
  • first, the original image and the description language are obtained; then the visual features of the original image at different levels and the text features corresponding to the description language are extracted.
  • FIG. 4 takes three levels as an example, corresponding to three resolutions. The first visual feature F_v1 is spliced with the mapped text features to obtain the first splicing feature F_m1; the first splicing feature F_m1 is upsampled and then spliced with the second visual feature F_v2 to obtain the second splicing feature F_m2; the second splicing feature F_m2 is upsampled and then spliced with the third visual feature F_v3 to obtain the third splicing feature F_m3, that is, the multimodal features. Correlation filtering is then performed on the multimodal features F_m3 to obtain the response heat map, and the response heat map and the image corresponding to the multimodal features F_m3 are input into the target image segmentation model to obtain the segmentation result of the target object, which is simple and effective.
  • FIG. 5 is a schematic diagram of an original image provided in Embodiment 2 of the present disclosure. Assuming the language text is “Pizza Nearest”, that is, the nearest pizza is to be segmented, the image segmentation method provided by the above embodiment yields the segmentation result shown in FIG. 6. The segmentation result obtained by the traditional method is shown in FIG. 7. Exemplarily, referring to FIG. 8, FIG. 8 is a schematic diagram comparing results of segmenting the same image by the image segmentation method provided by an embodiment of the present disclosure and by related technologies, according to Embodiment 2 of the present disclosure.
  • the first column shows three original images, the second column shows the objects segmented by the method provided by the embodiment of the present disclosure, and the third and fourth columns show the objects segmented by related technologies. It can be seen from FIGS. 6 and 8 that the objects segmented by the image segmentation method provided by the embodiment of the present disclosure are closer to the real results, improving the accuracy of the image segmentation results.
  • Embodiment 2 of the present disclosure provides an image segmentation method. On the basis of the above embodiment, the image segmentation process is decomposed: the visual area of the target object is first preliminarily determined, and an initial image segmentation model is then constructed, which reduces the complexity of the initial image segmentation model. The multimodal feature images and response heat maps are used to train the initial image segmentation model to obtain the target image segmentation model, and the target image segmentation model is then used to obtain the segmentation results. This effectively eliminates the semantic gap between the image and the description language and, to a certain extent, improves the accuracy of the segmentation results.
  • FIG. 9 is a structural diagram of an image segmentation apparatus according to Embodiment 3 of the present disclosure.
  • the apparatus can execute the image segmentation method described in the above-mentioned embodiments, and the apparatus can be integrated into an electronic device.
  • the apparatus may include:
  • a fusion module 31 configured to fuse the visual features corresponding to the original image and the text features corresponding to the description language to obtain multimodal features, where the description language is used to specify the target object to be segmented in the original image; a visual area determination module 32 configured to determine the visual area of the target object according to the image corresponding to the multimodal features, and record the image corresponding to the visual area as a response heat map; and a segmentation result determination module 33 configured to determine the segmentation result of the target object according to the image corresponding to the multimodal features and the response heat map.
  • Embodiments of the present disclosure provide an image segmentation apparatus, which obtains multimodal features by fusing visual features corresponding to an original image and text features corresponding to a description language, where the description language is used to specify a target object to be segmented in the original image; determines the visual area of the target object according to the image corresponding to the multimodal features, and records the image corresponding to the visual area as a response heat map; and determines the segmentation result of the target object according to the image corresponding to the multimodal features and the response heat map.
  • the device decomposes the image segmentation process.
  • first, the visual area of the target object is determined based on the image corresponding to the multimodal features obtained by fusion, yielding the response heat map; then the segmentation result is determined according to the image corresponding to the multimodal features and the response heat map. This effectively eliminates the semantic gap between the image and the description language and segments out the target object specified by the description language.
  • the visual area determination module 32 includes:
  • the filtering unit is configured to perform correlation filtering on the image corresponding to the multimodal feature to obtain the visual area of the target object.
  • the filtering unit is configured to: determine a convolution kernel according to the text features; and perform a convolution operation on the image corresponding to the multimodal features according to the convolution kernel to obtain the visual area of the target object.
  • the visual features include visual features extracted from the original image at at least two resolutions, respectively.
  • the fusion module 31 is configured to: arrange the at least two visual features in order of resolution to obtain a sorting result; map the text features, through a mapping matrix, to the feature space of the first visual feature corresponding to the first resolution in the sorting result, where the first resolution has the smallest value; splice the first visual feature and the mapped text features to obtain a first splicing feature; upsample the first splicing feature and splice the upsampled first splicing feature with the second visual feature corresponding to the second resolution in the sorting result to obtain a second splicing feature; and cyclically perform the upsampling and splicing operations until the upsampled splicing feature is spliced with the visual feature with the largest resolution in the sorting result, obtaining the multimodal features, where the second resolution is greater than the first resolution and is the smallest resolution other than the first resolution.
  • the segmentation result determination module 33 is configured to:
  • the image corresponding to the multimodal feature and the response heat map are input into the target image segmentation model, and the result output by the target image segmentation model is obtained as the segmentation result of the target object.
  • the training process of the target image segmentation model is as follows:
  • obtain a sample image and a sample description language, and extract sample visual features of the sample image and sample text features of the sample description language, where the sample description language is used to specify the sample target object to be segmented in the sample image; fuse the sample visual features and the sample text features to obtain sample multimodal features; determine the sample visual area of the sample target object according to the image corresponding to the sample multimodal features, and record the image corresponding to the sample visual area as a sample response heat map; and train an initial image segmentation model according to the image corresponding to the sample multimodal features and the sample response heat map to obtain the target image segmentation model.
  • the initial image segmentation model is trained according to the image corresponding to the multimodal feature of the sample and the sample response heat map, and the target image segmentation model is obtained, including:
  • the multiple first convolution results are obtained by performing first convolution operations at different sampling rates on the image corresponding to the sample multimodal features and the sample response heat map; the multiple first convolution results are spliced to obtain a splicing result; a second convolution operation is performed on the splicing result to obtain a second convolution result; the second convolution result is upsampled to obtain a sample segmentation result; the loss value of the sample segmentation result relative to the real segmentation result of the sample image is determined; when the loss value is less than a set threshold, training of the initial image segmentation model is stopped, and the image segmentation model whose loss value is less than the set threshold is used as the target image segmentation model; when the loss value is not less than the set threshold, the initial image segmentation model continues to be trained until the loss value is less than the set threshold.
  • the image segmentation apparatus provided by the embodiment of the present disclosure and the image segmentation method provided by the above embodiments belong to the same concept; for technical details not described in detail in this embodiment, refer to the above embodiments. This embodiment has the same effects as the execution of the image segmentation method.
  • Referring to FIG. 10, it shows a schematic structural diagram of an electronic device 600 suitable for implementing an embodiment of the present disclosure.
  • the electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (PADs), portable multimedia players (PMPs), and in-vehicle terminals (such as in-vehicle navigation terminals), as well as fixed terminals such as digital TVs, desktop computers, and servers.
  • the electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603.
  • in the RAM 603, various programs and data required for the operation of the electronic device 600 are also stored.
  • the processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604.
  • An Input/Output (I/O) interface 605 is also connected to the bus 604 .
  • the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 607 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage devices 608 including, for example, magnetic tape and a hard disk; and a communication device 609.
  • Communication means 609 may allow electronic device 600 to communicate wirelessly or by wire with other devices to exchange data.
  • although FIG. 10 shows an electronic device 600 having various devices, it is not required to implement or have all of the illustrated devices; more or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 609, or from the storage device 608, or from the ROM 602.
  • when the computer program is executed by the processing apparatus 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above.
  • Examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, RAM, ROM, an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • the program code embodied on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: electric wire, optical fiber cable, radio frequency (RF), etc., or any suitable combination of the above.
  • clients and servers can communicate using any currently known or future developed network protocol, such as HyperText Transfer Protocol (HTTP), and can be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • Examples of communication networks include local area networks (LANs), wide area networks (WANs), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic device: fuses the visual features corresponding to the original image and the text features corresponding to the description language to obtain multimodal features, where the description language is used to specify the target object to be segmented in the original image; determines the visual area of the target object according to the image corresponding to the multimodal features, and records the image corresponding to the visual area as a response heat map; and determines the segmentation result of the target object according to the image corresponding to the multimodal features and the response heat map.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the “C” language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a LAN or WAN, or may be connected to an external computer (e.g., through an Internet connection using an Internet service provider).
  • each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the modules involved in the embodiments of the present disclosure may be implemented in software or hardware.
  • in some cases, the name of a module does not constitute a limitation of the module itself; for example, the fusion module can also be described as “a module that fuses the visual features corresponding to the original image and the text features corresponding to the description language to obtain multimodal features”.
  • exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. Examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, RAM, ROM, EPROM or flash memory, an optical fiber, a CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the present disclosure provides an image segmentation method, including:
  • fusing visual features corresponding to an original image and text features corresponding to a description language to obtain multimodal features, where the description language is used to specify a target object to be segmented in the original image; determining a visual area of the target object according to an image corresponding to the multimodal features, and recording an image corresponding to the visual area as a response heat map; and determining a segmentation result of the target object according to the image corresponding to the multimodal features and the response heat map.
  • the determining of the visual area of the target object according to the image corresponding to the multimodal feature includes:
  • Correlation filtering is performed on the image corresponding to the multimodal feature to obtain the visual area of the target object.
  • performing correlation filtering on the image corresponding to the multimodal features to obtain the visual area of the target object includes: determining a convolution kernel according to the text features; and performing a convolution operation on the image corresponding to the multimodal features according to the convolution kernel to obtain the visual area of the target object.
  • the visual features include visual features respectively extracted from the original image at at least two resolutions.
  • fusing the visual features corresponding to the original image and the text features corresponding to the description language to obtain the multimodal features includes: arranging the at least two visual features in order of resolution to obtain a sorting result; mapping the text features, through a mapping matrix, to the feature space of the first visual feature corresponding to the first resolution in the sorting result, where the first resolution has the smallest value; splicing the first visual feature and the mapped text features to obtain a first splicing feature; upsampling the first splicing feature and splicing the upsampled first splicing feature with the second visual feature corresponding to the second resolution in the sorting result to obtain a second splicing feature; and cyclically performing the upsampling and splicing operations until the upsampled splicing feature is spliced with the visual feature with the largest resolution in the sorting result, obtaining the multimodal features, where the second resolution is greater than the first resolution and is the smallest resolution other than the first resolution.
  • the determining the segmentation result of the target object according to the image corresponding to the multimodal feature and the response heat map includes:
  • the image corresponding to the multimodal feature and the response heat map are input into the target image segmentation model, and the result output by the target image segmentation model is obtained as the segmentation result of the target object.
  • the training process of the target image segmentation model is as follows:
  • obtaining a sample image and a sample description language, and extracting sample visual features of the sample image and sample text features of the sample description language, where the sample description language is used to specify a sample target object to be segmented in the sample image; fusing the sample visual features and the sample text features to obtain sample multimodal features; determining a sample visual area of the sample target object according to an image corresponding to the sample multimodal features, and recording an image corresponding to the sample visual area as a sample response heat map; and training an initial image segmentation model according to the image corresponding to the sample multimodal features and the sample response heat map to obtain the target image segmentation model.
  • training the initial image segmentation model according to the image corresponding to the sample multimodal features and the sample response heat map to obtain the target image segmentation model includes: inputting the image corresponding to the sample multimodal features and the sample response heat map into the initial image segmentation model to obtain multiple first convolution results, where the multiple first convolution results are obtained by performing first convolution operations at different sampling rates; splicing the multiple first convolution results to obtain a splicing result; performing a second convolution operation on the splicing result to obtain a second convolution result; upsampling the second convolution result to obtain a sample segmentation result; determining the loss value of the sample segmentation result relative to the real segmentation result of the sample image; when the loss value is less than a set threshold, stopping training the initial image segmentation model and using the image segmentation model whose loss value is less than the set threshold as the target image segmentation model; and when the loss value is not less than the set threshold, continuing to train the initial image segmentation model until the loss value is less than the set threshold.
  • the present disclosure also provides an image segmentation apparatus, including:
  • a fusion module configured to fuse the visual features corresponding to the original image and the text features corresponding to the description language to obtain multimodal features, where the description language is used to specify the target object to be segmented in the original image; a visual area determination module configured to determine the visual area of the target object according to the image corresponding to the multimodal features, and record the image corresponding to the visual area as a response heat map; and a segmentation result determination module configured to determine the segmentation result of the target object according to the image corresponding to the multimodal features and the response heat map.
  • the present disclosure provides an electronic device, comprising:
  • one or more processors; and a memory configured to store one or more programs, where when the one or more programs are executed by the one or more processors, the image segmentation method provided by any embodiment of the present disclosure is implemented.
  • the present disclosure provides a computer-readable storage medium on which a computer program is stored, where when the program is executed by a processor, the image segmentation method provided by any embodiment of the present disclosure is implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
  • Character Input (AREA)

Abstract

Disclosed herein are an image segmentation method, apparatus, device, and storage medium. The image segmentation method includes: fusing visual features corresponding to an original image and text features corresponding to a description language to obtain multimodal features, where the description language is used to specify a target object to be segmented in the original image; determining a visual area of the target object according to an image corresponding to the multimodal features, and recording an image corresponding to the visual area as a response heat map; and determining a segmentation result of the target object according to the image corresponding to the multimodal features and the response heat map.

Description

Image Segmentation Method, Apparatus, Device, and Storage Medium
This application claims priority to Chinese Patent Application No. 202011197790.9, filed with the Chinese Patent Office on October 30, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of image processing, and relates, for example, to an image segmentation method, apparatus, device, and storage medium.
Background
Image segmentation under language instruction is a very important technology in cross-modal learning, and is also known as referring image segmentation. Its goal is to segment out the object specified by the language in an image. Image segmentation under language instruction is more challenging because the semantic gap between the image and the language description needs to be eliminated.
Summary
The present disclosure provides an image segmentation method, apparatus, device, and storage medium, which can effectively segment a specified object in an image under the instruction of a description language.
The present disclosure provides an image segmentation method, including:
fusing visual features corresponding to an original image and text features corresponding to a description language to obtain multimodal features, where the description language is used to specify a target object to be segmented in the original image;
determining a visual area of the target object according to an image corresponding to the multimodal features, and recording an image corresponding to the visual area as a response heat map; and
determining a segmentation result of the target object according to the image corresponding to the multimodal features and the response heat map.
The present disclosure further provides an image segmentation apparatus, including:
a fusion module configured to fuse visual features corresponding to an original image and text features corresponding to a description language to obtain multimodal features, where the description language is used to specify a target object to be segmented in the original image;
a visual area determination module configured to determine a visual area of the target object according to an image corresponding to the multimodal features, and record an image corresponding to the visual area as a response heat map; and
a segmentation result determination module configured to determine a segmentation result of the target object according to the image corresponding to the multimodal features and the response heat map.
The present disclosure further provides an electronic device, including:
one or more processors; and
a memory configured to store one or more programs,
where when the one or more programs are executed by the one or more processors, the above image segmentation method is implemented.
The present disclosure further provides a computer-readable storage medium storing a computer program, where when the program is executed by a processor, the above image segmentation method is implemented.
Brief Description of the Drawings
FIG. 1 is a flowchart of an image segmentation method according to Embodiment 1 of the present disclosure;
FIG. 2 is a flowchart of an image segmentation method according to Embodiment 2 of the present disclosure;
FIG. 3 is a schematic structural diagram of an image segmentation model according to Embodiment 2 of the present disclosure;
FIG. 4 is an implementation flowchart of an image segmentation method according to Embodiment 2 of the present disclosure;
FIG. 5 is a schematic diagram of an original image according to Embodiment 2 of the present disclosure;
FIG. 6 is a schematic diagram of a segmentation result according to Embodiment 2 of the present disclosure;
FIG. 7 is a schematic diagram of a segmentation result obtained by a traditional method;
FIG. 8 is a schematic diagram comparing results of segmenting the same image by the image segmentation method provided by an embodiment of the present disclosure and by related technologies, according to Embodiment 2 of the present disclosure;
FIG. 9 is a structural diagram of an image segmentation apparatus according to Embodiment 3 of the present disclosure;
FIG. 10 is a structural diagram of an electronic device according to Embodiment 4 of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described below with reference to the accompanying drawings. Although some embodiments of the present disclosure are shown in the drawings, the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth here; these embodiments are provided for a more thorough and complete understanding of the present disclosure.
The multiple steps recited in the method embodiments of the present disclosure may be performed in different orders and/or in parallel. In addition, the method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
As used herein, the term "including" and variations thereof are open-ended, that is, "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
Concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different apparatuses, modules, units, or operations, and are not used to limit the order or interdependence of the functions performed by these apparatuses, modules, units, or operations.
The modifiers "a/an" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive, and unless otherwise indicated in the context, should be understood as "one or more".
The names of messages or information exchanged between multiple apparatuses in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.
Embodiment 1
FIG. 1 is a flowchart of an image segmentation method according to Embodiment 1 of the present disclosure. This embodiment is applicable to object segmentation in an image under language instruction, and can be applied to fields such as image editing or human-computer interaction, as well as language-driven image object detection or language-driven image understanding. The method may be performed by an image segmentation apparatus, which may be implemented in software and/or hardware and configured in an electronic device. The electronic device may be a terminal with image data processing functions, for example a mobile terminal such as a mobile phone, tablet, or notebook computer, or a fixed terminal such as a desktop computer or a server. As shown in FIG. 1, the method includes the following steps:
S110: Fuse visual features corresponding to an original image and text features corresponding to a description language to obtain multimodal features, where the description language is used to specify a target object to be segmented in the original image.
The original image may be an image containing at least one object, and may be acquired by a camera or a scanner, or selected from an existing image library. A visual feature may be an image feature corresponding to the original image at a set resolution; a visual feature is actually an image, and this embodiment does not limit the resolution. In one example, the image features of the original image at a set resolution may be extracted through a visual feature extraction network to obtain a corresponding visual feature image. The visual feature extraction network may adopt the Darknet network structure or another network structure that can be used to extract visual features. Darknet is an open-source deep learning framework with a simple structure and no dependencies, which makes it more flexible for extracting visual features.
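For illustration only, a minimal sketch of a multi-scale visual feature extraction backbone is given below in PyTorch. It is a stand-in rather than the Darknet structure itself; the number of stages, the channel widths, and the stride-2 downsampling scheme are assumptions, not values taken from this disclosure.

```python
import torch
import torch.nn as nn

class VisualBackbone(nn.Module):
    """Extracts visual feature images from the original image at several resolutions."""

    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        stages, in_ch = [], 3
        for out_ch in channels:
            stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(0.1),
            ))
            in_ch = out_ch
        self.stages = nn.ModuleList(stages)

    def forward(self, image):
        # image: (batch, 3, H, W); each stage halves the spatial resolution,
        # so the returned maps correspond to increasingly coarse set resolutions.
        feats = []
        x = image
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats
```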
The description language may be the text corresponding to a referring-language description, and is used to specify the target object to be segmented in the original image. The description language may include image information, position information of the target object, and appearance information of the target object. The image information is used to determine the image to be segmented, and may include, for example, information that uniquely identifies the image, such as its name or number. The position information is used to preliminarily determine the position of the target object in the original image. The appearance information is used to determine the final target object. For example, the description language may be "the user in image A who is holding a badminton racket and wearing red clothes". Text features may be features that reflect the meaning of the description language, and can generally be represented as vectors. Optionally, the text features of the description language may be extracted through a text feature extraction network; this embodiment does not limit the structure of the text feature extraction network. For example, a gated recurrent unit (GRU) network, which is a type of recurrent neural network, may be used.
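As an illustration, a hedged sketch of such a GRU-based text feature extraction network follows; the vocabulary size, embedding dimension, and hidden dimension are assumed values that this disclosure does not specify.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Encodes a tokenized description language into a fixed-length text feature vector."""

    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices of the description language
        embedded = self.embedding(token_ids)     # (batch, seq_len, embed_dim)
        _, last_hidden = self.gru(embedded)      # last_hidden: (1, batch, hidden_dim)
        return last_hidden.squeeze(0)            # (batch, hidden_dim) text feature
```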
Multimodal features are fusion features obtained by fusing features of multiple modalities. In this embodiment, the visual features and the text features are fused to obtain the multimodal features, which realizes a cross-modal feature representation and eliminates the semantic gap between the image and the description language. Optionally, the data at corresponding positions of the visual features and the text features may be cross-multiplied to obtain the fusion feature of the visual features and the text features, that is, the multimodal features. A multimodal feature is actually also an image; it may also be called a multimodal feature image or an image corresponding to the multimodal features, that is, an image containing the multimodal features.
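The cross multiplication of data at corresponding positions could be sketched as follows; the learned linear layer proj that maps the text vector into the channel dimension of the visual feature is an assumption introduced here to make the two shapes compatible.

```python
import torch
import torch.nn as nn

def fuse_features(visual_feat, text_feat, proj):
    """Multiply the text feature into every spatial position of the visual feature.

    visual_feat: (batch, d, h, w) visual feature image
    text_feat:   (batch, hidden_dim) text feature vector
    proj:        nn.Linear(hidden_dim, d) mapping the text feature into the visual space
    """
    mapped = proj(text_feat)                     # (batch, d)
    mapped = mapped.unsqueeze(-1).unsqueeze(-1)  # (batch, d, 1, 1), broadcast over h, w
    return visual_feat * mapped                  # (batch, d, h, w) multimodal feature
```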
S120: Determine a visual area of the target object according to the image corresponding to the multimodal features, and record the image corresponding to the visual area as a response heat map.
The visual area is the area where the target object is located; for example, when the target object is a pizza, the visual area is the area where the pizza is located. When segmenting the target object, this embodiment first determines the visual area of the target object and segments the target object on the basis of the visual area, which can effectively reduce the complexity of image segmentation. Optionally, the image corresponding to the multimodal features may be filtered to eliminate the interference of non-target objects with the target object, obtaining the visual area of the target object. This embodiment records the filtered image, that is, the image corresponding to the visual area, as the response heat map, which represents the position information of the target object. Each area corresponds to a response value, and a larger response value indicates a higher probability that the target object exists in that area. Optionally, an area whose response value is greater than a set threshold may be used as the visual area of the target object and highlighted; this embodiment does not limit the size of the set threshold.
S130: Determine a segmentation result of the target object according to the image corresponding to the multimodal features and the response heat map.
Optionally, the segmentation result may be determined through an image segmentation model combining the image corresponding to the multimodal features and the response heat map. The image segmentation model is used to determine the segmentation result of the target object; the structure of the model can be set as required, and may include, for example, a convolution layer and an upsampling layer. The convolution layer is used to perform convolution operations on the input image, and the upsampling layer is used to upsample the convolution result to obtain the segmentation result; the size of the image corresponding to the segmentation result is the same as the size of the real segmentation result in the original image. The image segmentation model in this embodiment takes the image corresponding to the multimodal features and the response heat map as input. Before application, the image corresponding to the multimodal features and the response heat map can be input into the image segmentation model to train the model, so as to adjust the parameters of the convolution and upsampling layers. Optionally, the loss value of the segmentation result output by the image segmentation model relative to the real segmentation result corresponding to the original image can be determined; when the loss value is less than a set threshold, training ends, and the model corresponding to the loss value being less than the set threshold is used as the image segmentation model for segmenting the target object in this embodiment.
Embodiment 1 of the present disclosure provides an image segmentation method, which obtains multimodal features by fusing visual features corresponding to an original image and text features corresponding to a description language, where the description language is used to specify the target object to be segmented in the original image; determines the visual area of the target object according to the image corresponding to the multimodal features, and records the image corresponding to the visual area as a response heat map; and determines the segmentation result of the target object according to the image corresponding to the multimodal features and the response heat map. The method decomposes the image segmentation process: the visual area of the target object is first determined based on the image corresponding to the multimodal features obtained by fusion, yielding the response heat map, and the segmentation result is then determined according to the image corresponding to the multimodal features and the response heat map. This effectively eliminates the semantic gap between the image and the description language and segments out the target object specified by the description language.
Embodiment 2
FIG. 2 is a flowchart of an image segmentation method according to Embodiment 2 of the present disclosure. This embodiment is described on the basis of the above embodiment. Referring to FIG. 2, the method may include the following steps:
S210: Fuse visual features corresponding to an original image and text features corresponding to a description language to obtain multimodal features.
The number of visual features may be one or more; to improve the accuracy of the segmentation result, the visual features may include features extracted from the original image at at least two resolutions. Too few visual features will affect the accuracy of the segmentation result, and too many will increase the amount of computation. This embodiment takes three visual features as an example, which reduces the amount of computation while improving the accuracy of the segmentation result. The visual features in this embodiment may include a first visual feature F_v1 extracted at a first resolution from the original image (an element of R^(H×W×3)), a second visual feature F_v2 extracted from the original image at a second resolution, and a third visual feature F_v3 extracted from the original image at a third resolution, where first resolution < second resolution < third resolution. The values of the first resolution, the second resolution, and the third resolution can be set as required; this embodiment takes, as an example, each of the three resolutions being a set fraction of the resolution of the original image. H and W are the height and width of the original image, respectively, and d_i is the channel dimension of the image corresponding to the i-th visual feature, with i = 1, 2, 3 in this embodiment.
In one example, the multimodal features can be obtained as follows:
Arrange the at least two visual features in order of resolution to obtain a sorting result; map the text features, through a mapping matrix, to the feature space of the first visual feature corresponding to the first resolution in the sorting result, where the first resolution has the smallest value; splice the first visual feature and the mapped text features to obtain a first splicing feature; upsample the first splicing feature and splice the upsampled first splicing feature with the second visual feature corresponding to the second resolution in the sorting result to obtain a second splicing feature; and cyclically perform the upsampling and splicing operations until the upsampled splicing feature is spliced with the visual feature with the largest resolution in the sorting result, obtaining the multimodal features. The second resolution is greater than the first resolution and is the smallest resolution other than the first resolution.
Considering that the text features and the visual features have different lengths, to ensure the fusion effect, the text features and the visual features can first be mapped to the same feature space. For example, the visual features can be mapped to the space where the text features are located, the text features can be mapped to the space where the visual features are located, or both can be mapped to another feature space. This embodiment takes mapping the text features to the feature space where the visual features are located as an example, which can simplify the mapping process and reduce the amount of computation.
When there are multiple visual features, their resolutions differ and so do the corresponding image sizes. To ensure the validity of splicing, this embodiment sorts the multiple visual features in ascending order of resolution to obtain the sorting result. When splicing features, the operation proceeds in ascending order of the resolutions in the sorting result: the visual feature with the smallest resolution is first spliced with the text features, the splicing result is then upsampled and spliced with the visual feature corresponding to the next resolution in the sorting result, and so on, until splicing with the visual feature with the largest resolution is finished.
Taking three visual features as an example, namely the first visual feature, the second visual feature, and the third visual feature, which correspond to the first resolution, the second resolution, and the third resolution, respectively (first resolution < second resolution < third resolution), the values at corresponding positions of the first visual feature and the mapped text features can be computed by cross multiplication, realizing the splicing of the first visual feature and the mapped text features and obtaining the first splicing feature. As described above, the resolution of the first visual feature < the resolution of the second visual feature < the resolution of the third visual feature; that is, the resolution of the first splicing feature is smaller than the resolution of the second visual feature. To splice features at the same resolution, this embodiment upsamples the first splicing feature so that its resolution equals that of the second visual feature, and then performs a splicing operation similar to that for the first splicing feature: the second visual feature and the upsampled first splicing feature are spliced to obtain the second splicing feature; the second splicing feature is then upsampled and spliced with the third visual feature to obtain the third splicing feature, that is, the multimodal features. This embodiment fuses the text features with the visual features, realizing a cross-modal representation of the features, which can improve the accuracy of the segmentation result when the target object is subsequently segmented.
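A minimal sketch of this coarse-to-fine splicing follows; it assumes a shared channel width d across the levels, bilinear upsampling, and element-wise multiplication as the concrete reading of splicing by cross multiplication, none of which are fixed by this disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cascade_fuse(visual_feats, text_feat, text_proj):
    """Fuse the text feature with visual features from the smallest resolution upward.

    visual_feats: list of (batch, d, h_i, w_i) feature maps sorted by ascending resolution
    text_feat:    (batch, hidden_dim) text feature vector
    text_proj:    nn.Linear(hidden_dim, d) mapping the text feature into the visual space
    """
    mapped = text_proj(text_feat)[..., None, None]   # (batch, d, 1, 1)
    fused = visual_feats[0] * mapped                 # first splicing feature
    for feat in visual_feats[1:]:
        fused = F.interpolate(fused, size=feat.shape[-2:],
                              mode="bilinear", align_corners=False)  # upsample
        fused = feat * fused                         # splice with the next visual feature
    return fused                                     # final splice = multimodal feature
```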
S220: Perform correlation filtering on the image corresponding to the multimodal features to obtain the visual area of the target object.
In one example, a convolution kernel can be generated according to the text features, and a convolution operation can be performed on the image corresponding to the multimodal features according to this kernel, realizing correlation filtering of the multimodal feature image and yielding a response value for each area. An area with a larger response value is more likely to contain the target object, and the corresponding visual area can be obtained from the response values.
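A hedged sketch of such text-conditioned correlation filtering is given below; generating a 1x1 dynamic kernel with a linear layer (kernel_proj), normalizing responses with a sigmoid, and the threshold value of 0.5 are all assumptions rather than details fixed by this disclosure.

```python
import torch
import torch.nn.functional as F

def correlation_filter(multimodal_feat, text_feat, kernel_proj, threshold=0.5):
    """Filter the multimodal feature image with a kernel generated from the text feature.

    multimodal_feat: (batch, d, h, w) image corresponding to the multimodal features
    text_feat:       (batch, hidden_dim) text feature vector
    kernel_proj:     nn.Linear(hidden_dim, d) producing one 1x1xd kernel per sample
    """
    b, d, h, w = multimodal_feat.shape
    kernels = kernel_proj(text_feat).view(b, 1, d, 1, 1)   # per-sample dynamic kernels
    responses = torch.cat([
        F.conv2d(multimodal_feat[i:i + 1], kernels[i])     # (1, 1, h, w) response values
        for i in range(b)
    ])
    heat_map = torch.sigmoid(responses)        # response heat map in [0, 1]
    visual_area = heat_map > threshold         # areas likely to contain the target object
    return heat_map, visual_area
```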
S230: Record the image corresponding to the visual area as a response heat map.
S240: Input the image corresponding to the multimodal features and the response heat map into a target image segmentation model, and obtain the result output by the target image segmentation model as the segmentation result of the target object.
In this embodiment, an initial image segmentation model is designed according to the visual area to obtain a more accurate segmentation result. Exemplarily, referring to FIG. 3, FIG. 3 is a schematic structural diagram of an initial image segmentation model according to Embodiment 2 of the present disclosure. The image segmentation model includes an input layer, parallel first convolution layers, a splicing layer, a second convolution layer, an upsampling layer, and an output layer. The number of first convolution layers can be set as required; FIG. 3 takes five first convolution layers as an example, so that the content of the image at different scales can be better captured. Each first convolution layer corresponds to a sampling rate, that is, first convolution operations at five different sampling rates are performed on the input image, respectively, obtaining five convolution results. The splicing layer is used to splice these five convolution results. The second convolution layer is used to perform a convolution operation again on the spliced result. The upsampling layer is used to ensure that the resolution of the segmentation result output by the image segmentation model is consistent with the resolution of the real segmentation result of the original image.
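A minimal sketch of a segmentation model in the style of FIG. 3 follows; interpreting the five sampling rates as dilation rates, and the specific rate values and channel widths, are assumptions (in_channels should equal the channel width of the multimodal feature image plus one channel for the response heat map).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationHead(nn.Module):
    """Parallel first convolutions at several sampling rates, a splicing layer,
    a second convolution, and an upsampling layer, as in FIG. 3."""

    def __init__(self, in_channels, mid_channels=64, rates=(1, 6, 12, 18, 24)):
        super().__init__()
        # One first convolution layer per sampling (dilation) rate.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )
        self.second_conv = nn.Conv2d(mid_channels * len(rates), 1, kernel_size=1)

    def forward(self, multimodal_image, heat_map, out_size):
        x = torch.cat([multimodal_image, heat_map], dim=1)  # model input: both images
        branch_outs = [branch(x) for branch in self.branches]
        spliced = torch.cat(branch_outs, dim=1)             # splicing layer
        logits = self.second_conv(spliced)                  # second convolution layer
        return F.interpolate(logits, size=out_size,         # upsampling layer
                             mode="bilinear", align_corners=False)
```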
Before applying the image segmentation model, the parameters of the first convolution layers, the splicing layer, the second convolution layer, and the upsampling layer in the initial image segmentation model can be trained to obtain the target image segmentation model. The training process is as follows:
Obtain a sample image and a sample description language, and extract sample visual features of the sample image and sample text features of the sample description language; fuse the sample visual features and the sample text features to obtain sample multimodal features; determine the sample visual area of the sample target object according to the image corresponding to the sample multimodal features, and record the image corresponding to the sample visual area as a sample response heat map; and train the initial image segmentation model according to the image corresponding to the sample multimodal features and the sample response heat map to obtain the target image segmentation model.
This embodiment does not limit the number of sample images and sample description languages. To improve the accuracy of the image segmentation model, multiple sets of sample images and sample description languages can be selected; the sample visual features of the sample images and the sample text features of the sample description languages are then extracted and spliced to obtain the sample multimodal features; correlation filtering is performed on the sample multimodal features to obtain the sample response heat maps; and the initial image segmentation model can thus be trained according to the images corresponding to the sample multimodal features and the sample response heat maps to obtain the target image segmentation model. For the feature extraction, splicing, and filtering processes, refer to the above embodiment; details are not repeated here.
可以按照图3所示的结构,将样本多模态特征对应的图像和样本响应热度图输入初始图像分割模型,得到样本多模态特征对应的图像和样本响应热度图的多个第一卷积结果;拼接多个第一卷积结果,得到拼接结果;对拼接结果进行第二卷积操作,得到第二卷积结果;对第二卷积结果进行上采样,得到样本 分割结果;确定样本分割结果相对样本图像的真实分割结果的损失值;当损失值小于设定阈值时,停止训练初始图像分割模型,并将损失值小于设定阈值的图像分割模型作为目标图像分割模型;当损失值不小于设定阈值时,继续训练初始图像分割模型,直至损失值小于设定阈值。可选的,可以采用如下损失函数确定样本分割结果相对样本图像的真实分割结果的损失值:
$$L = -\frac{1}{N}\sum_{l}\left[y_l \log p_l + (1 - y_l)\log(1 - p_l)\right]$$
where $L$ is the loss value of the sample segmentation result relative to the real segmentation result of the sample image, $y_l$ is the element value of each region in the real segmentation result obtained by downsampling the original image, $p_l$ is the element value of each region in the sample segmentation result, and $N$ is the number of regions. The magnitude of the set threshold can be chosen as required, for example 0.5, i.e., training ends when $L < 0.5$.
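A sketch of this training procedure is given below, assuming the per-region loss above is the binary cross-entropy computed on raw logits and that training simply stops once the loss falls below the set threshold (0.5 in the example); the data loader and its batch layout are hypothetical.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(pred_logits, gt_mask):
    """pred_logits: (B, 1, H, W) raw model output; gt_mask: (B, 1, H, W)
    real segmentation result downsampled to the same size, values in {0, 1}."""
    return F.binary_cross_entropy_with_logits(pred_logits, gt_mask)

def train_until_converged(model, loader, optimizer, threshold=0.5):
    # Faithful to the stopping rule in the text: keep training until L falls
    # below the set threshold (so the loop does not terminate otherwise).
    while True:
        for multimodal_img, heat_map, gt in loader:
            optimizer.zero_grad()
            pred = model(multimodal_img, heat_map, gt.shape[-2:])
            loss = segmentation_loss(pred, gt.float())
            loss.backward()
            optimizer.step()
            if loss.item() < threshold:     # stop once L < the set threshold
                return model
```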
Exemplarily, referring to FIG. 4, which is an implementation flowchart of an image segmentation method provided in Embodiment 2 of the present disclosure: first the original image and the description language are acquired, and then the visual features of the original image at different levels and the text feature corresponding to the description language are extracted; FIG. 4 takes three levels as an example, corresponding to three resolutions. The first visual feature F_v1 is spliced with the mapped text feature to obtain the first spliced feature F_m1; F_m1 is upsampled and spliced with the second visual feature F_v2 to obtain the second spliced feature F_m2; F_m2 is upsampled and spliced with the third visual feature F_v3 to obtain the third spliced feature F_m3, i.e., the multimodal feature. Correlation filtering is then performed on the multimodal feature F_m3 to obtain the response heat map, and the response heat map together with the image corresponding to the multimodal feature F_m3 is input into the target image segmentation model to obtain the segmentation result of the target object, which is simple and effective.
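For illustration, the sketches above can be strung together into a single forward pass mirroring FIG. 4; every shape below (backbone channel counts, the 300-dimensional text embedding, the 320×320 output size) is a placeholder, not a value given by this embodiment.

```python
import torch
import torch.nn as nn

feats = [torch.randn(1, 512, 10, 10),   # F_v1, lowest resolution
         torch.randn(1, 256, 20, 20),   # F_v2
         torch.randn(1, 128, 40, 40)]   # F_v3, highest resolution
text = torch.randn(1, 300)              # embedding of e.g. "Pizza Nearest"

text_proj = nn.Linear(300, 512)         # maps the text feature into F_v1's space
fused = fuse_features(feats, text, text_proj)        # (1, 896, 40, 40) = F_m3

kernel_gen = nn.Linear(300, fused.shape[1])          # text-derived 1x1 kernel
heat = correlation_filter(fused, text, kernel_gen)   # (1, 1, 40, 40) heat map

head = SegmentationHead(in_ch=fused.shape[1] + 1)
logits = head(fused, heat, out_size=(320, 320))      # segmentation result
```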
Exemplarily, referring to FIG. 5, which is a schematic diagram of an original image provided in Embodiment 2 of the present disclosure: assuming the language text is "Pizza Nearest", i.e., the nearest pizza is to be segmented, the segmentation result shown in FIG. 6 can be obtained with the image segmentation method provided in the above embodiment, while the segmentation result obtained by a conventional method is shown in FIG. 7. Exemplarily, referring to FIG. 8, which is a schematic comparison of the results of segmenting the same images using the image segmentation method provided by the embodiments of the present disclosure and using related technologies: the first column shows three original images, the second column shows the objects segmented by the method provided by the embodiments of the present disclosure, and the third and fourth columns show the objects segmented by related technologies. As can be seen from FIGS. 6 and 8, the objects segmented by the image segmentation method provided by the embodiments of the present disclosure are closer to the real results, improving the accuracy of the image segmentation results.
Embodiment 2 of the present disclosure provides an image segmentation method. On the basis of the above embodiment, the image segmentation process is decomposed: the visual region of the target object is first preliminarily determined, and an initial image segmentation model is then constructed, which reduces the complexity of the initial image segmentation model. The initial image segmentation model is trained with the multimodal feature image and the response heat map to obtain the target image segmentation model, which is then used to obtain the segmentation result. This effectively bridges the semantic gap between the image and the description language and, to a certain extent, improves the accuracy of the segmentation result.
Embodiment 3
FIG. 9 is a structural diagram of an image segmentation apparatus provided in Embodiment 3 of the present disclosure. The apparatus can perform the image segmentation method described in the above embodiments and can be integrated in an electronic device. Referring to FIG. 9, the apparatus may include:
a fusion module 31, configured to fuse the visual feature corresponding to an original image and the text feature corresponding to a description language to obtain a multimodal feature, where the description language is used to specify the target object to be segmented in the original image; a visual region determination module 32, configured to determine the visual region of the target object according to the image corresponding to the multimodal feature, and denote the image corresponding to the visual region as a response heat map; and a segmentation result determination module 33, configured to determine the segmentation result of the target object according to the image corresponding to the multimodal feature and the response heat map.
This embodiment of the present disclosure provides an image segmentation apparatus that fuses the visual feature corresponding to an original image and the text feature corresponding to a description language to obtain a multimodal feature, where the description language is used to specify the target object to be segmented in the original image; determines the visual region of the target object according to the image corresponding to the multimodal feature and denotes the image corresponding to the visual region as a response heat map; and determines the segmentation result of the target object according to the image corresponding to the multimodal feature and the response heat map. The apparatus decomposes the image segmentation process: it first determines the visual region of the target object based on the image corresponding to the fused multimodal feature to obtain the response heat map, and then determines the segmentation result according to the image corresponding to the multimodal feature and the response heat map, which effectively bridges the semantic gap between the image and the description language and segments the target object specified by the description language.
On the basis of the above embodiments, the visual region determination module 32 includes:
a filtering unit, configured to perform correlation filtering on the image corresponding to the multimodal feature to obtain the visual region of the target object.
On the basis of the above embodiments, the filtering unit is configured to:
determine a convolution kernel according to the text feature, and perform a convolution operation on the image corresponding to the multimodal feature according to the convolution kernel to obtain the visual region of the target object.
On the basis of the above embodiments, the visual features include visual features extracted from the original image at at least two resolutions respectively.
On the basis of the above embodiments, the fusion module 31 is configured to:
arrange the at least two visual features in order of resolution to obtain a sorting result; map the text feature, through a mapping matrix, into the feature space of the first visual feature corresponding to the first resolution in the sorting result, where the first resolution has the smallest value; splice the first visual feature and the mapped text feature to obtain a first spliced feature; upsample the first spliced feature, and splice the upsampled first spliced feature with the second visual feature corresponding to the second resolution in the sorting result to obtain a second spliced feature; and cyclically perform the upsampling and splicing operations until the upsampled spliced feature is spliced with the visual feature of the largest resolution in the sorting result to obtain the multimodal feature, where the second resolution is greater than the first resolution and is the smallest resolution other than the first resolution.
On the basis of the above embodiments, the segmentation result determination module 33 is configured to:
input the image corresponding to the multimodal feature and the response heat map into a target image segmentation model, and acquire the result output by the target image segmentation model as the segmentation result of the target object.
On the basis of the above embodiments, the training process of the target image segmentation model is as follows:
acquire a sample image and a sample description language, and extract the sample visual feature of the sample image and the sample text feature of the sample description language, where the sample description language is used to specify the sample target object to be segmented in the sample image; fuse the sample visual feature and the sample text feature to obtain a sample multimodal feature; determine the sample visual region of the sample target object according to the image corresponding to the sample multimodal feature, and denote the image corresponding to the sample visual region as a sample response heat map; and train an initial image segmentation model according to the image corresponding to the sample multimodal feature and the sample response heat map to obtain the target image segmentation model.
On the basis of the above embodiments, training the initial image segmentation model according to the image corresponding to the sample multimodal feature and the sample response heat map to obtain the target image segmentation model includes:
inputting the image corresponding to the sample multimodal feature and the sample response heat map into the initial image segmentation model to obtain multiple first convolution results of the image corresponding to the sample multimodal feature and the sample response heat map, where the multiple first convolution results are obtained by performing first convolution operations at different sampling rates on the image corresponding to the sample multimodal feature and the sample response heat map; splicing the multiple first convolution results to obtain a spliced result; performing a second convolution operation on the spliced result to obtain a second convolution result; upsampling the second convolution result to obtain a sample segmentation result; determining the loss value of the sample segmentation result relative to the real segmentation result of the sample image; when the loss value is less than a set threshold, stopping training the initial image segmentation model and taking the image segmentation model whose loss value is less than the set threshold as the target image segmentation model; and when the loss value is not less than the set threshold, continuing to train the initial image segmentation model until the loss value is less than the set threshold.
The image segmentation apparatus provided by this embodiment of the present disclosure belongs to the same concept as the image segmentation method provided by the above embodiments; for technical details not exhaustively described in this embodiment, reference may be made to the above embodiments, and this embodiment achieves the same effects as performing the image segmentation method.
Embodiment 4
Referring now to FIG. 10, which shows a schematic structural diagram of an electronic device 600 suitable for implementing embodiments of the present disclosure. The electronic device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (PADs), portable multimedia players (PMPs) and vehicle-mounted terminals (e.g., vehicle-mounted navigation terminals), as well as fixed terminals such as digital televisions (TVs), desktop computers and servers. The electronic device shown in FIG. 10 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 10, the electronic device 600 may include a processing apparatus (e.g., a central processing unit, a graphics processing unit, etc.) 601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage apparatus 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600. The processing apparatus 601, the ROM 602 and the RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following apparatuses may be connected to the I/O interface 605: an input apparatus 606 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer and a gyroscope; an output apparatus 607 including, for example, a liquid crystal display (LCD), a speaker and a vibrator; a storage apparatus 608 including, for example, a magnetic tape and a hard disk; and a communication apparatus 609. The communication apparatus 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 10 shows an electronic device 600 having various apparatuses, it is not required to implement or provide all of the apparatuses shown; more or fewer apparatuses may alternatively be implemented or provided.
According to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network via the communication apparatus 609, installed from the storage apparatus 608, or installed from the ROM 602. When the computer program is executed by the processing apparatus 601, the above functions defined in the methods of the embodiments of the present disclosure are performed.
Embodiment 5
The computer-readable medium of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. Examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a RAM, a ROM, an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by, or in combination with, an instruction execution system, apparatus or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and may send, propagate or transmit a program for use by, or in combination with, an instruction execution system, apparatus or device. The program code contained on a computer-readable medium may be transmitted by any suitable medium, including but not limited to: an electric wire, an optical cable, radio frequency (RF) and the like, or any suitable combination of the above.
In some implementations, the client and the server may communicate using any currently known or future-developed network protocol, such as the HyperText Transfer Protocol (HTTP), and may be interconnected with digital data communication (e.g., a communication network) in any form or medium. Examples of communication networks include a local area network (LAN), a wide area network (WAN), an internetwork (e.g., the Internet) and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
The above computer-readable medium may be contained in the above electronic device, or may exist separately without being assembled into the electronic device.
The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: fuse the visual feature corresponding to an original image and the text feature corresponding to a description language to obtain a multimodal feature, where the description language is used to specify the target object to be segmented in the original image; determine the visual region of the target object according to the image corresponding to the multimodal feature, and denote the image corresponding to the visual region as a response heat map; and determine the segmentation result of the target object according to the image corresponding to the multimodal feature and the response heat map.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a LAN or a WAN, or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions and operations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment or portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings; for example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules involved in the embodiments described in the present disclosure may be implemented in software or in hardware. The name of a module does not, in some cases, constitute a limitation on the module itself; for example, the fusion module may also be described as "a module that fuses the visual feature corresponding to an original image and the text feature corresponding to a language text to obtain a multimodal feature".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard parts (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by, or in combination with, an instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared or semiconductor systems, apparatuses or devices, or any suitable combination of the above. Examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a RAM, a ROM, an EPROM or flash memory, an optical fiber, a CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the above.
According to one or more embodiments of the present disclosure, the present disclosure provides an image segmentation method, including:
fusing the visual feature corresponding to an original image and the text feature corresponding to a description language to obtain a multimodal feature, where the description language is used to specify the target object to be segmented in the original image; determining the visual region of the target object according to the image corresponding to the multimodal feature, and denoting the image corresponding to the visual region as a response heat map; and determining the segmentation result of the target object according to the image corresponding to the multimodal feature and the response heat map.
According to one or more embodiments of the present disclosure, in the image segmentation method provided by the present disclosure, determining the visual region of the target object according to the image corresponding to the multimodal feature includes:
performing correlation filtering on the image corresponding to the multimodal feature to obtain the visual region of the target object.
According to one or more embodiments of the present disclosure, in the image segmentation method provided by the present disclosure, performing correlation filtering on the image corresponding to the multimodal feature to obtain the visual region of the target object includes:
determining a convolution kernel according to the text feature, and performing a convolution operation on the image corresponding to the multimodal feature according to the convolution kernel to obtain the visual region of the target object.
According to one or more embodiments of the present disclosure, in the image segmentation method provided by the present disclosure, the visual feature includes visual features extracted from the original image at at least two resolutions respectively.
According to one or more embodiments of the present disclosure, in the image segmentation method provided by the present disclosure, fusing the visual feature corresponding to the original image and the text feature corresponding to the description language to obtain the multimodal feature includes:
arranging the at least two visual features in order of resolution to obtain a sorting result; mapping the text feature, through a mapping matrix, into the feature space of the first visual feature corresponding to the first resolution in the sorting result, where the first resolution has the smallest value; splicing the first visual feature and the mapped text feature to obtain a first spliced feature; upsampling the first spliced feature, and splicing the upsampled first spliced feature with the second visual feature corresponding to the second resolution in the sorting result to obtain a second spliced feature; and cyclically performing the upsampling and splicing operations until the upsampled spliced feature is spliced with the visual feature of the largest resolution in the sorting result to obtain the multimodal feature, where the second resolution is greater than the first resolution and is the smallest resolution other than the first resolution.
According to one or more embodiments of the present disclosure, in the image segmentation method provided by the present disclosure, determining the segmentation result of the target object according to the image corresponding to the multimodal feature and the response heat map includes:
inputting the image corresponding to the multimodal feature and the response heat map into a target image segmentation model, and acquiring the result output by the target image segmentation model as the segmentation result of the target object.
According to one or more embodiments of the present disclosure, in the image segmentation method provided by the present disclosure, the training process of the target image segmentation model is as follows:
acquiring a sample image and a sample description language, and extracting the sample visual feature of the sample image and the sample text feature of the sample description language, where the sample description language is used to specify the sample target object to be segmented in the sample image; fusing the sample visual feature and the sample text feature to obtain a sample multimodal feature; determining the sample visual region of the sample target object according to the image corresponding to the sample multimodal feature, and denoting the image corresponding to the sample visual region as a sample response heat map; and training an initial image segmentation model according to the image corresponding to the sample multimodal feature and the sample response heat map to obtain the target image segmentation model.
According to one or more embodiments of the present disclosure, in the image segmentation method provided by the present disclosure, training the initial image segmentation model according to the image corresponding to the sample multimodal feature and the sample response heat map to obtain the target image segmentation model includes:
inputting the image corresponding to the sample multimodal feature and the sample response heat map into the initial image segmentation model to obtain a plurality of first convolution results of the image corresponding to the sample multimodal feature and the sample response heat map, where the plurality of first convolution results are obtained by performing first convolution operations at different sampling rates on the image corresponding to the sample multimodal feature and the sample response heat map; splicing the plurality of first convolution results to obtain a spliced result; performing a second convolution operation on the spliced result to obtain a second convolution result; upsampling the second convolution result to obtain a sample segmentation result; determining the loss value of the sample segmentation result relative to the real segmentation result of the sample image; when the loss value is less than a set threshold, stopping training the initial image segmentation model and taking the image segmentation model whose loss value is less than the set threshold as the target image segmentation model; and when the loss value is not less than the set threshold, continuing to train the initial image segmentation model until the loss value is less than the set threshold.
According to one or more embodiments of the present disclosure, the present disclosure provides an image segmentation apparatus, including:
a fusion module, configured to fuse the visual feature corresponding to an original image and the text feature corresponding to a description language to obtain a multimodal feature, where the description language is used to specify the target object to be segmented in the original image; a visual region determination module, configured to determine the visual region of the target object according to the image corresponding to the multimodal feature, and denote the image corresponding to the visual region as a response heat map; and a segmentation result determination module, configured to determine the segmentation result of the target object according to the image corresponding to the multimodal feature and the response heat map.
According to one or more embodiments of the present disclosure, the present disclosure provides an electronic device, including:
one or more processors; and a memory configured to store one or more programs, where the one or more programs, when executed by the one or more processors, implement the image segmentation method provided by any embodiment of the present disclosure.
According to one or more embodiments of the present disclosure, the present disclosure provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the image segmentation method provided by any embodiment of the present disclosure.
Furthermore, although multiple operations are depicted in a particular order, this should not be understood as requiring that these operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although the above discussion contains several implementation details, these should not be construed as limitations on the scope of the present disclosure. Some features described in the context of separate embodiments can also be implemented in combination in a single embodiment; conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments, either separately or in any suitable sub-combination.

Claims (11)

  1. An image segmentation method, comprising:
    fusing a visual feature corresponding to an original image and a text feature corresponding to a description language to obtain a multimodal feature, wherein the description language is used to specify a target object to be segmented in the original image;
    determining a visual region of the target object according to an image corresponding to the multimodal feature, and denoting an image corresponding to the visual region as a response heat map;
    determining a segmentation result of the target object according to the image corresponding to the multimodal feature and the response heat map.
  2. The method according to claim 1, wherein determining the visual region of the target object according to the image corresponding to the multimodal feature comprises:
    performing correlation filtering on the image corresponding to the multimodal feature to obtain the visual region of the target object.
  3. The method according to claim 2, wherein performing correlation filtering on the image corresponding to the multimodal feature to obtain the visual region of the target object comprises:
    determining a convolution kernel according to the text feature;
    performing a convolution operation on the image corresponding to the multimodal feature according to the convolution kernel to obtain the visual region of the target object.
  4. The method according to claim 1, wherein the visual feature comprises visual features extracted from the original image at at least two resolutions respectively.
  5. The method according to claim 4, wherein fusing the visual feature corresponding to the original image and the text feature corresponding to the description language to obtain the multimodal feature comprises:
    arranging the at least two visual features in order of resolution to obtain a sorting result, and mapping the text feature, through a mapping matrix, into a feature space of a first visual feature corresponding to a first resolution in the sorting result, wherein the first resolution has the smallest value;
    splicing the first visual feature and the mapped text feature to obtain a first spliced feature;
    upsampling the first spliced feature, splicing the upsampled first spliced feature with a second visual feature corresponding to a second resolution in the sorting result to obtain a second spliced feature, and cyclically performing the upsampling and splicing operations until the upsampled spliced feature is spliced with the visual feature of the largest resolution in the sorting result to obtain the multimodal feature, wherein the second resolution is greater than the first resolution and is the smallest resolution other than the first resolution.
  6. The method according to any one of claims 1-5, wherein determining the segmentation result of the target object according to the image corresponding to the multimodal feature and the response heat map comprises:
    inputting the image corresponding to the multimodal feature and the response heat map into a target image segmentation model, and acquiring a result output by the target image segmentation model as the segmentation result of the target object.
  7. The method according to claim 6, wherein a training process of the target image segmentation model is as follows:
    acquiring a sample image and a sample description language, and extracting a sample visual feature of the sample image and a sample text feature of the sample description language, wherein the sample description language is used to specify a sample target object to be segmented in the sample image;
    fusing the sample visual feature and the sample text feature to obtain a sample multimodal feature;
    determining a sample visual region of the sample target object according to an image corresponding to the sample multimodal feature, and denoting an image corresponding to the sample visual region as a sample response heat map;
    training an initial image segmentation model according to the image corresponding to the sample multimodal feature and the sample response heat map to obtain the target image segmentation model.
  8. The method according to claim 7, wherein training the initial image segmentation model according to the image corresponding to the sample multimodal feature and the sample response heat map to obtain the target image segmentation model comprises:
    inputting the image corresponding to the sample multimodal feature and the sample response heat map into the initial image segmentation model to obtain a plurality of first convolution results of the image corresponding to the sample multimodal feature and the sample response heat map, wherein the plurality of first convolution results are obtained by performing first convolution operations at different sampling rates on the image corresponding to the sample multimodal feature and the sample response heat map;
    splicing the plurality of first convolution results to obtain a spliced result;
    performing a second convolution operation on the spliced result to obtain a second convolution result;
    upsampling the second convolution result to obtain a sample segmentation result;
    determining a loss value of the sample segmentation result relative to a real segmentation result of the sample image;
    in a case where the loss value is less than a set threshold, stopping training the initial image segmentation model, and taking the image segmentation model whose loss value is less than the set threshold as the target image segmentation model; in a case where the loss value is not less than the set threshold, continuing to train the initial image segmentation model until the loss value is less than the set threshold.
  9. An image segmentation apparatus, comprising:
    a fusion module, configured to fuse a visual feature corresponding to an original image and a text feature corresponding to a description language to obtain a multimodal feature, wherein the description language is used to specify a target object to be segmented in the original image;
    a visual region determination module, configured to determine a visual region of the target object according to an image corresponding to the multimodal feature, and denote an image corresponding to the visual region as a response heat map;
    a segmentation result determination module, configured to determine a segmentation result of the target object according to the image corresponding to the multimodal feature and the response heat map.
  10. An electronic device, comprising:
    at least one processor; and
    a memory configured to store at least one program;
    wherein the at least one program, when executed by the at least one processor, implements the image segmentation method according to any one of claims 1-8.
  11. A computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the image segmentation method according to any one of claims 1-8.
Kind code of ref document: A1