CN116958533A - Image content segmentation method, device, apparatus, storage medium and program product - Google Patents

Image content segmentation method, device, apparatus, storage medium and program product

Info

Publication number
CN116958533A
Authority
CN
China
Prior art keywords
image
content
target image
representation
classification
Prior art date
Legal status
Pending
Application number
CN202211504318.4A
Other languages
Chinese (zh)
Inventor
祁忠琪
张瑞
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211504318.4A
Publication of CN116958533A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image content segmentation method, apparatus, device, storage medium and program product, and relates to the field of image processing. The method comprises the following steps: acquiring a target image to be subjected to specified image content segmentation; extracting features of the target image to obtain an image feature representation; classifying and identifying the target image based on the image feature representation to obtain an image classification result; performing dimension conversion on the image classification result to obtain a classification feature representation; fusing the image feature representation and the classification feature representation to obtain a fusion feature representation; and segmenting the specified image content in the target image based on the fusion feature representation to obtain a specified content segmentation result. In this way, the classification information carried in the fusion feature representation effectively prevents easily confused image content in the target image from misleading the image segmentation process, which improves the accuracy of the specified content segmentation result. The method can be applied to various scenarios such as cloud technology, artificial intelligence and intelligent transportation.

Description

Image content segmentation method, device, apparatus, storage medium and program product
Technical Field
Embodiments of the present application relate to the field of image processing, and in particular, to an image content segmentation method, apparatus, device, storage medium, and program product.
Background
Image segmentation is a technique for separating specified content from an image. With the rise of intelligent devices, it is frequently applied in image segmentation scenarios such as portrait background blurring, background replacement and matting.
In the related art, a semantic segmentation network is generally used to analyze the distribution of key points in the image to be analyzed, and whether the image contains the specified content is predicted from the contour formed by the key points. For example, if the image contains a key-point contour corresponding to a portrait, it is judged that the image includes portrait content (the specified content).
When the image is analyzed in this way, the region enclosed by the key points may resemble the key-point contour of the specified content even though the specified content is not actually present in the image. For example, the key points of a three-dimensional humanoid clothes hanger contained in an image are highly similar to the key-point contour of portrait content, yet the hanger contains no portrait content; the semantic segmentation network therefore makes a large error in its judgment, which affects the accuracy of the analysis result.
Disclosure of Invention
The embodiments of the present application provide an image content segmentation method, apparatus, device, storage medium and program product, which use the classification information carried in the fusion feature representation to effectively prevent easily confused image content in the target image from misleading the image segmentation process, thereby improving the accuracy of the specified content segmentation result. The technical solution is as follows.
In one aspect, there is provided an image content segmentation method, the method comprising:
acquiring a target image to be subjected to designated image content segmentation;
extracting features of the target image to obtain an image feature representation corresponding to the target image;
classifying and identifying the target image based on the image characteristic representation to obtain an image classification result corresponding to the target image, wherein the image classification result is used for indicating the inclusion condition of the specified image content in the target image;
performing dimension conversion on the image classification result to obtain classification characteristic representation;
fusing the image characteristic representation with the classification characteristic representation to obtain a fusion characteristic representation;
and segmenting the specified image content in the target image based on the fusion feature representation to obtain a specified content segmentation result.
In another aspect, there is provided an image content segmentation apparatus, the apparatus including:
the acquisition module is used for acquiring a target image to be subjected to specified image content segmentation;
the extraction module is used for extracting the characteristics of the target image to obtain an image characteristic representation corresponding to the target image;
the classification module is used for classifying and identifying the target image based on the image characteristic representation to obtain an image classification result corresponding to the target image, wherein the image classification result is used for indicating the inclusion condition of the specified image content in the target image;
the conversion module is used for carrying out dimension conversion on the image classification result to obtain classification characteristic representation;
the fusion module is used for fusing the image characteristic representation with the classification characteristic representation to obtain a fusion characteristic representation;
and the segmentation module is used for segmenting the specified image content in the target image based on the fusion characteristic representation to obtain a specified content segmentation result.
In another aspect, a computer device is provided, where the computer device includes a processor and a memory, where the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, where the at least one instruction, the at least one program, the set of codes, or the set of instructions are loaded and executed by the processor to implement the image content segmentation method according to any one of the embodiments of the present application.
In another aspect, a computer readable storage medium is provided, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by a processor to implement the image content segmentation method according to any one of the embodiments of the application described above.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the image content segmentation method according to any one of the above embodiments.
The technical solutions provided by the embodiments of the present application have at least the following beneficial effects:
after feature extraction is performed on the obtained target image to obtain an image feature representation, the target image is classified and identified based on the image feature representation, and the image classification result is used as prior information; when the specified image content is segmented from the target image, the classification feature representation obtained by dimension conversion of the image classification result is fused with the image feature representation to obtain a fusion feature representation containing both image information and classification information. The fusion feature representation focuses attention on the image information of the channel corresponding to the image classification result, so the specified image content can be separated from the target image more accurately; moreover, the classification information in the fusion feature representation effectively prevents easily confused image content in the target image from misleading the image segmentation process, which fully improves the accuracy of the specified content segmentation result.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 2 is a flow chart of a method for image content segmentation provided in an exemplary embodiment of the present application;
FIG. 3 is a flowchart of an image content segmentation method provided by another exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of an application image content segmentation method provided by an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of a result of image segmentation of a target image according to an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of a result of image segmentation of a target image according to another exemplary embodiment of the present application;
FIG. 7 is a flowchart of an image content segmentation method provided by yet another exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of a training target image segmentation model provided in an exemplary embodiment of the present application;
FIG. 9 is a flowchart of training and obtaining a portrait segmentation model provided by an exemplary embodiment of the present application;
FIG. 10 is a schematic diagram of processing image segmentation content provided by an exemplary embodiment of the present application;
FIG. 11 is a block diagram showing the structure of an image content segmentation apparatus according to an exemplary embodiment of the present application;
FIG. 12 is a block diagram showing the structure of an image content segmentation apparatus according to another exemplary embodiment of the present application;
FIG. 13 is a block diagram of a terminal according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
First, a brief description will be given of terms involved in the embodiments of the present application.
Artificial intelligence (Artificial Intelligence, AI): the theory, method, technology and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Image segmentation technique (Image Segmentation): an important research direction in the field of computer vision and an important part of image semantic understanding. Image segmentation refers to the process of dividing an image into several regions with similar properties, where the resulting regions do not intersect each other. With the development of deep learning technology, techniques derived from image segmentation, such as portrait background segmentation, scene object segmentation and three-dimensional reconstruction, have been widely applied in fields such as unmanned driving, augmented reality and security.
In the related art, a semantic segmentation network is generally used to analyze the distribution of key points in the image to be analyzed, and whether the image contains the specified content is predicted from the contour formed by the key points. For example, if the image contains a key-point contour corresponding to a portrait, it is judged that the image includes portrait content (the specified content). When the image is analyzed in this way, the region enclosed by the key points may resemble the key-point contour of the specified content even though the specified content is not actually present in the image. For example, the key points of a three-dimensional humanoid clothes hanger contained in an image are highly similar to the key-point contour of portrait content, yet the hanger contains no portrait content; the semantic segmentation network therefore makes a large error in its judgment, which affects the accuracy of the analysis result.
In the embodiments of the present application, the portrait background segmentation technique within the image segmentation techniques is taken as an example. When predicting whether portrait content exists in a target image, feature extraction is first performed on the target image to obtain an image feature representation corresponding to the target image; then, the target image is classified and identified based on the image feature representation to obtain an image classification result that predicts whether portrait content exists in the target image; next, dimension conversion is performed on the image classification result to obtain a classification feature representation, and the image feature representation and the classification feature representation are fused to obtain a fusion feature representation. On the basis of this classification prior, the fusion feature representation is used to segment the portrait content in the target image, so that a portrait content segmentation result with higher accuracy is obtained, such as: predicting that no portrait content exists in the target image; or predicting that portrait content exists in the target image; or predicting the position of the portrait content in the target image, and the like.
The embodiments of the present application provide an image content segmentation method, which uses the classification information in the fusion feature representation to effectively prevent easily confused image content in the target image from misleading the image segmentation process and improves the accuracy of the specified content segmentation result. When the image content segmentation method obtained by training is applied, the application scenarios include at least one of various image segmentation scenarios such as a portrait background blurring scenario, a background replacement scenario, a matting scenario and an image special-effect generation scenario. It should be noted that the above application scenarios are merely illustrative examples, and the image content segmentation method provided in this embodiment may also be applied to other scenarios, which is not limited in the embodiments of the present application.
It should be noted that, the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals related to the present application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of the related data is required to comply with the relevant laws and regulations and standards of the relevant countries and regions. For example, the image data such as the target image in the present application is acquired with sufficient authorization.
Next, an implementation environment according to an embodiment of the present application will be described. The image content segmentation method provided by the embodiment of the application can be implemented by the terminal alone or by the server, or by the terminal and the server through data interaction, and the embodiment of the application is not limited to the above. Alternatively, an example in which the terminal individually performs the image content dividing method and analyzes the target image is explained.
Referring to fig. 1 for illustrative purposes, the implementation environment is described with respect to terminal 110. In some embodiments, an application program with an image acquisition function is installed in the terminal 110, and is used for acquiring a target image, so that a subsequent image content segmentation method is performed based on the target image, wherein the target image is an image to be subjected to specified image content segmentation.
Alternatively, the terminal 110 performs feature extraction on the target image, thereby obtaining an image feature representation corresponding to the target image. Then, the terminal 110 performs classification recognition on the target image based on the image feature representation to predict whether or not the specified image content exists in the target image, and obtains an image classification result. Namely: the image classification result is used to indicate the inclusion of the specified image content in the target image. Subsequently, the terminal 110 performs dimension conversion on the image classification result to obtain a classification characteristic representation; and carrying out feature fusion on the image feature representation and the classification feature representation to obtain a fusion feature representation.
The terminal 110 performs content segmentation on the specified image content in the target image based on the fusion feature representation, resulting in a specified content segmentation result. Alternatively, the terminal 110 displays the specified content division result.
It should be noted that the above-mentioned terminals include, but are not limited to, mobile terminals such as mobile phones, tablet computers, portable laptop computers, intelligent voice interaction devices, intelligent home appliances, vehicle-mounted terminals, and the like, and may also be implemented as desktop computers and the like; the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), basic cloud computing services such as big data and artificial intelligence platforms.
In some embodiments, the servers described above may also be implemented as nodes in a blockchain system.
The image content segmentation method provided by the present application is described with reference to the above noun introduction and application scenario, and is applied to a server, for example, as shown in fig. 2, and the method includes the following steps 210 to 260.
Step 210, obtaining a target image to be subjected to specified image content segmentation.
Illustratively, the specified image content is preselected image content. For example: portrait content is preselected as the specified image content; or a building is preselected as the specified image content; or animal A is preselected as the specified image content, and the like.
The target image is an image obtained by any means. Schematically, obtaining a target image through an image acquisition device; or obtaining a target image from a pre-stored image library; or obtaining a target image through a drawing tool; alternatively, the target image or the like is synthesized by the image synthesis application.
The target image is used for image content segmentation to identify whether specified image content exists in the target image. Alternatively, when the specified image content exists in the target image, position information of the specified image content in the target image is determined, and the like.
In some embodiments, the image acquisition device is used to acquire the target image, and the portrait content is preset as the designated image content, so that when the image content is segmented, the portrait content in the target image is segmented.
And 220, extracting the characteristics of the target image to obtain an image characteristic representation corresponding to the target image.
Schematically, after the target image is obtained, the target image is passed through a feature extraction network to obtain the image feature representation corresponding to the target image. For example: feature extraction is performed on the target image using a backbone neural network such as a Deep Residual Network (ResNet) or a Visual Geometry Group (VGG) network.
Optionally, the image feature representation is a deep feature representation, i.e., a low-resolution, information-rich feature representation obtained after multiple convolution operations in the neural network.
In some embodiments, feature extraction is performed on the target image by a predetermined encoder. Illustratively, the encoder is built by serially connecting depthwise separable convolutions as its basic structure; the target image is used as the input of the encoder, and convolution and downsampling operations are applied to it repeatedly inside the encoder, so that an image feature representation with small resolution and rich semantic information is extracted from the target image.
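As an illustrative sketch only, one depthwise separable downsampling block of the kind described above could be written in PyTorch as follows; the kernel size, channel counts and normalization are assumptions for illustration rather than details of this disclosure.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableDown(nn.Module):
    """One encoder stage: depthwise conv + pointwise conv, stride 2 halves H and W."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Depthwise convolution: one filter per input channel, stride 2 for downsampling.
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=2, padding=1, groups=in_channels, bias=False)
        # Pointwise 1x1 convolution mixes channels and changes the channel count.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Example: a 1 x 3 x 256 x 256 target image becomes a 1 x 32 x 128 x 128 feature map.
x = torch.randn(1, 3, 256, 256)
feat = DepthwiseSeparableDown(3, 32)(x)  # torch.Size([1, 32, 128, 128])
```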
And 230, classifying and identifying the target image based on the image characteristic representation to obtain an image classification result corresponding to the target image.
Wherein the image classification result is used for indicating the inclusion condition of the specified image content in the target image.
Illustratively, after the image feature representation is obtained, the target image is classified and identified based on the image feature representation to predict whether specified image content exists in the target image, so as to obtain an image classification result. Namely: the image classification result includes: the target image includes specified image content, and the target image does not include specified image content.
In some embodiments, the image feature representation is input into a classification network, and classification prediction is performed on it by the classification network. Schematically, the classification network is composed of a global pooling layer and a fully connected layer: the global pooling layer reduces the dimension of the image feature representation, and the dimension-changed feature representation is then fed into the fully connected layer, which integrates it and outputs the image classification result for the target image.
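A minimal sketch of such a classification network (global pooling followed by a fully connected layer producing two scores) is shown below; the channel count is an assumption for illustration.

```python
import torch
import torch.nn as nn

class ClassPredictionBranch(nn.Module):
    """Global average pooling followed by a fully connected layer that outputs two scores:
    'contains the specified image content' vs. 'does not contain it'."""
    def __init__(self, in_channels: int, num_classes: int = 2):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # 1 x C x H x W -> 1 x C x 1 x 1
        self.fc = nn.Linear(in_channels, num_classes)  # 1 x C -> 1 x 2

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        pooled = self.pool(image_features).flatten(1)  # 1 x C
        return self.fc(pooled)                         # 1 x 2 candidate classification scores

# Example: an image feature representation of size 1 x 64 x 16 x 16 -> two classification scores.
scores = ClassPredictionBranch(64)(torch.randn(1, 64, 16, 16))  # torch.Size([1, 2])
```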
And step 240, performing dimension conversion on the image classification result to obtain classification characteristic representation.
The dimension conversion is used for fusing the image classification result and the image characteristic representation.
In an alternative embodiment, the image classification result is dimension-converted based on the feature dimensions of the image feature representation, resulting in a classification feature representation having the same dimensions as the image feature representation.
Illustratively, the feature dimension of the image feature representation is determined, and the received image classification result is converted into a classification feature representation identical to the feature dimension of the image feature representation based on the feature dimension of the image feature representation.
For example: the image feature represents a size of 1×c×h×w, wherein 1 is used to indicate the target image (one image); c is used for indicating the number of channels represented by the image characteristics; h is used to indicate the high of the image feature representation; w is used to indicate the width of the image feature representation. And converting the image classification result into a three-dimensional form based on the dimension of the image feature representation being the three-dimensional form (C.times.H.times.W), wherein the image classification result is a prediction result of whether specified image content exists in the image and is expressed as 0 or 1.
Illustratively, the image classification result is described as 1, and the dimension conversion is performed on the image classification result, so that the obtained classification features are expressed as follows: the classification feature representation is a feature representation corresponding to the target image (first 1) of the current analysis, and the high-speed width of the classification feature representation is the same as the high-speed width of the image feature representation, so that the feature fusion of the image feature representation and the classification feature representation with the same dimension is facilitated.
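A minimal sketch of this dimension conversion, assuming the image classification result is a single scalar that is broadcast to the height and width of the image feature representation:

```python
import torch

def classification_to_feature(class_score: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
    """Convert a per-image classification result (shape 1) into a 1 x 1 x H x W map
    whose height and width match the image feature representation (1 x C x H x W)."""
    n, _, h, w = image_features.shape
    return class_score.view(n, 1, 1, 1).expand(n, 1, h, w)

image_features = torch.randn(1, 64, 16, 16)
class_score = torch.tensor([1.0])  # e.g. "contains the specified image content"
cls_feat = classification_to_feature(class_score, image_features)  # torch.Size([1, 1, 16, 16])
```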
And step 250, fusing the image characteristic representation and the classification characteristic representation to obtain a fused characteristic representation.
Illustratively, after the classification feature representation corresponding to the image classification result is obtained, the image feature representation and the classification feature representation are fused, so that a feature representation fusion process is realized, and a fusion feature representation corresponding to the target image is obtained, wherein the fusion feature representation not only comprises image information (information contained in the image feature representation) corresponding to the target image, but also comprises classification information (classification information contained in the classification feature representation) corresponding to the target image.
In an alternative embodiment, the image feature representation is concatenated with the classification feature representation along the channel dimension to obtain the fusion feature representation.
Illustratively, after the image feature representation and the classification feature representation are obtained, the classification feature representation and the image feature representation are spliced in the channel dimension, so that a fusion feature representation corresponding to the target image is obtained.
For example: the size of the classification characteristic representation is: and 1 x H x W, wherein the size of the image feature representation is 1 x C x H x W, and the classification feature representation and the image feature representation are spliced in the channel dimension along the channel dimension, so as to obtain a fusion feature representation, and the size of the fusion feature representation is 1 x (c+1) x H x W.
And 260, dividing the specified image content in the target image based on the fusion characteristic representation to obtain a specified content division result.
Illustratively, after the fusion feature representation containing the image information and category information of the target image is obtained, upsampling is performed on the fusion feature representation to adjust its resolution, and the specified image content in the target image is segmented based on the resulting output feature representation. In this way, taking into account whether the specified image content exists in the target image, both the existence and the position of the specified image content in the target image are determined, and the specified content segmentation result is obtained.
Optionally, when the specified image content exists in the target image, taking the position result of the specified image content in the target image as a specified content segmentation result corresponding to the target image; or when the specified image content does not exist in the target image, obtaining a prediction result that the specified image content does not exist in the target image, and taking the prediction result as a specified content segmentation result corresponding to the target image.
In an alternative embodiment, the image feature representation corresponding to the target image is obtained by an encoder, and the fusion feature representation is decoded by a decoder, so that analysis prediction is performed on the specified image content in the target image based on the fusion feature representation.
Illustratively, the network structure in the decoder corresponds to the network structure in the encoder, and is composed of a plurality of convolution layers, and the plurality of convolution layers perform upsampling processing on the fusion feature representation by using a bilinear interpolation method to obtain a specified content segmentation result with the same resolution as the target image.
For example: the specified content division result exists in the form of an indication image for indicating the presence of the specified image content in the target image, the indication image having the same resolution as the target image.
In some embodiments, the specified content segmentation result is an image region resulting from segmentation of the specified image content from the target image. For example: separating the specified image content from the target image, realizing the matting effect on the specified image content, taking the specified image content in the obtained target image as the specified content segmentation result and the like.
In an alternative embodiment, taking portrait content as the specified image content, the portrait content in the target image is separated by the image content segmentation method, so that the portrait content is separated in different forms from the background content (the content other than the portrait content) in the target image, and an image that prominently shows the portrait content is used as the specified content segmentation result, and the like.
It should be noted that the above is only an illustrative example, and the embodiments of the present application are not limited thereto.
In summary, after feature extraction is performed on the obtained target image to obtain an image feature representation, the target image is classified and identified based on the image feature representation, and the image classification result is used as prior information; when the specified image content is segmented from the target image, the classification feature representation obtained by dimension conversion of the image classification result is fused with the image feature representation to obtain a fusion feature representation containing both image information and classification information. The fusion feature representation focuses attention on the image information of the channel corresponding to the image classification result, so the specified image content can be separated from the target image more accurately; moreover, the classification information in the fusion feature representation effectively prevents easily confused image content in the target image from misleading the image segmentation process, which fully improves the accuracy of the specified content segmentation result.
In an alternative embodiment, the fusion feature representation is processed using a skip connection method that combines the encoding layers which produce the image feature representation with the decoding layers which process the fusion feature representation. Illustratively, as shown in FIG. 3, the embodiment shown in FIG. 2 described above may also be implemented as steps 310 to 380 described below.
Step 310, a target image to be subjected to specified image content segmentation is acquired.
Illustratively, the specified image content is preselected image content. For example: portrait content is preselected as the specified image content; or a building is preselected as the specified image content, and the like.
The target image is an image obtained by any means. Schematically, obtaining a target image through an image acquisition device; alternatively, the target image or the like is acquired from a pre-stored image library.
Step 320, the target image is passed through multiple convolution layers, resulting in a first processed feature representation of the output of each convolution layer.
Optionally, after obtaining the target image, passing the target image through multiple convolution layers to perform reduction adjustment on the resolution of the target image, and obtaining an intermediate feature representation output by each convolution layer as the first processing feature representation.
And 330, performing downsampling processing on the first processing feature representation based on the multi-layer convolution layers to obtain an image feature representation corresponding to the target image.
Illustratively, the first processing feature representation output by the previous convolution layer is downsampled by the plurality of convolution layers in the encoder, so that the resolution corresponding to the first processing feature representation output by the previous convolution layer is reduced until the image feature representation corresponding to the target image is output by the last convolution layer in the plurality of convolution layers.
Schematically, FIG. 4 shows a schematic diagram of the segmentation process performed on the target image. After the target image is obtained, it is passed through the encoder, and downsampling is performed on it by means of a plurality of convolution layers in the encoder. The convolution layers are used to reduce the resolution of the target image, and each convolution layer outputs a first processed feature representation.
For example: the target image is input into the first convolution layer 411 in the encoder to reduce the resolution of the target image by half (1/2), and the first processed feature representation output by the first convolution layer is obtained; then, the first processed feature representation output by the first convolution layer is input into the second convolution layer 412, so that its resolution is reduced by half again (1/2 × 1/2 = 1/4), and the first processed feature representation output by the second convolution layer 412 is obtained; then, the first processed feature representation output by the second convolution layer 412 is input into the third convolution layer 413, so that its resolution is reduced by half again (1/4 × 1/2 = 1/8), and the first processed feature representation output by the third convolution layer 413 is obtained; then, the first processed feature representation output by the third convolution layer 413 is input into the fourth convolution layer 414, so that its resolution is reduced by half again (1/8 × 1/2 = 1/16), and the first processed feature representation output by the fourth convolution layer 414 is obtained, and so on.
In an alternative embodiment, the encoder and the decoder are connected using a skip connection method, so that some of the intermediate features obtained by the encoder are directly fused with the feature representations of the same resolution in the decoder.
Optionally, the first processed feature representation of the output of each convolution layer is mapped onto a deconvolution layer having the same resolution as each convolution layer, the first processed feature representation being used for feature fusion with the second processed feature representation obtained on the deconvolution layer.
Illustratively, the encoder that obtains the image feature representation and the decoder that processes the fusion feature representation have corresponding network structures, such as: the encoder comprises four convolution layers, and is used for performing half-reduction on the resolution of the target image; correspondingly, the decoder comprises four deconvolution layers for doubly expanding the resolution of the fusion feature representation. The number of convolution and deconvolution layers is merely illustrative, and embodiments of the present application are not limited in this respect.
Illustratively, as shown in FIG. 4, the first convolution layer 411 in the encoder is connected to the fourth deconvolution layer 424 in the decoder, so that the first processed feature representation obtained by the first convolution layer 411 processing the target image is mapped to the fourth deconvolution layer 424; similarly, the second convolution layer 412 in the encoder is connected to the third deconvolution layer 423 in the decoder, so that the first processed feature representation obtained by the second convolution layer 412 processing the output of the first convolution layer 411 is mapped to the third deconvolution layer 423; similarly, the third convolution layer 413 in the encoder is connected to the second deconvolution layer 422 in the decoder, so that the first processed feature representation obtained by the third convolution layer 413 processing the output of the second convolution layer 412 is mapped to the second deconvolution layer 422; similarly, the fourth convolution layer 414 in the encoder is connected to the first deconvolution layer 421 in the decoder, so that the first processed feature representation obtained by the fourth convolution layer 414 processing the output of the third convolution layer 413 is mapped to the first deconvolution layer 421.
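Purely as a hedged sketch of such an encoder and decoder joined by skip connections, the wiring could look like the following; the four-stage depth, channel widths and concatenation-based fusion are assumptions for illustration, not details taken from this disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv(in_c: int, out_c: int, stride: int = 1) -> nn.Sequential:
    return nn.Sequential(nn.Conv2d(in_c, out_c, 3, stride=stride, padding=1), nn.ReLU(inplace=True))

class EncoderDecoderWithSkips(nn.Module):
    """Four downsampling stages and four upsampling stages; each encoder output is mapped
    to the decoder stage that works at the same resolution and fused by concatenation."""
    def __init__(self):
        super().__init__()
        self.enc = nn.ModuleList([conv(3, 16, 2), conv(16, 32, 2), conv(32, 64, 2), conv(64, 128, 2)])
        # Each decoder stage receives the upsampled features concatenated with the matching skip features.
        self.dec = nn.ModuleList([conv(128 + 64, 64), conv(64 + 32, 32), conv(32 + 16, 16), conv(16 + 3, 8)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = [x]                       # the input itself feeds the last decoder stage
        for stage in self.enc:
            x = stage(x)
            skips.append(x)
        skips = skips[:-1][::-1]          # deepest usable skip first
        for stage, skip in zip(self.dec, skips):
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            x = stage(torch.cat([x, skip], dim=1))
        return x

out = EncoderDecoderWithSkips()(torch.randn(1, 3, 256, 256))
print(out.shape)  # torch.Size([1, 8, 256, 256])
```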
And step 340, classifying and identifying the target image based on the image characteristic representation to obtain an image classification result corresponding to the target image.
Wherein the image classification result is used for indicating the inclusion condition of the specified image content in the target image.
Optionally, after the image feature representation is obtained, the target image is classified and identified based on the image feature representation to predict whether specified image content exists in the target image, so as to obtain an image classification result. Namely: the image classification result includes: the target image includes specified image content, and the target image does not include specified image content.
Illustratively, as shown in FIG. 4, after the image feature representation is obtained, the image feature representation is passed through a category prediction branch 430 to predict whether the specified image content is included in the target image based on the image feature representation.
For example: the specified image content is preselected portrait content, and the image content other than the portrait content is background content; predicting whether the target image includes the specified image content means predicting whether the target image includes portrait content, that is, whether the target image is a pure background image (an image with only background content and no portrait content).
In an alternative embodiment, the target image is classified and identified based on the image feature representation, and a candidate classification result corresponding to the target image is obtained.
Wherein the candidate classification result comprises a first probability value specifying that the image content is present in the target image and a second probability value specifying that the image content is not present in the target image.
Illustratively, as shown in FIG. 4, the class prediction branch 430 structurally includes a global pooling layer and a fully connected layer. The class prediction branch 430 receives the image feature representation of size 1×C×H×W output by the encoder and takes it as the network input; the global pooling layer in the class prediction branch 430 changes its size to 1×C×1×1, the resulting intermediate feature representation is reshaped to 1×C, and this intermediate feature representation is sent to the fully connected layer in the class prediction branch 430 to obtain a candidate classification result of size 1×2.
Wherein, the two values in the candidate classification result respectively represent the scores of the target image containing the appointed image content (such as portrait content) and the target image not containing the appointed image content. Such as: the candidate classification result includes a first probability value for the target image including the specified image content and a second probability value for the target image not including the specified image content.
In an alternative embodiment, the classification result corresponding to the larger of the first probability value and the second probability value is taken as the image classification result.
Illustratively, if the first probability value is greater than the second probability value, the probability that the specified image content is included in the target image is greater than the probability that the specified image content is not included in the target image, i.e.: the target image has a higher possibility that the designated image content exists; or if the first probability value is smaller than the second probability value, the probability of containing the specified image content in the representative target image is smaller than the probability of not containing the specified image content in the target image, namely: there is a greater likelihood that the specified image content is not present in the target image.
Optionally, if the first probability value is greater than the second probability value, taking a result of the specified image content contained in the target image represented by the first probability value as an image classification result; and if the second probability value is larger than the first probability value, taking a result which does not contain the specified image content in the target image represented by the second probability value as an image classification result.
And step 350, performing dimension conversion on the image classification result to obtain classification characteristic representation.
The dimension conversion is used to enable the image classification result to be fused with the image feature representation; that is, the dimension conversion is performed before the fusion.
In an alternative embodiment, the image classification result is dimension-converted based on the feature dimensions of the image feature representation, resulting in a classification feature representation having the same dimensions as the image feature representation.
Illustratively, the feature dimension of the image feature representation is determined, and the received image classification result is converted into a classification feature representation identical to the feature dimension of the image feature representation based on the feature dimension of the image feature representation.
In some embodiments, after the image classification result is obtained, a matrix channel corresponding to the image classification result is determined, and dimension conversion is performed on the image classification result.
Schematically, if the first probability value is greater than the second probability value, taking a result of the specified image content contained in the target image represented by the first probability value as an image classification result, determining a first matrix channel for obtaining the first probability value, converting the first matrix channel into a matrix channel with the same dimension as that of a matrix corresponding to the image feature representation through dimension conversion, and obtaining classification feature representation with the same dimension as that of the image feature representation by means of the matrix channel;
or if the second probability value is greater than the first probability value, taking a result which does not contain the designated image content in the target image represented by the second probability value as an image classification result, determining a second matrix channel for obtaining the second probability value, converting the second matrix channel into a matrix channel with the same dimension as the matrix corresponding to the image feature representation through dimension conversion, and obtaining the classification feature representation with the same dimension as the image feature representation by means of the matrix channel.
Step 360, fusing the image feature representation with the classification feature representation to obtain a fused feature representation.
Illustratively, when the image feature representation and the classification feature representation are fused, the channels of the image feature representation and the corresponding channel of the classification feature representation are concatenated along the channel dimension to obtain the fusion feature representation.
For example: when the first probability value is larger than the second probability value, determining a first matrix channel for obtaining the first probability value, performing dimension conversion on the first matrix channel based on the feature matrix dimension corresponding to the image feature representation, so as to obtain a matrix channel with the same matrix dimension as the feature matrix dimension, and performing channel splicing on the classification feature representation obtained by the matrix channel and the image feature representation to obtain a fusion feature representation; or when the second probability value is larger than the first probability value, determining a second matrix channel for obtaining the second probability value, performing dimension conversion on the second matrix channel based on the feature matrix dimension corresponding to the image feature representation, so as to obtain a matrix channel with the same matrix dimension as the feature matrix dimension, and performing channel splicing on the classification feature representation obtained by the matrix channel and the image feature representation to obtain the fusion feature representation.
And 370, performing multi-layer deconvolution on the fusion characteristic representation to perform up-sampling processing on the fusion characteristic representation to obtain an output characteristic representation.
Illustratively, an image feature representation corresponding to the target image is obtained by an encoder, and the fused feature representation is decoded by a decoder corresponding to the encoder.
Optionally, the decoder includes a plurality of deconvolution layers, the plurality of deconvolution layers corresponding to the convolution layers in the encoder, and the method is used for performing resolution amplification processing on the fusion characteristic representation after performing resolution reduction processing on the target image, so as to obtain a processed image with the same resolution as the target image.
Illustratively, as shown in fig. 4, after the class prediction branch 430 outputs the image classification result, the fused feature representation obtained by fusing the classification feature representation corresponding to the image classification result and the image feature representation output by the encoder is sent to the decoder, and the fused feature representation is up-sampled by a plurality of deconvolution layers in the decoder.
In an alternative embodiment, the fused feature representation is passed through a first deconvolution layer to obtain a second processed feature representation.
Illustratively, the deconvolution layer that first receives the fusion feature representation after it is input into the decoder (for example, the first deconvolution layer 421 shown in FIG. 4) is used as the first deconvolution layer, and the intermediate feature representation obtained after the first deconvolution layer performs up-sampling processing on the fusion feature representation is used as the second processed feature representation.
Optionally, receiving a first processed feature representation of a first convolutional layer map having the same resolution as the first deconvolution layer; and fusing the first processing characteristic representation with the second processing characteristic representation, and performing up-sampling processing through a plurality of subsequent deconvolution layers to obtain an output characteristic representation.
Illustratively, based on the skip connection method, each deconvolution layer in the decoder receives the first processed feature representation mapped from the convolution layer with the same resolution in the encoder. After the fusion feature representation is subjected to up-sampling processing to obtain a second processed feature representation, the second processed feature representation and the received first processed feature representation are subjected to feature fusion processing; the feature representation obtained after fusion is passed to the subsequent deconvolution layer, and the up-sampling processing and feature fusion processing are repeated until the output feature representation is output by the last deconvolution layer in the decoder (such as the fourth deconvolution layer 424 shown in FIG. 4).
Illustratively, as shown in FIG. 4, based on the skip connection method, the first deconvolution layer 421 receives the first processed feature representation mapped by the fourth convolution layer 414 (i.e., the image feature representation described above); the second deconvolution layer 422 receives the first processed feature representation mapped by the third convolution layer 413; the third deconvolution layer 423 receives the first processed feature representation mapped by the second convolution layer 412; and the fourth deconvolution layer 424 receives the first processed feature representation mapped by the first convolution layer 411.
The first deconvolution layer 421 upsamples the fused feature representation to obtain a second processed feature representation corresponding to the first deconvolution layer 421, performs feature fusion processing on the received first processed feature representation mapped by the fourth convolution layer 414 and the second processed feature representation generated by the first deconvolution layer 421, and inputs the fused feature representation into the second deconvolution layer 422;
the second deconvolution layer 422 performs up-sampling processing on the fused feature representation to obtain a second processed feature representation corresponding to the second deconvolution layer 422, performs feature fusion processing on the received first processed feature representation mapped by the third convolution layer 413 and the second processed feature representation generated by the second deconvolution layer 422, and inputs the fused feature representation into the third deconvolution layer 423;

The third deconvolution layer 423 performs up-sampling processing on the fused feature representation to obtain a second processed feature representation corresponding to the third deconvolution layer 423, performs feature fusion processing on the received first processed feature representation mapped by the second convolution layer 412 and the second processed feature representation generated by the third deconvolution layer 423, and inputs the fused feature representation into the fourth deconvolution layer 424;

the fourth deconvolution layer 424 performs up-sampling processing on the fused feature representation to obtain a second processed feature representation corresponding to the fourth deconvolution layer 424, and performs feature fusion processing on the received first processed feature representation mapped by the first convolution layer 411 and the second processed feature representation generated by the fourth deconvolution layer 424, so as to obtain the output feature representation output by the fourth deconvolution layer 424.
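Illustratively, the decoder described above can be sketched as a stack of deconvolution layers with skip connections to the encoder feature maps. The following is a minimal PyTorch-style sketch; the class name, channel counts, and the use of channel concatenation followed by a 1×1 convolution for feature fusion are assumptions made for illustration only, not the exact structure of the embodiment.

```python
import torch
import torch.nn as nn

class SkipDecoder(nn.Module):
    """Decoder sketch: each deconvolution layer up-samples the incoming feature
    representation and fuses it with the encoder feature map of the same resolution."""

    def __init__(self, channels=(256, 128, 64, 32)):
        super().__init__()
        # One deconvolution (transposed convolution) layer per stage, each doubling the resolution.
        self.deconvs = nn.ModuleList([
            nn.ConvTranspose2d(c, c // 2, kernel_size=2, stride=2) for c in channels
        ])
        # 1x1 convolutions that fuse the concatenated skip feature and the up-sampled feature.
        self.fuses = nn.ModuleList([
            nn.Conv2d(c, c // 2, kernel_size=1) for c in channels
        ])

    def forward(self, fused_feature, encoder_features):
        # encoder_features: first processed feature representations, ordered from the
        # deepest convolution layer (lowest resolution) to the shallowest one.
        x = fused_feature
        for deconv, fuse, skip in zip(self.deconvs, self.fuses, encoder_features):
            x = deconv(x)                    # up-sampling: second processed feature representation
            x = torch.cat([skip, x], dim=1)  # skip connection: fuse with the first processed feature representation
            x = fuse(x)
        return x                             # output feature representation

# Assumed shapes: the fused feature is 256x8x8 and each skip map has half the channels of its stage.
fused = torch.randn(1, 256, 8, 8)
skips = [torch.randn(1, c, s, s) for c, s in zip((128, 64, 32, 16), (16, 32, 64, 128))]
print(SkipDecoder()(fused, skips).shape)     # torch.Size([1, 16, 128, 128])
```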
It should be noted that the above is only an illustrative example, and the embodiments of the present application are not limited thereto.
Step 380, predicting the specified image area of the specified image content in the target image based on the output characteristic representation, and taking the result of indicating the specified image area as the specified content segmentation result.
Illustratively, after the fusion feature representation containing the image information and the category information of the target image is obtained, up-sampling processing is performed on the fusion feature representation to adjust its resolution, and the specified image content in the target image is segmented based on the resulting output feature representation. In this way, the presence and position of the specified image content in the target image are determined while taking into account whether the specified image content exists in the target image at all, and the specified content segmentation result is obtained.
In an alternative embodiment, the region identification of the specified image content is performed on the target image based on the output characteristic representation, so as to obtain the prediction probability that each pixel point in the target image is the pixel point of the specified image content.
Optionally, after the output feature representation is obtained by the decoder, probability prediction is performed on each pixel in the target image based on the output feature representation to predict whether each pixel belongs to a pixel of the specified image content, so as to obtain a prediction probability corresponding to each pixel.
In an alternative embodiment, the specified image area corresponding to the specified image content in the target image is separated based on the prediction probability, so as to obtain the specified content segmentation result.
Illustratively, after the prediction probability corresponding to each pixel point is obtained, each pixel point in the target image is subjected to distinguishing processing based on the prediction probability, so that the designated image content is distinguished from the rest of image contents in the target image, and the designated image area corresponding to the designated image content is separated from the target image.
In some embodiments, the specified pixel point is assigned the first pixel value in response to the prediction probability corresponding to the specified pixel point reaching a preset probability threshold.

The preset probability threshold is a probability threshold set in advance, and the specified pixel point is any one of the plurality of pixel points in the target image.
Optionally, after obtaining the prediction probability corresponding to each pixel point, the prediction probability corresponding to each pixel point is compared with the preset probability threshold, so as to perform distinguishing processing on each pixel point. The following explanation takes the distinguishing processing of a specified pixel point as an example.

Illustratively, after the prediction probability corresponding to a specified pixel point is obtained, the prediction probability is compared with the preset probability threshold, and when the prediction probability reaches the preset probability threshold, the specified pixel point is assigned a preset first pixel value. For example: the preset probability threshold is 0.5 and the prediction probability corresponding to the specified pixel point is 0.8, that is, the prediction probability reaches the preset probability threshold, and the specified pixel point is assigned the first pixel value, for example: the specified pixel point is assigned a value of 255, so that the specified pixel point is displayed as white.
In some embodiments, in response to the predicted probability corresponding to the specified pixel not reaching the preset probability threshold, the specified pixel is assigned with the second pixel value.
Illustratively, when the prediction probability corresponding to the specified pixel point does not reach the preset probability threshold, the specified pixel point is assigned a preset second pixel value. For example: the preset probability threshold is 0.5 and the prediction probability corresponding to the specified pixel point is 0.35, that is, the prediction probability does not reach the preset probability threshold, and the specified pixel point is assigned the second pixel value, for example: the specified pixel point is assigned a value of 0, so that the specified pixel point is displayed as black.
It is noted that the first pixel value and the second pixel value are used for distinguishing the specified image content from the rest of the image content in the target image; when the first pixel value and the second pixel value are set, the intention is to distinguish the specified image content from the rest of the image content through a clearly different presentation. Therefore, setting the first pixel value to 255 and the second pixel value to 0 is merely an illustrative example; the first pixel value may also be 0 and the second pixel value 255, or the first pixel value may be set to 246 and the second pixel value to 2, and so on. That is: the first pixel value and the second pixel value are set with a relatively large numerical difference. The above is merely illustrative, and the embodiments of the present application are not limited in this respect.
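Illustratively, the differential pixel value assignment described above amounts to a simple thresholding step. The following is a minimal sketch; the threshold of 0.5 and the pixel values 255/0 follow the example in the text, while the function name and array shapes are assumptions for illustration.

```python
import numpy as np

def binarize_prediction(prob_map, threshold=0.5, first_value=255, second_value=0):
    """Assign the first pixel value to pixels whose prediction probability reaches the
    preset probability threshold, and the second pixel value to the remaining pixels."""
    return np.where(prob_map >= threshold, first_value, second_value).astype(np.uint8)

# Example: 0.8 reaches the 0.5 threshold (displayed as white), 0.35 does not (displayed as black).
probs = np.array([[0.8, 0.35],
                  [0.6, 0.10]])
print(binarize_prediction(probs))  # [[255   0]
                                   #  [255   0]]
```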
In some embodiments, an image area where a plurality of pixel points corresponding to the first pixel value are located is taken as a designated image area, and a designated content segmentation result is obtained.
Illustratively, the pixel point assigned to the first pixel value is used for indicating that the prediction probability corresponding to the pixel point reaches a preset probability threshold, that is: the pixel point assigned as the first pixel value has a high probability of belonging to the pixel point corresponding to the specified image content.
Optionally, after the above-mentioned differential pixel value assignment process is performed on the multiple pixel points in the target image, a pixel value corresponding to each pixel point is obtained, where a part of the pixel points have a first pixel value and a part of the pixel points have a second pixel value.
Illustratively, a plurality of pixel points corresponding to the first pixel value are determined, an image area where the plurality of pixel points exist is determined, and the image area is taken as a specified image area containing specified image content.
In an alternative embodiment, the image assigned by the different pixel values is taken as the specified content segmentation result.
Illustratively, by designating the content division result, a designated image area where the pixel point assigned to the first pixel value is located and the remaining image areas where the pixel point assigned to the second pixel value is located may be determined, so that the designated image content is distinguished from the remaining image content in the designated content division result corresponding to the target image.
Schematically, as shown in fig. 5, a target image 510 is processed by the image content segmentation method described above. The target image 510 is an image obtained by photographing a woman, and the specified image content is the preset portrait content; processing the target image 510 with the image content segmentation method means separating the portrait content from the background content (the content other than the portrait content in the target image) in the target image 510.
For example: for the target image 510, the image content segmentation method combines the image information and the classification information and performs differential pixel value assignment on different pixel points in the target image, so that the first pixel value (for example, set to 255) is used to assign the pixel points with larger prediction probability and the second pixel value (for example, set to 0) is used to assign the pixel points with smaller prediction probability. The pixel points assigned the first pixel value represent the portrait area corresponding to the portrait content, and the pixel points assigned the second pixel value represent the background area corresponding to the background content, thereby obtaining the specified content segmentation result 520. The white area, i.e., the area covered by the pixel points with the first pixel value of 255, is the portrait area corresponding to the portrait content; the black area, i.e., the area covered by the pixel points with the second pixel value of 0, is the background area corresponding to the background content, so that the portrait content and the background content are effectively distinguished by the specified content segmentation result 520.
Similarly, as shown in fig. 6, a target image 610 is processed by the image content segmentation method described above. The target image 610 is an image obtained by photographing a piece of clothing on a seat, and the specified image content is the preset portrait content; processing the target image 610 with the image content segmentation method means separating the portrait content from the background content (the content other than the portrait content in the target image) in the target image 610.
For example: although the human eye can determine that the target image 610 does not contain portrait content, if only the image information is considered, the contour of the clothing on the seat is very similar to the shape of a human body, so the target image 610 is easily misjudged as an image containing portrait content, causing errors in the image segmentation process. With the image content segmentation method, after the image feature representation corresponding to the target image 610 is obtained, the target image 610 is classified based on the image feature representation, and the classification feature representation corresponding to the classification information is fused with the image feature representation corresponding to the image information, so that the differential pixel value assignment for different pixel points in the target image is performed based on the fusion feature representation.
If the prediction probability corresponding to each pixel point in the target image 610 is smaller than the preset probability threshold, a second pixel value is used to assign a value to each pixel point, for example: the second pixel value is set to 0, and then the pixel points corresponding to the second pixel value are all the pixel points in the target image 610, that is: the target image 610 is presented as a black image, which is the specified content segmentation result 620 obtained by performing image content segmentation on the target image 610.
In an alternative embodiment, the specified image area corresponding to the pixel point having the first pixel value is separated from the target image, and the separated specified image area is used as the specified content segmentation result.
It should be noted that the above is only an illustrative example, and the embodiments of the present application are not limited thereto.
In summary, the fused feature representation obtained after feature extraction and classification recognition of the target image focuses on the image information of the channel corresponding to the image classification result, so that the designated image content can be separated from the target image more accurately, misguidance of the image content with higher confusion in the target image on the image segmentation process can be effectively avoided by means of the classification information in the fused feature representation, and the accuracy of the designated content segmentation result is fully improved.
In an embodiment of the application, a jump connection between an encoder and a decoder is described. In the process of obtaining the image characteristic representation, the first processing characteristic representation obtained by the convolution layer in the encoder is mapped to the deconvolution layer corresponding to the decoder with the same resolution, so that the problems of gradient disappearance and network forgetting are reduced when the fusion characteristic representation is subjected to up-sampling processing, and the accuracy of image segmentation of a target image is improved.
In an alternative embodiment, the image segmentation process is performed on the target image through a target image segmentation model, wherein the target image segmentation model comprises a classification prior network, and the classification prior network is used for providing assistance for the content segmentation process of the target image segmentation model. Illustratively, as shown in fig. 7, the training process of training the image segmentation model, thereby obtaining the target image segmentation model for processing the target image, may be implemented as the following steps 710 to 740.
Step 710, the sample image is input into the image segmentation model.
The image segmentation model is obtained through training of a first sample image containing specified image content.
Schematically, a candidate segmentation model is obtained, the candidate segmentation model being a general segmentation model with a certain image content segmentation capability. The candidate segmentation model is trained with first sample images containing the specified image content until a preset training objective is reached, and the trained candidate segmentation model is taken as the content segmentation model. For example: when the loss value in the training process of the candidate segmentation model no longer decreases, the candidate segmentation model corresponding to that loss value is taken as the content segmentation model.
Optionally, the image content is designated as pre-selected portrait content, and the first sample image is a sample image containing portrait content. For example: the first sample image contains at least one clearly visible portrait subject.
In an alternative embodiment, after the candidate segmentation model is trained, the content segmentation model is obtained, and a classification prior network is added to the content segmentation model to obtain the image segmentation model. That is: after the candidate segmentation model is trained with first sample images containing the specified image content to obtain the content segmentation model, a classification prior network is added to the content segmentation model, yielding the image segmentation model that serves as the prototype of the target image segmentation model used for processing the target image.
Schematically, as shown in fig. 8, a training process of an image segmentation model including a classified prior network is schematically shown.
Wherein, the plurality of sample images 810 are input into an image segmentation model, and the plurality of sample images 810 are analyzed by the image segmentation model, respectively.
The sample image 810 includes at least one of a first sample image 811 and a second sample image 812 that does not include the specified image content (the illustration is merely illustrative, and there may be a plurality of first sample images and second sample images). That is: the image segmentation model is comprehensively trained using first sample images containing the specified image content and second sample images not containing the specified image content.
Illustratively, designating the image content as preselected portrait content, and then the first sample image as a sample image containing portrait content; the second sample image is a sample image that does not contain portrait content. Optionally, the first sample image and the second sample image are respectively labeled with a sample label with the same resolution as the corresponding sample image, wherein the first sample image is correspondingly labeled with a first sample label with the same resolution as the first sample image; the second sample image is correspondingly labeled with a second sample label having the same resolution as the second sample image.
In some alternative embodiments, sample labels corresponding to sample images are differentially pixel assigned based on differences in specified image content and remaining image content in the sample images.
Alternatively, taking the specified image content as the pre-selected portrait content as an example, a first sample image in the sample images is a sample image containing portrait content, and a second sample image is a sample image not containing portrait content (containing only background content). When pixel assignment is carried out on the sample label based on the difference between the portrait content and the background content, the portrait area corresponding to the portrait content in the first sample image is determined, and the pixel values of the points in the first sample label corresponding to the portrait area are set to 255; the background area other than the portrait area in the first sample image is determined, and the pixel values of the points in the first sample label corresponding to the background area are set to 0. For example: if the point (x, y) in the first sample image is located within the portrait area, the corresponding point position (x, y) is determined in the first sample label, and the pixel value at the (x, y) position in the first sample label is set to 255.
Similarly, in a second sample image (i.e., a pure background image) that does not contain portrait content, the pixel value of a point in a second sample tag corresponding to the second sample image is set to 0. Namely: the pixel values of all points in the second sample label corresponding to the second sample image are set to 0.
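Illustratively, the construction of such sample labels can be sketched as follows, assuming the portrait region of a first sample image is available as a boolean mask; the pixel values 255 and 0 follow the text, and the function and variable names are illustrative assumptions.

```python
import numpy as np

def make_sample_label(height, width, portrait_mask=None):
    """Build a sample label with the same resolution as the sample image.
    Points inside the portrait area are set to 255; all other points (including every
    point of a pure-background second sample image) keep the value 0."""
    label = np.zeros((height, width), dtype=np.uint8)
    if portrait_mask is not None:      # first sample image: contains portrait content
        label[portrait_mask] = 255
    return label                       # second sample image: label remains all zeros

# A point (x, y) inside the portrait area gets the value 255 in the first sample label.
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
print(make_sample_label(4, 4, mask))
```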
And step 720, carrying out category prediction on the sample image through the classification prior network to obtain a prediction category corresponding to the sample image.
Schematically, as shown in fig. 8, an encoder 820 in the image segmentation model is configured to perform feature extraction on the sample image to obtain a sample feature representation corresponding to the sample image; the sample feature representation is then analyzed by the classification prior network 830 in the image segmentation model, so as to implement the process of performing category prediction on the sample image and obtain the prediction category corresponding to the sample image.
In an alternative embodiment, the classification prior network includes a class information extraction branch and a class prediction branch.
The category information extraction branch is used for analyzing the sample image so as to obtain a sample category label corresponding to the sample image which is analyzed currently.
Schematically, as shown in fig. 8, in the category information extraction branch 831, a sample label corresponding to the sample image that is currently analyzed is determined, a pixel average value of all pixel values in the sample label is calculated, and the pixel average value is compared with a preset pixel threshold value, so as to obtain a sample category label corresponding to the sample image. The sample category label is used for indicating the image category corresponding to the sample image, namely: the sample image belongs to a first sample image containing the specified image content, or the sample image belongs to a second sample image not containing the specified image content.
Alternatively, taking the specified image content as portrait content as an example: if the pixel mean of all pixel values in the sample label corresponding to the sample image being analyzed is smaller than the preset pixel threshold, the sample image is judged to be a second sample image that does not contain portrait content; if the pixel mean of all pixel values in the sample label corresponding to the sample image being analyzed is not smaller than (greater than or equal to) the preset pixel threshold, the sample image is judged to contain portrait content. For example: if the pixel mean of the annotation data corresponding to the currently analyzed sample image (i.e., all pixel values of the sample label annotated for the sample image) is smaller than 1e-5 (1×10^-5), the sample image is a second sample image (pure background image) that does not contain portrait content; otherwise, the sample image is a first sample image containing portrait content.
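Illustratively, the category information extraction branch described above amounts to comparing the pixel mean of the annotation data with a small threshold. A minimal sketch under the 1e-5 threshold mentioned in the text, with 0/255 sample labels as described above; the function name and the 0/1 encoding of the sample category label are illustrative assumptions.

```python
import numpy as np

def extract_sample_class_label(sample_label, pixel_threshold=1e-5):
    """Category information extraction branch (sketch): derive the sample category label
    from the pixel mean of all pixel values in the sample label.
    Returns 0 for a pure-background second sample image and 1 for a first sample image
    containing the specified (portrait) content."""
    pixel_mean = sample_label.astype(np.float64).mean()
    return 0 if pixel_mean < pixel_threshold else 1

print(extract_sample_class_label(np.zeros((8, 8))))   # 0: second sample image, no portrait content
label = np.zeros((8, 8)); label[2:5, 2:5] = 255
print(extract_sample_class_label(label))              # 1: first sample image, contains portrait content
```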
The category prediction branch is used for analyzing the sample image so as to predict whether the sample image analyzed currently contains specified image content.
Illustratively, the image content is designated as the portrait content. As shown in FIG. 8, the class prediction branch 832 structurally includes a global pooling layer and a fully connected layer. The class prediction branch 832 receives the sample feature representation output by the encoder 820, takes the sample feature representation as an input to the class prediction branch 832, and outputs an image classification result corresponding to the sample image, i.e., the image classification result indicates that the class prediction branch 832 determines whether the sample image contains portrait content.
The global pooling layer is configured to change the received sample feature representation with a feature size of 1×C×H×W into a feature size of 1×C×1×1; after a dimensional change, a feature representation of size 1×C is obtained and sent to the fully connected layer, finally yielding a result of size 1×2, where the two values in the result respectively represent the score for judging that the sample image contains portrait content and the score for judging that the sample image does not contain portrait content.
Alternatively, a first score containing portrait content (specified image content) and a second score not containing portrait content in the sample image are compared, and a result of the high score is used as a prediction category after classification prediction is performed on the sample image. For example: if the first score is greater than the second score, taking the result of the inclusion of the portrait content in the sample image as a prediction category, e.g. using a 1; or if the first score is smaller than the second score, the result that the portrait content is not included in the sample image is taken as a prediction category, such as a "0" representation.
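Illustratively, the class prediction branch described above (a global pooling layer followed by a fully connected layer that maps a 1×C×H×W feature to a 1×2 score vector) can be sketched as follows; the channel count, module names, and the ordering of the two scores are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ClassPredictionBranch(nn.Module):
    """Global pooling + fully connected layer: outputs two scores, assumed here to be
    [score for 'does not contain portrait content', score for 'contains portrait content']."""

    def __init__(self, channels=256):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # 1 x C x H x W  ->  1 x C x 1 x 1
        self.fc = nn.Linear(channels, 2)      # 1 x C          ->  1 x 2

    def forward(self, feature):
        x = self.pool(feature).flatten(1)     # dimensional change to 1 x C
        return self.fc(x)                     # 1 x 2 score vector

branch = ClassPredictionBranch()
feature = torch.randn(1, 256, 16, 16)         # sample feature representation output by the encoder
scores = branch(feature)
pred_class = scores.argmax(dim=1)             # 1: contains portrait content, 0: does not (assumed encoding)
print(scores.shape, pred_class)
```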
Step 730, obtaining a class loss value based on the difference between the predicted class and the sample class label corresponding to the sample image.
Illustratively, after obtaining a sample category label corresponding to the prediction category and the sample image, comparing the difference between the prediction category and the sample category label to obtain a category loss value.
Alternatively, as shown in fig. 8, a sample category label corresponding to the sample image is obtained by means of a category information extraction branch 831; the class prediction branch 832 obtains a predicted class after analysis of the sample feature representation, and obtains a class loss value (class supervision loss) based on the sample class label and the predicted class.
Illustratively, a cross entropy loss function is used to calculate a class loss value between the predicted class and the sample class label.
And 740, training the classification priori network by using the classification loss value, and obtaining a trained target image segmentation model.
Illustratively, the class loss value is obtained, and the class prior network is trained by the class loss value until the trained class prior network is obtained.
In an alternative embodiment, the image segmentation model predicts a specified image region corresponding to the specified image content in the sample image.
Illustratively, as shown in fig. 8, after obtaining a sample feature representation corresponding to a sample image, the sample feature representation is fused with a prediction class obtained by classifying and predicting the sample image, and is input to a decoder 840 in an image segmentation model. For example: the prediction class is dimension-converted to obtain a prediction class feature representation, and the prediction class feature representation and the sample feature representation are feature-spliced in the channel dimension, so that the obtained sample fusion feature representation is input to a decoder 840 in the image segmentation model.
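Illustratively, the dimension conversion and channel-dimension splicing just described can be sketched as broadcasting the prediction class into a feature map and concatenating it with the sample feature representation; the exact conversion used by the embodiment is not specified, so this broadcast scheme and the function name are assumptions.

```python
import torch

def fuse_class_with_feature(sample_feature, pred_class):
    """sample_feature: N x C x H x W sample feature representation output by the encoder.
    pred_class: N-element tensor holding the predicted class (e.g., 0 or 1).
    Returns an N x (C + 1) x H x W sample fusion feature representation."""
    n, _, h, w = sample_feature.shape
    class_map = pred_class.float().view(n, 1, 1, 1).expand(n, 1, h, w)   # dimension conversion
    return torch.cat([sample_feature, class_map], dim=1)                 # splice along the channel dimension

fused = fuse_class_with_feature(torch.randn(1, 256, 16, 16), torch.tensor([1]))
print(fused.shape)   # torch.Size([1, 257, 16, 16])
```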
Optionally, the encoder 820 and the decoder 840 in the image segmentation model are connected in a jump connection manner, so that intermediate feature representations obtained by different convolution layers in the encoder 820 can be mapped onto deconvolution layers with resolutions corresponding to the decoder 840, so that when the sample fusion feature representations are subjected to upsampling, the problems of gradient disappearance and network forgetting are reduced, and the accuracy of analyzing the sample images is improved.
The sample fusion feature representation fuses the image information corresponding to the sample image with the classification information obtained after classification prediction, so that the classification result contained in the sample fusion feature representation input into the decoder serves as prior information to assist in carrying out the image content segmentation task on the sample image.
In some embodiments, after analysis of the sample fusion feature representation by the decoder, a segmentation result image is obtained, the segmentation result image including a predicted specified image region corresponding to the specified image content.
Optionally, an image region tag of the sample image annotation is obtained. Wherein the image region tag is used to indicate the region position of the specified image content in the sample image.
A region loss value is acquired based on the difference between the specified image area and the image region label.
Illustratively, the cross entropy loss shown in equation one below is employed as the loss function to obtain the region loss value (the segmentation supervision loss shown in fig. 8):

Equation one: L_seg = -Σ_i [ y_i · log(α_i) + (1 − y_i) · log(1 − α_i) ]

wherein L_seg is used for indicating the region loss value; y_i is used for indicating, from the pre-labeled image region label, whether the point (pixel point) i in the sample image corresponds to the specified image content; α_i is used for indicating the corresponding point (corresponding pixel point) in the predicted specified image area; and i is used for indicating a pixel point in the sample image.
In some embodiments, the image segmentation model is trained with region loss values and class loss values, and a target image segmentation model is obtained.
Illustratively, the decoder is trained by the region loss values; training the classification priori network through the classification loss value, so that the image segmentation model is trained by means of the region loss value and the classification loss value, and a target image segmentation model is obtained.
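Illustratively, the combined use of the two supervision signals during training can be sketched as follows; the simple summation of the two losses, the optimizer step placement, and the assumed model outputs (per-pixel segmentation logits plus a 1×2 class score) are illustrative assumptions rather than the exact training configuration of the embodiment.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, image, region_label, class_label):
    """One training step combining the region loss (segmentation supervision loss) and
    the class loss (classification supervision loss), both based on cross entropy."""
    optimizer.zero_grad()
    seg_logits, class_scores = model(image)       # assumed outputs: N x 1 x H x W logits and N x 2 scores
    region_loss = F.binary_cross_entropy_with_logits(seg_logits, region_label)  # region_label in [0, 1]
    class_loss = F.cross_entropy(class_scores, class_label)                     # class_label: 0 or 1
    loss = region_loss + class_loss               # train decoder and classification prior network together
    loss.backward()
    optimizer.step()
    return loss.item()
```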
It should be noted that the above is only an illustrative example, and the embodiments of the present application are not limited thereto.
In summary, the fused feature representation obtained after feature extraction and classification recognition of the target image focuses on the image information of the channel corresponding to the image classification result, so that the designated image content can be separated from the target image more accurately, misguidance of the image content with higher confusion in the target image on the image segmentation process can be effectively avoided by means of the classification information in the fused feature representation, and the accuracy of the designated content segmentation result is fully improved.
In the embodiment of the application, a target image segmentation model for performing image segmentation on a target image is obtained through training. Inputting a sample image into an image segmentation model to train the image segmentation model, adding a classification priori network into the image segmentation model, and training the classification priori network by means of a class loss value between a predicted class output by the classification priori network and a sample class label corresponding to the sample image; the image segmentation model is trained by means of the region loss value between the appointed image content output by the image segmentation model and the image region label corresponding to the sample image, so that the image segmentation model is comprehensively trained from two aspects of the image classification process and the image prediction process, the target image segmentation model obtained through training has good image segmentation capability and good image category prediction capability, and the appointed image content can be separated from the target image more accurately.
In an alternative embodiment, the above image content segmentation method may be regarded as a classification-prior-based background mis-segmentation suppression method. Taking the specified image content as portrait content, the image segmentation model applied to the portrait segmentation field is referred to as a portrait segmentation model, and the algorithm flow of the image content segmentation method is described below. Illustratively, as shown in fig. 9, the embodiment shown in fig. 2 described above may also be implemented as steps 910 through 970 below.
Step 910, conventional segmentation training data and plain background training data are prepared.
Wherein the conventional segmentation training data is used for indicating training data containing portrait content; the plain background training data is used to indicate training data that does not contain portrait content. Namely: the two batches of training data prepared are divided into conventional segmentation training data containing portrait content and pure background training data without portrait content.
Each image in the conventional segmentation training data at least comprises a clear visible portrait main body, the pixel value corresponding to the portrait content in the annotation data corresponding to the conventional segmentation training data is 255, and the pixel value corresponding to the rest part is 0; each image in the pure background training data does not contain a portrait main body, and the pixel value of the labeling data corresponding to the pure background training data is 0.
Step 920, training the portrait segmentation model with conventional segmentation training data.
Illustratively, the conventional segmentation training data containing portrait content is used to train the portrait segmentation model. The portrait segmentation model is a deep learning model; as a real-time segmentation model for the mobile terminal, it has a small amount of computation, high speed and high precision, and adopts a design of stacked depthwise separable convolutions.
The portrait segmentation model includes an encoder and a decoder. The encoder is formed by serially connected depthwise separable convolutions as its basic structure; it receives a color image as input, continuously performs convolution and down-sampling operations, and extracts deep features with small resolution and rich semantic information. The decoder is formed by combining bilinear interpolation up-sampling with convolution; it receives the deep features output by the encoder as input and finally outputs a prediction result with the same resolution as the original input color image.
Structurally, feature fusion is carried out between the encoder and the decoder through skip connections; meanwhile, the depthwise separable convolution structure is used as the basic module of the design, so that the number of parameters and the amount of computation can be greatly reduced while the segmentation effect is ensured, achieving real-time output on the mobile terminal.
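Illustratively, the depthwise separable convolution used as the basic module can be sketched as a depthwise convolution followed by a pointwise (1×1) convolution; the kernel size, normalization and activation choices below are assumptions, since the embodiment does not specify them.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Basic module sketch: depthwise convolution + pointwise convolution, which greatly
    reduces parameters and computation compared with a standard convolution."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1, groups=in_channels)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```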
Namely: during the training of this step 920, conventional segmentation training data is used as the training set and cross entropy loss is used as the segmentation supervision loss for optimization.
Optionally, after the portrait segmentation model is obtained by using the conventional segmentation training data set, the segmentation supervision loss and the preset network structure, the pre-training model at the moment is used as an initialization parameter.
Step 930, adding a classification prior network to the image segmentation model.
Illustratively, after the portrait segmentation model is trained by means of conventional segmentation training data containing portrait content, a classification prior network is added to the trained portrait segmentation model.
The classification prior network mainly comprises a category information extraction branch and a category prediction branch.
The class information extraction branch first calculates the pixel mean of the annotation data corresponding to the input image so as to obtain the class label of the input image. Illustratively, a mean value less than 1e-5 is determined to be pure background data without a portrait, whereas a mean value not less than 1e-5 is determined to be conventional training data with a portrait.

The class prediction branch structurally comprises a global pooling layer and a fully connected layer. It receives the deep features output by the encoder as input and outputs a classification result, which indicates whether the network judges that the original input image contains a portrait. The classification result is then fused with the output result of the encoder and sent to the decoder, so that the classification result serves as prior information to assist the segmentation task.

The global pooling changes the received feature of size 1×C×H×W into 1×C×1×1; after the dimensional change, a feature of size 1×C is obtained and sent to the fully connected layer, finally yielding a result of size 1×2, whose two values respectively represent the score for judging that the original input image contains a portrait and the score for judging that it does not.
Step 940, training and optimizing the complete portrait segmentation model by using all training data.
All training data are used to indicate the conventional segmentation training data and plain background data; the complete portrait segmentation model is used to indicate the model structure including the classified prior network.
Illustratively, all the training data are used to train and optimize the complete portrait segmentation model. In addition to the segmentation supervision loss in step 920, a classification supervision loss is added at the same time; the classification supervision loss is consistent in form with the segmentation supervision loss, and cross entropy loss is also adopted as supervision.

Optionally, when the complete portrait segmentation model is trained and optimized with all the training data, the parameter weights of the encoder are fixed, and the parameters of the decoder and the class prediction branch are updated. That is: after the portrait segmentation model is trained with the conventional segmentation training data containing portrait content, the parameter weights of the encoder are fixed; then, when the portrait segmentation model containing the classification prior network is trained with all the training data, the parameters of the decoder and the class prediction branch in the portrait segmentation model are updated.
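Illustratively, fixing the encoder weights while updating the decoder and class prediction branch can be sketched as follows; the submodule attribute names, the optimizer choice and the learning rate are hypothetical and used only to show the freezing step.

```python
import torch

def configure_finetuning(model):
    """Fix the encoder parameter weights; only the decoder and class prediction branch
    parameters are updated during training with all the training data."""
    for p in model.encoder.parameters():            # 'encoder' is a hypothetical submodule name
        p.requires_grad = False
    trainable = list(model.decoder.parameters()) + list(model.class_branch.parameters())
    return torch.optim.Adam(trainable, lr=1e-4)     # optimizer and learning rate are illustrative
```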
Step 950, an input image input by the user is acquired.
After the image segmentation model is trained in steps 910 to 940, the trained image segmentation model is applied, and the image segmentation model includes a classification prior network.
Illustratively, an input image provided by the user is acquired, where the input image may or may not contain portrait content, that is, it may also be a pure background image, etc.
Step 960, forward reasoning predicts the portrait foreground area by using the complete portrait segmentation model.
Illustratively, the trained portrait segmentation model containing the classification prior network performs forward inference on the input image provided by the user, where forward inference indicates that the input image is analyzed layer by layer by means of the network structure of the portrait segmentation model. For example: after the input image is input into the portrait segmentation model, the input image is analyzed based on the network structure of the portrait segmentation model, and the result produced by each upper-layer network is fed into the next-layer network, so that the intermediate analysis results of the input image are obtained in sequence. Such as: the input image is input into the portrait segmentation model containing the classification prior network, a down-sampling process is performed through the networks in the encoder, the result output by the encoder is input into the classification prior network for the classification prediction process, the result output by the classification prior network and the result output by the encoder are fused (the feature fusion process), and the fused result is input into the networks in the decoder for the up-sampling process, so that the result output by the decoder is used as the predicted portrait foreground area.
The portrait foreground area is an image area corresponding to portrait content.
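Illustratively, the forward inference flow described above can be sketched as the following pipeline; the submodule names are hypothetical, and only the data flow (encoder → classification prior network → feature fusion → decoder) follows the text.

```python
import torch

@torch.no_grad()
def predict_portrait_foreground(model, input_image):
    """Forward inference sketch: down-sampling by the encoder, class prediction by the
    classification prior network, feature fusion, then up-sampling by the decoder."""
    deep_feature, skip_features = model.encoder(input_image)   # down-sampling process
    class_scores = model.class_branch(deep_feature)            # classification prediction process
    fused = model.fuse(deep_feature, class_scores)             # feature fusion process
    logits = model.decoder(fused, skip_features)               # up-sampling process
    return torch.sigmoid(logits)                               # predicted portrait foreground probabilities
```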
Step 970, post-processing is performed on the portrait foreground area and the final result is output.
Schematically, after the post-processing process is performed on the portrait foreground area, a final result capable of indicating the portrait foreground area is output.
Optionally, the truncation is performed according to the confidence level of 0.5 to reduce noise and false detection, and a final portrait segmentation probability map is obtained as a final result. As shown in fig. 5 and 6, the pixel value corresponding to the pixel point whose prediction probability (probability of containing portrait content) is 0 to 0.5 is assigned to 0, that is: the non-portrait area excluding portrait content appears black; and assigning the pixel value corresponding to the pixel point with the prediction probability of 0.5 to 1 as 255, namely: the portrait area including the portrait content appears white, wherein the larger the probability value of the prediction probability, the higher the likelihood that this pixel is the pixel corresponding to the portrait content.
In an alternative embodiment, the image content segmentation method described above can be applied to multiple scenes such as a matting scene, a background replacement scene, an image special-effect generation scene, and the like. Illustratively, the application of the image content segmentation method to an image special-effect generation scene is taken as an example.
Image special-effect generation has developed rapidly with image processing algorithms as technical support, and special effects related to the human body have attracted wide attention because of characteristics such as strong playability, differentiated expression and rich content. The portrait segmentation algorithm is an important algorithm scheme that aims to separate the portrait part and the background part in an image and to provide production material for subsequent creative gameplay. As shown in fig. 10, a schematic diagram of a portrait segmentation method is shown, where the portrait content 1010 is an image area separated from the input image provided by the user; the background area 1020 is background content selected by the user based on preference, and may further include special effects added by the user, such as: text effects, symbol effects, etc.
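Illustratively, in such a special-effect scene the specified content segmentation result is typically used as a mask to composite the separated portrait onto the chosen background. A minimal sketch, assuming a 0/255 mask and images of matching resolution; the function name and value ranges are illustrative.

```python
import numpy as np

def composite_portrait(input_image, mask, background):
    """Paste the separated portrait content onto the background selected by the user.
    mask: 0/255 specified content segmentation result with the same resolution as the images."""
    alpha = (mask.astype(np.float32) / 255.0)[..., None]   # 1 inside the portrait area, 0 elsewhere
    return (alpha * input_image + (1.0 - alpha) * background).astype(np.uint8)
```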
Creative gameplay related to the human body is based on secondary creation over the earlier portrait segmentation result, so the requirement on the accuracy of portrait segmentation is high. With the image content segmentation method, mistakenly segmenting out a portrait when no portrait exists in the scene can be avoided to a great extent, so that the production of subsequent invalid materials can be terminated early.
It should be noted that the above is only an illustrative example, and the embodiments of the present application are not limited thereto.
In summary, the fused feature representation obtained after feature extraction and classification recognition of the target image focuses on the image information of the channel corresponding to the image classification result, so that the designated image content can be separated from the target image more accurately, misguidance of the image content with higher confusion in the target image on the image segmentation process can be effectively avoided by means of the classification information in the fused feature representation, and the accuracy of the designated content segmentation result is fully improved.
In the embodiment of the present application, the case where the specified image content is implemented as portrait content is described. The input image provided by the user is segmented by means of the portrait segmentation model, which facilitates the user's further processing of the segmented portrait content, for example: pasting the segmented portrait content onto a region selected by the user; or adding special effects to the obtained portrait content to generate other special-effect images, etc., thereby improving the universality and interest of the application of the image segmentation method.
Fig. 11 is a block diagram showing the structure of an image content dividing apparatus according to an exemplary embodiment of the present application, and as shown in fig. 11, the apparatus includes:
An acquisition module 1110, configured to acquire a target image to be subjected to specified image content segmentation;
the extracting module 1120 is configured to perform feature extraction on the target image to obtain an image feature representation corresponding to the target image;
the classification module 1130 is configured to perform classification and identification on the target image based on the image feature representation, to obtain an image classification result corresponding to the target image, where the image classification result is used to indicate a content of the specified image in the target image;
the conversion module 1140 is configured to perform dimension conversion on the image classification result to obtain a classification feature representation;
a fusion module 1150, configured to fuse the image feature representation with the classification feature representation to obtain a fused feature representation;
and a segmentation module 1160, configured to segment the specified image content in the target image based on the fusion feature representation, to obtain a specified content segmentation result.
In an alternative embodiment, the conversion module 1140 is further configured to dimension convert the image classification result based on a feature dimension of the image feature representation, to obtain the classification feature representation having the same dimension as the image feature representation.
In an alternative embodiment, the fusion module 1150 is further configured to splice different dimensions of the image feature representation with corresponding dimensions of the classification feature representation along a channel dimension hierarchy to obtain the fusion feature representation.
In an alternative embodiment, the segmentation module 1160 is further configured to subject the fused feature representation to a multi-layer deconvolution layer to perform an upsampling process on the fused feature representation to obtain an output feature representation; and predicting a specified image area of the specified image content in the target image based on the output characteristic representation, and taking a result indicating the specified image area as the specified content segmentation result.
In an alternative embodiment, the segmentation module 1160 is further configured to subject the target image to multiple convolution layers to obtain a first processed feature representation of the output of each convolution layer; and carrying out downsampling processing on the first processing feature representation based on the multi-layer convolution layer to obtain the image feature representation corresponding to the target image.
In an alternative embodiment, the partitioning module 1160 is further configured to pass the fused feature representation through a first deconvolution layer to obtain a second processed feature representation; receiving a first processed feature representation of a first convolutional layer map having the same resolution as the first deconvolution layer; and fusing the first processing characteristic representation with the second processing characteristic representation, and performing up-sampling processing through a plurality of subsequent deconvolution layers to obtain the output characteristic representation.
In an alternative embodiment, the fusing module 1150 is further configured to map the first processing feature representation output by each of the convolution layers onto a deconvolution layer having the same resolution as each of the convolution layers, where the first processing feature representation is used to perform feature fusion with a second processing feature representation obtained on the deconvolution layer.
In an alternative embodiment, the classification module 1130 is further configured to perform classification on the target image based on the image feature representation to obtain a candidate classification result corresponding to the target image, where the candidate classification result includes a first probability value that the specified image content is present in the target image, and a second probability value that the specified image content is not present in the target image; and taking the classification result with high numerical value in the first probability value and the second probability value as the image classification result.
In an optional embodiment, the segmentation module 1160 is further configured to perform area identification of the specified image content on the target image based on the fusion feature representation, so as to obtain a prediction probability that each pixel point in the target image is a pixel point of the specified image content; and separating a specified image area corresponding to the specified image content in the target image based on the prediction probability to obtain the specified content segmentation result.
In an optional embodiment, the partitioning module 1160 is further configured to assign a value to the specified pixel with the first pixel value in response to the prediction probability corresponding to the specified pixel reaching a preset probability threshold; or, in response to the prediction probability corresponding to the appointed pixel point not reaching the preset probability threshold, assigning a value to the appointed pixel point by using a second pixel value; and taking the image area with the plurality of pixel points corresponding to the first pixel value as the appointed image area to obtain the appointed content segmentation result.
In an alternative embodiment, the target image is subjected to segmentation processing through a target image segmentation model, and the image segmentation model comprises a classification prior network;
as shown in fig. 12, the apparatus further includes:
a training module 1170 for inputting a sample image into the image segmentation model, the image segmentation model being a segmentation model trained from a first sample image comprising specified image content, the sample image comprising at least one of the first sample image and a second sample image not comprising the specified image content; carrying out category prediction on the sample image through the classification prior network to obtain a prediction category corresponding to the sample image; obtaining a class loss value based on the difference between the predicted class and the sample class label correspondingly marked by the sample image; and training the classification priori network by using the class loss value, and obtaining the trained target image segmentation model.
In an alternative embodiment, the training module 1170 is further configured to predict a specified image region in the sample image corresponding to the specified image content with the image segmentation model; acquiring an image area tag of the sample image label, wherein the image area tag is used for indicating the area position of the specified image content in the sample image; acquiring an area loss value based on the difference between the designated image area and the image area label; and training the image segmentation model according to the region loss value and the category loss value, and obtaining the target image segmentation model.
In summary, the fused feature representation obtained after feature extraction and classification recognition of the target image focuses on the image information of the channel corresponding to the image classification result, so that the designated image content can be separated from the target image more accurately, misguidance of the image content with higher confusion in the target image on the image segmentation process can be effectively avoided by means of the classification information in the fused feature representation, and the accuracy of the designated content segmentation result is fully improved.
It should be noted that: the image content dividing apparatus provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to perform all or part of the functions described above. In addition, the image content segmentation apparatus and the image content segmentation method provided in the foregoing embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
Fig. 13 shows a block diagram of an electronic device 1300 according to an exemplary embodiment of the application. The electronic device 1300 may be a portable mobile terminal such as: a smart phone, an in-vehicle terminal, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer or a desktop computer. The electronic device 1300 may also be referred to by other names such as user device, portable terminal, laptop terminal, desktop terminal, etc.
In general, the electronic device 1300 includes: a processor 1301, and a memory 1302.
Processor 1301 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. Processor 1301 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). Processor 1301 may also include a main processor and a coprocessor; the main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, processor 1301 may integrate a GPU (Graphics Processing Unit) for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1301 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1302 may include one or more computer-readable storage media, which may be non-transitory. Memory 1302 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in memory 1302 is used to store at least one instruction for execution by processor 1301 to implement the image content segmentation method provided by the method embodiments of the present application.
In some embodiments, the electronic device 1300 also includes one or more sensors. The one or more sensors include, but are not limited to: proximity sensor, gyro sensor, pressure sensor.
A proximity sensor, also referred to as a distance sensor, is typically provided on the front panel of the electronic device 1300. The proximity sensor is used to capture the distance between the user and the front of the electronic device 1300.
The gyro sensor may detect a body direction and a rotation angle of the electronic device 1300, and the gyro sensor may cooperate with the acceleration sensor to collect a 3D motion of the user on the electronic device 1300. Processor 1301 can implement the following functions based on the data collected by the gyro sensor: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
The pressure sensor may be disposed on a side frame of the electronic device 1300 and/or on an underlying layer of the display screen. When the pressure sensor is disposed on the side frame of the electronic device 1300, a holding signal of the electronic device 1300 by the user may be detected, and the processor 1301 may perform left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor. When the pressure sensor is disposed at the lower layer of the display screen, the processor 1301 controls the operability control on the UI interface according to the pressure operation of the user on the display screen. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
In some embodiments, electronic device 1300 also includes other component parts, and those skilled in the art will appreciate that the structure shown in FIG. 13 is not limiting of electronic device 1300, and may include more or less components than those illustrated, or may combine certain components, or employ a different arrangement of components.
Embodiments of the present application also provide a computer apparatus including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the image content segmentation method provided in the above-mentioned method embodiments.
Embodiments of the present application also provide a computer readable storage medium having stored thereon at least one instruction, at least one program, a code set, or an instruction set, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the image content segmentation method provided by the above-mentioned method embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the image content segmentation method according to any one of the above embodiments.
The foregoing description of the preferred embodiments of the present application does not limit the application; the scope of protection is defined by the appended claims.

Claims (16)

1. A method of image content segmentation, the method comprising:
acquiring a target image to be subjected to specified image content segmentation;
extracting features of the target image to obtain an image feature representation corresponding to the target image;
classifying and identifying the target image based on the image feature representation to obtain an image classification result corresponding to the target image, wherein the image classification result is used for indicating whether the specified image content is contained in the target image;
performing dimension conversion on the image classification result to obtain a classification feature representation;
fusing the image feature representation with the classification feature representation to obtain a fused feature representation;
and segmenting the specified image content in the target image based on the fused feature representation to obtain a specified content segmentation result.
2. The method according to claim 1, wherein the performing dimension conversion on the image classification result to obtain the classification feature representation comprises:
performing dimension conversion on the image classification result based on the feature dimension of the image feature representation to obtain the classification feature representation having the same dimension as the image feature representation.
3. The method of claim 2, wherein fusing the image feature representation with the classification feature representation to obtain a fused feature representation comprises:
splicing different dimensions of the image feature representation with corresponding dimensions of the classification feature representation along the channel dimension to obtain the fused feature representation.
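A minimal PyTorch sketch of the flow in claims 1 to 3, assuming a convolutional encoder, a two-class classification head, a linear projection for the dimension conversion, and channel-wise concatenation for the fusion; the module names, layer counts, and channel widths are illustrative and not taken from the patent.

```python
import torch
import torch.nn as nn

class ContentSegmenter(nn.Module):
    """Illustrative classification-prior guided segmentation (not the patented model)."""

    def __init__(self, in_ch=3, feat_ch=64, num_classes=2):
        super().__init__()
        # Feature extraction: down-samples the target image to an image feature representation.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Classification head: does the image contain the specified content?
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_ch, num_classes),
        )
        # Dimension conversion: projects the classification result to the feature dimension.
        self.cls_to_feat = nn.Linear(num_classes, feat_ch)
        # Decoder: up-samples the fused representation to a per-pixel mask.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_ch * 2, feat_ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(feat_ch, 1, 4, stride=2, padding=1),
        )

    def forward(self, image):
        feats = self.encoder(image)                      # image feature representation
        cls_logits = self.classifier(feats)              # image classification result
        cls_feat = self.cls_to_feat(cls_logits)          # classification feature representation
        cls_feat = cls_feat[:, :, None, None].expand_as(feats)
        fused = torch.cat([feats, cls_feat], dim=1)      # fusion along the channel dimension
        mask_logits = self.decoder(fused)                # specified-content segmentation (logits)
        return cls_logits, mask_logits

# Example: one 256x256 RGB target image.
cls_logits, mask_logits = ContentSegmenter()(torch.randn(1, 3, 256, 256))
```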
4. A method according to any one of claims 1 to 3, wherein the segmenting the specified image content in the target image based on the fused feature representation to obtain a specified content segmentation result comprises:
passing the fused feature representation through a plurality of deconvolution layers to perform up-sampling processing on the fused feature representation, so as to obtain an output feature representation;
and predicting a specified image area of the specified image content in the target image based on the output feature representation, and taking a result indicating the specified image area as the specified content segmentation result.
5. The method of claim 4, wherein the performing feature extraction on the target image to obtain an image feature representation corresponding to the target image comprises:
passing the target image through a plurality of convolution layers to obtain a first processing feature representation output by each convolution layer;
and performing down-sampling processing on the first processing feature representation based on the plurality of convolution layers to obtain the image feature representation corresponding to the target image.
6. The method of claim 5, wherein upsampling the fused feature representation through a plurality of deconvolution layers to obtain an output feature representation comprises:
passing the fused feature representation through a first deconvolution layer to obtain a second processing feature representation;
receiving the first processing feature representation mapped from a first convolution layer having the same resolution as the first deconvolution layer;
and fusing the first processing feature representation with the second processing feature representation, and performing up-sampling processing through a plurality of subsequent deconvolution layers to obtain the output feature representation.
7. The method of claim 5, wherein the method further comprises:
mapping the first processing feature representation output by each convolution layer onto a deconvolution layer having the same resolution as that convolution layer, wherein the first processing feature representation is used for feature fusion with the second processing feature representation obtained at the deconvolution layer.
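Claims 5 to 7 describe an encoder-decoder arrangement in which the feature map of each convolution layer is mapped onto the deconvolution layer of matching resolution and fused there before further up-sampling. A minimal sketch under that reading; the two-level depth and the channel widths are assumptions.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Illustrative encoder-decoder with resolution-matched skip connections."""

    def __init__(self, in_ch=3, ch=32):
        super().__init__()
        # Encoder: each convolution layer halves the spatial resolution.
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, ch, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU())
        # Decoder: each deconvolution layer doubles the spatial resolution.
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.ConvTranspose2d(ch * 2, 1, 4, stride=2, padding=1)  # input: skip + decoded

    def forward(self, x):
        f1 = self.enc1(x)    # first processing feature representation, 1/2 resolution
        f2 = self.enc2(f1)   # deepest image feature representation, 1/4 resolution
        d2 = self.dec2(f2)   # second processing feature representation, back at 1/2 resolution
        # Fuse the encoder feature of matching resolution with the decoded feature,
        # then up-sample once more to the output feature representation.
        return self.dec1(torch.cat([f1, d2], dim=1))

# Example: the output mask logits match the 128x128 input resolution.
out = EncoderDecoder()(torch.randn(1, 3, 128, 128))
```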
8. A method according to any one of claims 1 to 3, wherein said classifying and identifying the target image based on the image feature representation to obtain an image classification result corresponding to the target image comprises:
classifying and identifying the target image based on the image feature representation to obtain a candidate classification result corresponding to the target image, wherein the candidate classification result comprises a first probability value that the specified image content exists in the target image and a second probability value that the specified image content does not exist in the target image;
and taking, as the image classification result, the classification result corresponding to the higher of the first probability value and the second probability value.
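Claim 8 keeps whichever of the two candidate probabilities is higher. A small sketch, assuming the classification head outputs two logits whose ordering ("contains" first) is illustrative.

```python
import torch

# cls_logits stands in for the classification head output: shape [batch, 2],
# index 0 = "contains the specified content", index 1 = "does not" (assumed ordering).
cls_logits = torch.randn(4, 2)
probs = torch.softmax(cls_logits, dim=1)   # first and second probability values
image_cls = probs.argmax(dim=1)            # keep whichever class has the higher probability
```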
9. A method according to any one of claims 1 to 3, wherein the segmenting the specified image content in the target image based on the fused feature representation to obtain a specified content segmentation result comprises:
performing region identification of the specified image content on the target image based on the fused feature representation to obtain a prediction probability that each pixel point in the target image is a pixel point of the specified image content;
and separating a specified image area corresponding to the specified image content in the target image based on the prediction probability to obtain the specified content segmentation result.
10. The method according to claim 9, wherein the separating the specified image area corresponding to the specified image content in the target image based on the prediction probability to obtain the specified content segmentation result includes:
assigning a first pixel value to a specified pixel point in response to the prediction probability corresponding to the specified pixel point reaching a preset probability threshold; or assigning a second pixel value to the specified pixel point in response to the prediction probability corresponding to the specified pixel point not reaching the preset probability threshold;
and taking the image area formed by the pixel points corresponding to the first pixel value as the specified image area to obtain the specified content segmentation result.
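Claims 9 and 10 threshold the per-pixel prediction probabilities into a two-valued mask whose first-valued pixels form the specified image area. A minimal sketch; the 0.5 threshold and the 255/0 pixel values are assumptions rather than values from the patent.

```python
import torch

# mask_logits stands in for the decoder output: shape [batch, 1, H, W].
mask_logits = torch.randn(1, 1, 256, 256)
probs = torch.sigmoid(mask_logits)   # prediction probability that each pixel is specified content

threshold = 0.5                      # assumed preset probability threshold
# Pixels reaching the threshold get the first pixel value (255 here), the rest get the
# second pixel value (0); the 255-valued pixels form the specified image area.
mask = (probs >= threshold).to(torch.uint8) * 255
```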
11. A method according to any one of claims 1 to 3, wherein the target image is segmented by a target image segmentation model comprising a classification prior network;
the method further comprises the steps of:
inputting a sample image into an image segmentation model, wherein the image segmentation model is a segmentation model obtained by training with a first sample image containing the specified image content, and the sample image comprises at least one of the first sample image and a second sample image not containing the specified image content;
performing class prediction on the sample image through the classification prior network to obtain a predicted class corresponding to the sample image;
obtaining a class loss value based on a difference between the predicted class and a sample class label annotated on the sample image;
and training the classification prior network by using the class loss value to obtain the trained target image segmentation model.
12. The method of claim 11, wherein the training the classification prior network by using the class loss value to obtain the trained target image segmentation model comprises:
predicting a specified image area corresponding to the specified image content in the sample image by using the image segmentation model;
acquiring an image area label annotated on the sample image, wherein the image area label is used for indicating the area position of the specified image content in the sample image;
obtaining an area loss value based on a difference between the specified image area and the image area label;
and training the image segmentation model according to the area loss value and the class loss value to obtain the target image segmentation model.
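Claims 11 and 12 train the model with a class loss from the classification prior network plus an area loss from the predicted region. A minimal sketch of one training step on dummy data; the tiny stand-in model, the cross-entropy and binary cross-entropy losses, and the equal loss weighting are all assumptions.

```python
import torch
import torch.nn as nn

class TinySegmenter(nn.Module):
    """Stand-in for any model that returns (classification logits, mask logits)."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 8, 3, padding=1)
        self.cls_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
        self.seg_head = nn.Conv2d(8, 1, 1)

    def forward(self, x):
        feats = torch.relu(self.backbone(x))
        return self.cls_head(feats), self.seg_head(feats)

model = TinySegmenter()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
cls_criterion = nn.CrossEntropyLoss()    # class loss: predicted class vs. sample class label
seg_criterion = nn.BCEWithLogitsLoss()   # area loss: predicted area vs. image area label

def train_step(sample_image, class_label, area_label):
    cls_logits, mask_logits = model(sample_image)
    class_loss = cls_criterion(cls_logits, class_label)   # supervises the classification prior branch
    area_loss = seg_criterion(mask_logits, area_label)    # supervises the segmentation branch
    loss = class_loss + area_loss                          # assumed equal weighting of the two losses
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example step on dummy 64x64 samples: class labels 1/0, binary area masks.
train_step(torch.randn(2, 3, 64, 64),
           torch.tensor([1, 0]),
           (torch.rand(2, 1, 64, 64) > 0.5).float())
```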
13. An image content segmentation apparatus, the apparatus comprising:
the acquisition module is used for acquiring a target image to be subjected to specified image content segmentation;
the extraction module is used for performing feature extraction on the target image to obtain an image feature representation corresponding to the target image;
the classification module is used for classifying and identifying the target image based on the image feature representation to obtain an image classification result corresponding to the target image, wherein the image classification result is used for indicating whether the specified image content is contained in the target image;
the conversion module is used for performing dimension conversion on the image classification result to obtain a classification feature representation;
the fusion module is used for fusing the image feature representation with the classification feature representation to obtain a fused feature representation;
and the segmentation module is used for segmenting the specified image content in the target image based on the fused feature representation to obtain a specified content segmentation result.
14. A computer device comprising a processor and a memory, wherein the memory has stored therein at least one program that is loaded and executed by the processor to implement the image content segmentation method according to any one of claims 1 to 12.
15. A computer-readable storage medium, in which at least one program is stored, the at least one program being loaded and executed by a processor to implement the image content segmentation method according to any one of claims 1 to 12.
16. A computer program product comprising a computer program or instructions which, when executed by a processor, implements the image content segmentation method according to any one of claims 1 to 12.
CN202211504318.4A 2022-11-28 2022-11-28 Image content segmentation method, device, apparatus, storage medium and program product Pending CN116958533A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211504318.4A CN116958533A (en) 2022-11-28 2022-11-28 Image content segmentation method, device, apparatus, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211504318.4A CN116958533A (en) 2022-11-28 2022-11-28 Image content segmentation method, device, apparatus, storage medium and program product

Publications (1)

Publication Number Publication Date
CN116958533A true CN116958533A (en) 2023-10-27

Family

ID=88455386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211504318.4A Pending CN116958533A (en) 2022-11-28 2022-11-28 Image content segmentation method, device, apparatus, storage medium and program product

Country Status (1)

Country Link
CN (1) CN116958533A (en)


Legal Events

Date Code Title Description
PB01 Publication