WO2021233031A1 - Image processing method, apparatus, device, storage medium, and image segmentation method

Image processing method, apparatus, device, storage medium, and image segmentation method

Info

Publication number
WO2021233031A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
probability
model
category
unknown category
Application number
PCT/CN2021/087579
Other languages
English (en)
French (fr)
Inventor
冯洁
穆亚东
王帅
田贵宇
白一鸣
魏祥野
欧歌
吴琼
Original Assignee
京东方科技集团股份有限公司
北京大学
Application filed by 京东方科技集团股份有限公司 and 北京大学
Priority to US17/754,158 (published as US12039766B2)
Publication of WO2021233031A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/84Arrangements for image or video recognition or understanding using pattern recognition or machine learning using probabilistic graphical models from image or video features, e.g. Markov models or Bayesian networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/26Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G06V30/274Syntactic or semantic context, e.g. balancing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/10Recognition assisted with metadata

Definitions

  • This application relates to an image processing method, device, equipment, computer readable storage medium, and image segmentation method.
  • Image segmentation technology is one of the core issues in the field of computer vision. This technology aims to perform pixel-level semantic annotation on images.
  • the input of the image segmentation model is generally an ordinary image or video frame, and the output is the semantic label of each pixel (the label category is usually specified in advance).
  • an image processing method including: acquiring an image data set, the image data set containing an image and accompanying text related to an unknown category in the image; and using an unknown category acquisition model to generate the probability and/or distribution of the unknown category, the probability and/or distribution of the unknown category including the probability that each pixel in the image comes from the unknown category, the probability that the unknown category exists in the image, and the partition probability after the image is subdivided into multiple regions.
  • the unknown category acquisition model includes a local branch, a semi-global branch, and a global branch, wherein the local branch is configured to generate, based on the annotation information of known categories, the probability that each pixel in the image comes from the unknown category; the global branch is configured to generate, based on the accompanying text, the probability that the unknown category exists in the image; and the semi-global branch is configured to generate, based on the annotation information and the accompanying text, the partition probability after the image is subdivided into multiple regions.
  • the global branch is based on the accompanying text and uses a text semantic extraction model to generate the probability that the unknown category exists in the image.
  • the text semantic extraction model is a Bidirectional Encoder Representations from Transformers (BERT) model, wherein the probability that the unknown category is present in the image, generated by the BERT model, is expressed as

    P(c ∈ x) = Sigmoid(H_o(Φ(caption(x), EOS, description(c))))

    where H_o(·) represents a freely defined function whose output, before being processed by the sigmoid function, scores whether the unknown category appears in the image; Φ represents the BERT model; caption(x) represents the accompanying text of the image; EOS is an end-of-sentence token in natural language processing; c represents an unknown category; and description(c) represents a keyword or text description of the unknown category c.
  • the local branch uses a first model to generate the probability that each pixel in the image comes from the unknown category, wherein the first model is obtained by training on the annotation information.
  • the annotation information includes the coordinates of pixels of known categories, and the first model is trained in the following manner: selecting pixels of one known category, among multiple known categories in an image of the image data set, as one piece of verification data in the verification set; selecting pixels of the other known categories as training data in the training set; and training the first model based on the coordinates of the known-category pixels in the verification set and the training set.
  • the semi-global branch uses a second model to generate the partition probability, the second model being obtained by training on the accompanying text and the annotation information.
  • the partition probability includes a first probability distribution that each pixel in each of the multiple image subdivision regions, generated after the image is subdivided into multiple regions, comes from the unknown category, and a second probability distribution that the unknown category exists in each image subdivision region.
  • the second model is trained in the following manner: subdividing the image into multiple regions along the vertical direction or the horizontal direction; generating, based on the accompanying text, a first training probability distribution in which the unknown category exists in each image subdivision region; generating, based on the annotation information, a second training probability distribution in which each pixel in each image subdivision region comes from the unknown category; constructing a loss function according to the first training probability distribution and the second training probability distribution; and training the second model through the loss function.
  • constructing the loss function according to the first training probability distribution and the second training probability distribution includes: constructing the loss function based on the Euclidean distance between the first training probability distribution and the second training probability distribution.
  • the accompanying text includes user comments and/or image titles.
  • an image segmentation method including: acquiring a first image; and processing the first image using an image segmentation model to generate a segmented second image, wherein the image segmentation model is obtained by training an original image segmentation network using a first training set, the first training set including the probability and/or distribution of the unknown category obtained by the above image processing method, and the second image includes multiple regions corresponding to different categories.
  • an image processing apparatus including: an acquisition unit for acquiring an image data set, the image data set including an image and accompanying text related to an unknown category in the image; and a generating unit for generating the probability and/or distribution of the unknown category using the unknown category acquisition model, the probability and/or distribution of the unknown category including the probability that each pixel in the image comes from the unknown category, the probability that the unknown category exists in the image, and the partition probability after the image is subdivided into multiple regions.
  • the unknown category acquisition model includes a local branch, a semi-global branch, and a global branch, wherein the local branch is configured to generate, based on the annotation information of known categories, the probability that each pixel in the image comes from the unknown category; the global branch is configured to generate, based on the accompanying text, the probability that the unknown category exists in the image; and the semi-global branch is configured to generate, based on the annotation information and the accompanying text, the partition probability after the image is subdivided into multiple regions.
  • an image processing device including: a processor; and a memory in which computer-readable instructions are stored, wherein the image processing method is executed when the computer-readable instructions are executed by the processor. The method includes: acquiring an image data set, the image data set containing an image and accompanying text related to an unknown category in the image; and using an unknown category acquisition model to generate the probability and/or distribution of the unknown category, the probability and/or distribution of the unknown category including the probability that each pixel in the image comes from the unknown category, the probability that the unknown category exists in the image, and the partition probability after the image is subdivided into multiple regions.
  • a computer-readable storage medium for storing a computer-readable program, which causes a computer to execute the above-mentioned image processing method.
  • Fig. 1 shows a flowchart of an image processing method according to an embodiment of the present disclosure
  • FIG. 2 shows a schematic diagram of an example of image accompanying text according to an embodiment of the present disclosure
  • Fig. 3 shows a schematic diagram of an unknown category labeling method according to an embodiment of the present disclosure
  • FIG. 4 shows a flowchart of the operation of training the first model according to an embodiment of the present disclosure
  • FIG. 5 shows a flowchart of the operation of training a second model according to an embodiment of the present disclosure
  • FIG. 6 shows a schematic diagram of the effect of a semi-global branch according to an embodiment of the present disclosure
  • Fig. 7 shows a flowchart of an image segmentation method according to an embodiment of the present disclosure
  • FIG. 8 shows a schematic diagram of a segmented image generated by an image segmentation model according to an embodiment of the present disclosure
  • Fig. 9 shows a schematic diagram of a small sample image segmentation method according to an embodiment of the present disclosure
  • Fig. 10 shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure
  • FIG. 11 shows a block diagram of an image processing device according to an embodiment of the present disclosure.
  • FIG. 12 shows a schematic diagram of a storage medium according to an embodiment of the present disclosure.
  • “First”, “second”, and similar words used in the present disclosure do not indicate any order, quantity, or importance; they are only used to distinguish different components.
  • “Including”, “comprising”, and similar words mean that the element or item before the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items.
  • Words such as “connected” or “coupled” are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. “Up”, “down”, “left”, “right”, etc. are only used to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationships may change accordingly.
  • the image segmentation model is obtained by collecting a large number of training images in advance, performing pixel-level semantic annotation on them, and then obtaining the optimal model parameters by means of machine learning.
  • Semantic annotation in the image segmentation task is very labor intensive and severely restricts the scale of training data for this task.
  • When an image segmentation model is deployed to a new application scenario, it usually encounters new unknown categories (also called few-shot or zero-shot categories). Semantic annotations for these unknown categories are extremely rare and, in some cases, completely missing.
  • the small sample image segmentation task (or called the unknown category image segmentation task) aims to obtain an image segmentation model that can handle new categories from small sample (or zero sample) data.
  • the present disclosure provides an image processing method that uses an unknown category acquisition model including a local branch, a semi-global branch, and a global branch to generate the probability and/or distribution of the unknown category, and uses the probability and/or distribution of the unknown category as training data to train the image segmentation network, so that the image segmentation network automatically labels the unknown category in the image without pixel-level semantic annotation of the unknown category, thereby saving substantial labor cost and time.
  • At least one embodiment of the present disclosure provides an image processing method, an image processing apparatus, an image processing device, and a computer-readable storage medium.
  • the following is a non-limiting description of the image processing method provided according to at least one embodiment of the present disclosure, through several examples and embodiments. As described below, if they do not conflict with each other, the features of these specific examples and embodiments can be combined with each other to obtain new examples and embodiments, and these new examples and embodiments also fall within the protection scope of the present disclosure.
  • Hereinafter, an image processing method according to an embodiment of the present disclosure will be described with reference to FIGS. 1-6.
  • This method can be automatically completed by a computer or the like.
  • the image processing method can be implemented in software, hardware, firmware, or any combination thereof, and is loaded and executed by a processor in a device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, and a network server.
  • the image processing method is suitable for a computing device, which includes any electronic device with computing functions, such as a mobile phone, a laptop, a tablet, a desktop computer, or a network server, that can load and execute the image processing method; the embodiments of the present disclosure do not limit this.
  • the computing device may include a central processing unit (CPU) or a graphics processing unit (GPU), other forms of processing units with data processing capabilities and/or instruction execution capabilities, storage units, and the like.
  • the computing device is also installed with an operating system, an application programming interface (for example, OpenGL (Open Graphics Library), Metal, etc.), etc., and the image processing method provided by the embodiment of the present disclosure is implemented by running code or instructions.
  • the computing device may also include a display component, such as a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, a quantum dot light-emitting diode (QLED) display, a projection component, or a VR head-mounted display device (for example, a VR helmet or VR glasses), which is not limited in the embodiments of the present disclosure.
  • the display part can display the object to be displayed.
  • the image processing method includes the following steps S101 to S102.
  • the image processing described in the present disclosure may include image digitization, image encoding, image enhancement, image restoration, image segmentation, image analysis, etc., which are not limited here.
  • the present disclosure takes image segmentation as an example for description.
  • In step S101, an image data set is obtained, the image data set including an image and accompanying text related to an unknown category in the image.
  • In step S102, an unknown category acquisition model is used to generate the probability and/or distribution of the unknown category.
  • the probability and/or distribution of the unknown category includes the probability that each pixel in the image comes from the unknown category, the probability that the unknown category exists in the image, and the partition probability after the image is subdivided into multiple regions.
  • the image data set usually contains some kind of accompanying text, such as user comments under the image of a social networking site, image title, and so on.
  • the accompanying text in the method described in the present disclosure takes an image caption as an example to show the use of the accompanying text for small sample image processing. It should be understood that the present disclosure may include other forms of image accompanying text, which is not limited here.
  • For example, the image title “The person wearing black short sleeves is playing guitar” is related to the unknown category “guitar”, whereas the image title “The person wearing black short sleeves is playing the piano” is unrelated to the unknown category “guitar”.
  • the image title “A man in black short sleeves is playing a musical instrument” may be related to the unknown category “guitar”.
  • Figure 2 shows some examples of image titles.
  • the image title is usually a sentence describing the most critical semantic content in the image.
  • the image title is useful in the following situations: 1) the title directly contains keywords of the unknown category; 2) the title can be used to implicitly infer the probability that the unknown category exists in the image.
  • the unknown category acquisition model may include local branches, semi-global branches, and global branches.
  • Local branches, semi-global branches, and global branches may correspond to different modules.
  • the local branch may be configured to generate, based on the annotation information of known categories, the probability that each pixel in the image comes from the unknown category; the global branch may be configured to generate, based on the accompanying text, the probability that the unknown category exists in the image; and the semi-global branch may be configured to generate, based on the annotation information and the accompanying text, the partition probability after the image is subdivided into multiple regions.
  • Fig. 3 is a schematic diagram of an unknown category labeling method according to an embodiment of the present disclosure.
  • the image processing method of the present disclosure reuses the existing annotation information 31 of the known categories and, at the same time, uses the accompanying text 32 of the image; the image is processed by the model including the local branch 33, the semi-global branch 35, and the global branch 37 to generate probabilities of the unknown category at different levels (e.g., pixel level, image subdivision region level, image global level).
  • the local branch 33 generates, based on the annotation information 31 of the known categories, the probability that each pixel in the image comes from the unknown category (pixel-level probability 34); the global branch 37 generates, based on the accompanying text 32, the probability that the unknown category exists in the image (image global probability 38); and the semi-global branch 35 generates, based on the annotation information 31 and the accompanying text 32, the partition probability 36 after the image is subdivided into multiple regions.
  • the global branch may be based on the accompanying text, using a text semantic extraction model to generate the probability that the unknown category exists in the image.
  • a context-sensitive pre-trained text semantic extraction model, such as Bidirectional Encoder Representations from Transformers (BERT), can be used to handle contextual entailment questions (CEQs) built from the accompanying text, of the form

    CEQ(x, c) = [caption(x); EOS; description(c)]

    where:
  • x represents a specific image.
  • caption(x) represents the text caption of the image.
  • EOS is the end of sentence in natural language processing.
  • c means unknown category.
  • description(c) represents the keyword or text description of the unknown category c.
  • the training process of the BERT model includes tasks related to context-based entailment relations between sentences; therefore, after the above-mentioned CEQ is fed to a deep network model such as BERT, its high-level output includes a judgment of the entailment relationship.
  • a pair of premise and hypothesis sentences can be divided into three categories: contradiction, neutral, and entailment.
  • For example, the premise “a football match involving many men” entails the hypothesis “some men are participating in a sport” and contradicts “no man is moving in the image”.
  • the goal of the above-mentioned CEQ is to predict the relationship between the premise and the hypothesis, which may be an entailment relationship or a contradictory relationship. If the relationship is judged to be strongly entailing, the unknown category c is consistent with the semantics of the image title.
  • the judgment of the above-mentioned entailment relationship can be controlled by introducing parameters.
  • the output range of the CEQ can be widened to [0,1], and the relationship between the premise and the hypothesis can be predicted by converting the judgment into a confidence-modulated binary classification.
  • This can be achieved by attaching a fully connected head (denoted as H_o(·)) to the backbone of the BERT model, so that the presence probability becomes

    s_{x,c} = Sigmoid(H_o(Φ(CEQ(x, c))))

    Here H_o(·) represents a freely defined function, which is not limited here; its output, before Sigmoid(), scores the appearance of the specific category in the image, and Φ represents the BERT model. The output of the activation function Sigmoid() lies in the interval [0,1] and serves as the probability output; x represents the input image of the BERT model. It should be noted that the activation function Sigmoid() is only an example; activation functions such as softmax or tanh can also be used, and there is no limitation here.
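  • As a non-limiting illustration, the data flow of the global branch (caption(x) → Φ → H_o → Sigmoid) can be sketched in Python as follows. The sketch assumes the HuggingFace transformers library and the bert-base-uncased checkpoint (neither is specified in the present disclosure), uses the tokenizer's [SEP] token in place of EOS, and leaves the head H_o untrained.

```python
# Minimal sketch of the global branch (assumptions: HuggingFace transformers,
# bert-base-uncased; the linear head H_o is untrained, for illustration only).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
backbone = BertModel.from_pretrained("bert-base-uncased")   # Phi
head_o = torch.nn.Linear(backbone.config.hidden_size, 1)    # H_o

def presence_probability(caption: str, description: str) -> float:
    # CEQ: caption(x) and description(c) form a sentence pair; the tokenizer's
    # [SEP] token stands in for the EOS separator of the disclosure.
    inputs = tokenizer(caption, description, return_tensors="pt")
    with torch.no_grad():
        cls = backbone(**inputs).pooler_output              # Phi(CEQ(x, c))
    return torch.sigmoid(head_o(cls)).item()                # value in [0, 1]

p = presence_probability(
    "A man in black short sleeves is playing a musical instrument", "guitar")
```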
  • the binary cross-entropy loss can be used to optimize the head H_o and Φ based on the known categories S, as shown below:

    L_o = − Σ_{(x,c)} [ y_{x,c} · log s_{x,c} + (1 − y_{x,c}) · log(1 − s_{x,c}) ]    (2)

    where y_{x,c} ∈ {0,1} indicates whether the known category c is present in image x.
  • During training, a known category can be randomly simulated as an unknown category to serve as one piece of verification data in the verification set, while the other known categories serve as training data in the training set.
  • the BERT model is trained based on the simulated unknown category in the verification set and the known categories in the training set.
  • the neural network can be trained through the loss function of equation (2) to obtain a BERT-based neural network model, and the probability of an unknown category appearing in the image can be obtained through this neural network model.
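  • For completeness, the binary cross-entropy objective of equation (2) can be exercised on stand-in tensors as below; the hidden size and the labels are placeholders, not values from the present disclosure.

```python
# Minimal sketch of optimizing the head H_o with binary cross-entropy (eq. (2)).
import torch

hidden = torch.randn(3, 768)                 # stand-in for Phi(CEQ) outputs
head_o = torch.nn.Linear(768, 1)
logits = head_o(hidden).squeeze(-1)          # H_o output before the sigmoid
targets = torch.tensor([1.0, 0.0, 1.0])      # 1 = known category present
loss_o = torch.nn.functional.binary_cross_entropy_with_logits(logits, targets)
loss_o.backward()                            # gradients for the head H_o
```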
  • The BERT model is only an example; other suitable text semantic extraction models may also be used in the present disclosure, and there is no limitation here.
  • the present disclosure will describe the operation of the local branch (the local branch 33 in FIG. 3) according to the embodiment of the present disclosure.
  • the local branch may use a first model to generate the probability that each pixel in the image comes from the unknown category, wherein the first model is obtained by training on the annotation information.
  • the first model proposed in the present disclosure may be implemented as a multilayer perceptron network, for example, which may be obtained by training on label information.
  • the specific description of the first model is as follows: (1) The training set contains a certain number of known categories. Most of these categories have sufficient pixel-level semantic annotations, and standard machine learning models (such as encoding-decoding networks based on convolution and pooling operations, etc.) can be used to obtain high-quality image processing models. In other words, for a given image, each pixel can be provided with a high-confidence probability of a known category. (2) By using word embedding technology (such as word2vec), the keywords of each category can be feature vectorized. (3) The first model can be trained using the label information of the known category to generate the probability that each pixel in the image comes from the unknown category.
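  • Step (2) above can be illustrated with the gensim library; the library choice and the vector file name below are assumptions, since the present disclosure only names word2vec as an example.

```python
# Minimal sketch of feature-vectorizing category keywords with word2vec.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)
w_dog = vectors["dog"]        # w_e: d-dimensional embedding of a known category
w_guitar = vectors["guitar"]  # embedding of an unknown-category keyword
print(w_dog.shape, vectors.similarity("dog", "guitar"))
```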
  • FIG. 4 is a flowchart of an operation 200 of training a first model according to an embodiment of the present disclosure.
  • the operation of training the first model includes the following steps S201 to S203.
  • In step S201, pixels of one known category among multiple known categories in an image of the image data set are selected as one piece of verification data in the verification set.
  • In step S202, pixels of the other categories among the multiple known categories are selected as training data in the training set.
  • In step S203, the first model is trained based on the coordinates of the known-category pixels in the verification set and the training set.
  • the annotation information includes the coordinates of pixels of a known category.
  • the probability that each pixel in the image comes from an unknown category can be generated by a first model M whose prediction for an unlabeled target pixel t, given a labeled source pixel s, takes the form M(position(s), position(t), w_{e_s}, w_u), with the inputs defined as follows.
  • the pixel-level first model M of the present disclosure samples a source pixel s of a known category from all labeled pixels x′ of the known categories, together with an unlabeled target pixel t.
  • e_s represents the category of the source pixel s; since the source pixel s is known to belong to a known category, e_s ∈ S, where S represents the set of known categories and U represents the set of unknown categories.
  • position(p) represents the two-dimensional coordinates of pixel p, normalized to [0, 1].
  • w_e ∈ R^d is the word embedding related to category e (that is, the feature vector obtained through a model such as word2vec); w_{e_s} is the word embedding related to the category e_s of the source pixel s; and w_u is the word embedding related to the category u (u ∈ U).
  • the spatial distribution of the unknown category u (u ∈ U) can be obtained by integrating the prediction results over all labeled pixels, for example as

    P(t ∈ u) = (1 / |x′|) · Σ_{s ∈ x′} M(position(s), position(t), w_{e_s}, w_u)
  • the first model M can be trained with the annotation information of known categories. For example, in each iteration, a pixel of a known category can be randomly selected to simulate a pixel of an unknown category as one piece of verification data in the verification set, while pixels of the other known categories are selected as training data in the training set; the first model M is then trained based on the coordinates of the known-category pixels in the verification set and the training set.
  • In this way, the probability that each pixel in the image comes from the unknown category can be generated. It should be noted that the above first model M is only an example; other suitable first models may also be adopted, and there is no limitation here. A minimal sketch is given below.
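  • A minimal sketch of such a first model M follows. The MLP architecture, the embedding dimension d, and the random inputs are illustrative assumptions; the present disclosure only fixes the inputs (position(s), position(t), w_{e_s}, w_u) and the averaging over labeled source pixels.

```python
# Minimal sketch of the pixel-level first model M (architecture assumed).
import torch
import torch.nn as nn

d = 300                                   # word-embedding dimension (assumed)
mlp = nn.Sequential(                      # M: (2 + 2 + d + d) -> probability
    nn.Linear(2 + 2 + d + d, 128),
    nn.ReLU(),
    nn.Linear(128, 1),
)

def m_score(pos_s, pos_t, w_es, w_u):
    # Normalized 2-D coordinates in [0, 1] plus two word embeddings.
    x = torch.cat([pos_s, pos_t, w_es, w_u], dim=-1)
    return torch.sigmoid(mlp(x))

# Integrate predictions over all labeled source pixels (two here, for brevity).
pos_t, w_u = torch.rand(2), torch.randn(d)
preds = [m_score(torch.rand(2), pos_t, torch.randn(d), w_u) for _ in range(2)]
p_t_u = torch.stack(preds).mean()         # probability that pixel t is from u
```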
  • In this way, training can be performed using the annotation information of known categories, and the spatial distribution of unknown categories can be generated without providing unknown-category labels, thereby saving substantial labor cost and time.
  • Next, the present disclosure describes the operation of the semi-global branch (the semi-global branch 35 in FIG. 3) according to an embodiment of the present disclosure.
  • the spatial arrangement of different objects is very important for image processing; for example, at least two hints can be used to guess the position of an object in the image.
  • the first hint is the structural arrangement between objects: for example, “people” are usually observed at a “desk”, while “giraffes” are rarely observed at a “desk”.
  • the second hint is that certain objects or concepts often have a concentrated spatial distribution: for example, the “sky” is often seen in the top area of the image.
  • the contextual entailment model (the pre-trained text semantic extraction model) in the global branch takes the accompanying text of the image (which contains global semantic information) as input, while the pixel-level first model in the local branch takes the pixel-level annotations of the known categories (which contain local category information) as input.
  • the present disclosure proposes to use a consistency loss to jointly train the global and local branches.
  • the semi-global branch is configured to generate a partition probability after the image is subdivided into multiple regions based on the annotation information and the accompanying text.
  • the semi-global branch may use a second model to generate the partition probability, and the second model is obtained through training of the accompanying text and the annotation information.
  • the partition probability includes a first probability distribution that each pixel in each image subdivision region comes from the unknown category and a second probability distribution that the unknown category exists in each image subdivision region.
  • FIG. 5 is a flowchart of an operation 300 of training a second model according to an embodiment of the present disclosure.
  • the operation of training the second model includes the following steps S301 to S305.
  • In step S301, the image is subdivided into multiple regions along the vertical direction or the horizontal direction.
  • In step S302, based on the accompanying text, a first training probability distribution in which the unknown category exists in each image subdivision region is generated.
  • In step S303, based on the annotation information, a second training probability distribution in which each pixel in each image subdivision region comes from the unknown category is generated.
  • In step S304, a loss function is constructed according to the first training probability distribution and the second training probability distribution.
  • In step S305, the second model is trained through the loss function.
  • the first training probability distribution can be generated based on the following model.
  • the present disclosure can generate a category-specific spatial distribution from the image title, assuming that the complex context in the title can roughly tell the location of an object. The realization of this idea is again based on a customization of the BERT model. In most cases, an image and its mirrored (left-right flipped) version can be described with the same title, which complicates predicting the horizontal position of objects; therefore, preferably, the model of the present disclosure only focuses on vertically positioning objects in the image. In particular, all images are divided into vertical regions of equal length. It should be understood that the image can also be subdivided into multiple regions of unequal size, and there is no limitation here.
  • Another head H_s(·) can be attached to the backbone of the BERT model, with a softmax over K outputs placed at its end, so that the BERT model estimates the spatial distribution of an unknown category c in the image x (that is, the distribution over the subdivision regions, obtained by processing the image's accompanying text through the BERT model), also called the first training probability distribution:

    P_s(k | x, c) = Softmax_k(H_s(Φ(CEQ(x, c)))),  k = 1, …, K    (6)
  • H_s(·) represents a freely defined function, and there is no restriction here.
  • the softmax activation function is only an example; activation functions such as sigmoid or tanh can also be used, and there is no limitation here.
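  • A minimal sketch of the spatial head H_s follows; the number of regions K and the hidden size are placeholders.

```python
# Minimal sketch of the K-way spatial head H_s on top of the BERT features.
import torch

K = 5                                        # number of vertical regions (assumed)
head_s = torch.nn.Linear(768, K)             # H_s
hidden = torch.randn(1, 768)                 # stand-in for Phi(CEQ(x, c))
p_first = torch.softmax(head_s(hidden), -1)  # first training probability distribution
```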
  • the BERT model can be trained through a loss function L_s realized through an information-entropy objective, for example the cross-entropy between the caption-derived distribution and the empirical per-region pixel distribution:

    L_s = − Σ_x Σ_c Σ_k ( n_{x,c,k} / Σ_{k′} n_{x,c,k′} ) · log P_s(k | x, c)

    where n_{x,c,k} is the number of image pixels classified as the unknown category c in the k-th (k = 1, …, K) region of image x. The heads H_o(·) and H_s(·), trained jointly through L_o + L_s, are complementary to each other.
  • the model for generating the first training probability distribution, in which the unknown category exists in each image subdivision region, is not limited to the above; other suitable models can be used to generate the first training probability distribution, and there is no restriction here.
  • a second training probability distribution can be generated based on the annotation information, for example by aggregating the local branch's pixel-level unknown-category probabilities within each subdivision region (Equation (9)).
  • In step S304, a loss function is constructed according to the above-mentioned first training probability distribution (Equation (6)) and the second training probability distribution (Equation (9)). It should be recognized that in this disclosure both c and u (u ∈ U) represent unknown categories, so the two distributions range over the same categories. Using the L2 (Euclidean) distance between the two distributions, the loss can be written as

    L_sg = || P_s(· | x, c) − Q_s(· | x, c) ||_2

    where Q_s denotes the second training probability distribution.
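  • The consistency between the two distributions can be sketched as follows; the map size, the equal-height split, and the stand-in tensors are assumptions for illustration.

```python
# Minimal sketch of the semi-global consistency loss: Euclidean distance between
# the caption-derived distribution (Equation (6)) and the empirical per-region
# distribution of unknown-category pixels (an Equation (9) analogue).
import torch

K = 5
pixel_probs = torch.rand(40, 60)                # local-branch map for category c
bands = pixel_probs.chunk(K, dim=0)             # K equal vertical regions
counts = torch.stack([b.sum() for b in bands])  # n_{x,c,k} analogue
p_second = counts / counts.sum()                # second training distribution
p_first = torch.softmax(torch.randn(K), dim=0)  # stand-in for the H_s output
loss_sg = torch.linalg.vector_norm(p_first - p_second)  # L2 (Euclidean) distance
```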
  • In step S305, the constructed second model is trained through the aforementioned loss function.
  • the above-mentioned model for generating, based on the annotation information, the second training probability distribution in which each pixel in each image subdivision region comes from the unknown category is not limited to this; other suitable models may be used to generate the second training probability distribution, and there is no restriction here.
  • Fig. 6 is a schematic diagram of the effect of the semi-global branch according to an embodiment of the present disclosure.
  • Fig. 6 shows the spatial distributions of different categories obtained after all images are divided into vertical regions of equal length according to the above-mentioned second model. It can be seen that, for the same category “frisbee”, the second model of the present disclosure can obtain different results according to different image titles.
  • the two images on the left side of Fig. 6 are divided into 5 regions in the vertical direction, and the distribution map on the right side of Fig. 6 shows the corresponding spatial distribution after each image is subdivided into 5 regions.
  • first model and the second model according to the embodiments of the present disclosure may adopt different neural network structures, including but not limited to convolutional neural networks, recurrent neural networks (RNN), and the like.
  • the convolutional neural network includes but is not limited to U-Net neural network, ResNet, DenseNet, etc.
  • With the probability and/or distribution of the unknown category generated by the unknown category acquisition model including the local, semi-global, and global branches, the probability and/or distribution of unknown categories can be obtained for each image at the pixel level, the image subdivision region level, and the global image level.
  • these different levels of probability information can be used as a training set; by using a deep network such as U-Net as the main body of the model, an optimization objective function for the unknown-category image segmentation model can be constructed, so that images can be segmented by the trained image segmentation model, thereby obtaining segmented images.
  • the neural network models in the present disclosure may include various neural network models, such as, but not limited to: convolutional neural networks (CNN) (including GoogLeNet, AlexNet, VGG networks, etc.), regions with convolutional neural networks (R-CNN), region proposal networks (RPN), recurrent neural networks (RNN), stack-based deep neural networks (S-DNN), deep belief networks (DBN), restricted Boltzmann machines (RBM), fully convolutional networks, long short-term memory (LSTM) networks, and classification networks.
  • the neural network model for performing a task may include sub-neural networks.
  • Fig. 7 shows a flowchart of an image segmentation method according to an embodiment of the present disclosure. As shown in FIG. 7, the image segmentation method includes the following steps S401 to S402.
  • In step S401, a first image is acquired.
  • In step S402, the first image is processed using an image segmentation model to generate a segmented second image.
  • the first image is the input image of the image segmentation model.
  • the image segmentation model may be obtained by training an original image segmentation network using a first training set, the first training set containing the probability and/or distribution of the unknown category obtained by the image processing method shown in FIG. 1, and the second image includes multiple regions corresponding to different categories.
  • the image segmentation model of the present disclosure may be a convolutional neural network, a recurrent neural network (RNN), etc., which can be trained by constructing a loss function L of the form

    L = L_SEG + λ · L_RS

    where L is the loss function of the image segmentation model and λ is a weighting factor used to balance the loss function L_SEG of the known categories and the loss function L_RS of the unknown categories.
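  • The combined objective can be sketched in a few lines; the loss values and the weighting factor λ below are placeholders.

```python
# Minimal sketch of the combined objective L = L_SEG + lambda * L_RS.
import torch

loss_seg = torch.tensor(0.73)  # supervised loss on known categories (stand-in)
loss_rs = torch.tensor(0.41)   # unknown-category loss from branches (stand-in)
lam = 0.1                      # weighting factor (assumed value)
loss = loss_seg + lam * loss_rs
```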
  • the loss function L_SEG of the known categories can be obtained by currently known techniques and is not described in detail here.
  • the loss function L_RS of the unknown categories can, for example, be constructed based on the unknown-category probabilities obtained by the above-mentioned semi-global and global branches.
  • the present disclosure may use a pair-wise ranking loss to utilize the probability information of unknown categories. Let f ∈ R^{h×w×d} denote the encoded feature map, where h×w defines the spatial resolution and d is the length of the extracted features.
  • the prediction in the image segmentation task is performed in a pixel-by-pixel manner.
  • Since the ground-truth label map y can be accessed, and it of course only contains pixel-level annotations of the known categories S, it is assumed that unknown categories appear only in the unlabeled part. The unlabeled part Y can be expressed as the collection of unlabeled pixel positions, for example

    Y = { (i, j) : y_{i,j} ∉ S }
  • Given a pair of images x1 and x2, a CNN model can be used to obtain encoded feature maps f1 and f2, and the title annotations r1, r2 can be used to generate the occurrence probabilities s_{1,e}, s_{2,e} of specific categories through the unknown category acquisition model of the present disclosure. If s_{1,e} > s_{2,e}, the image x1 can be considered more likely to contain the category e_u than the image x2; in other words, the unlabeled part Y1 of x1 is more likely to contain the unknown category e_u (u ∈ U) than the unlabeled part Y2 of x2. The ranking loss is then defined to encourage higher predicted unknown-category scores within Y1 than within Y2.
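  • One way to realize such a ranking objective is the standard margin form sketched below; the mean pooling over Y1 and Y2, the margin value, and the random scores are illustrative assumptions.

```python
# Minimal sketch of a pair-wise ranking loss over unlabeled regions Y1 and Y2.
import torch

scores_y1 = torch.rand(120)    # predicted e_u scores at unlabeled pixels of x1
scores_y2 = torch.rand(150)    # predicted e_u scores at unlabeled pixels of x2
s1, s2 = scores_y1.mean(), scores_y2.mean()
target = torch.tensor([1.0])   # +1: x1 should score above x2 (s_{1,e} > s_{2,e})
loss_rank = torch.nn.functional.margin_ranking_loss(
    s1.unsqueeze(0), s2.unsqueeze(0), target, margin=0.2)
```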
  • the spatial distribution of a certain category (that is, the partition probability after the image is subdivided into multiple regions) can also be generated from the title.
  • this type of information can be used to trim the area where the category appears.
  • k ⁇ (1,2,...,N) is the index of the area divided in the vertical direction. Is the predicted spatial distribution of the category e u (that is, the partition probability obtained by the above-mentioned global branch).
  • the loss function of the unknown category can be constructed based on the probability of the unknown category obtained by the above-mentioned local branch, semi-global branch and global branch, and there is no limitation here.
  • the above-mentioned image segmentation model can be trained on the server side.
  • the trained model needs to be deployed to the client before it can be used.
  • the data set required for the training of the neural network model only needs to be stored and used on the server side, and does not need to be deployed on the client side.
  • the neural network model according to the embodiments of the present disclosure can adopt different network structures, including but not limited to convolutional neural network, recurrent neural network (RNN), and the like.
  • the convolutional neural network includes but is not limited to U-Net neural network, ResNet, DenseNet, etc.
  • FIG. 8 schematically shows segmented images generated by an image segmentation model according to an embodiment of the present disclosure.
  • the input images are the five pictures in the first row of Fig. 8, and each picture contains different categories (for example, the first picture contains categories such as dog, frisbee, and grass).
  • a ground-truth image is a segmented image obtained using manual labels; the segmented image contains regions represented by multiple colors corresponding to different categories. It can be seen that, compared with other methods (for example, SPNet), the segmented images generated by the image segmentation model of the present disclosure (the last row of FIG. 8) are closer to the ground-truth images and contain less noise.
  • Fig. 9 is a schematic diagram of a small sample image segmentation method according to an embodiment of the present disclosure.
  • the present disclosure uses the unknown category acquisition model to generate the probability and/or distribution 51 of the unknown category.
  • the probability and/or distribution 51 of the unknown category includes the probability that each pixel in the image comes from the unknown category, generated based on the annotation information 53 of the known categories; the probability that the unknown category exists in the image, generated based on the accompanying text (contained in the image data set 55); and the partition probability after the image is subdivided into multiple regions, generated based on the annotation information 53 and the accompanying text (contained in the image data set 55).
  • the unknown category 54 is not labeled.
  • Through training with the probability and/or distribution 51, an image segmentation model 52 can be obtained, and the image segmentation model 52 can be used to segment the input image.
  • In summary, the present disclosure uses the unknown category acquisition model including the local, semi-global, and global branches to generate the probability and/or distribution of the unknown category, and uses the probability and/or distribution of the unknown category as training data to train the image segmentation network, so that the unknown categories in the image are automatically labeled even when no pixel-level annotations of the unknown categories are provided. This reduces the annotation cost and speeds up the development cycle, thereby saving substantial labor cost and time.
  • By maximizing the use of the information in all collected data, the present disclosure improves the image processing model for the same labeling cost or, for the same image processing model quality, reduces the labeling cost and accelerates the development cycle.
  • FIG. 10 is a functional block diagram illustrating an image processing apparatus according to an embodiment of the present disclosure.
  • the image processing apparatus 1000 according to an embodiment of the present disclosure includes an acquiring unit 1001 and a generating unit 1002.
  • the above-mentioned modules can respectively execute the steps of the image processing method according to the embodiment of the present disclosure as described above with reference to FIGS. 1 to 9.
  • these unit modules can be implemented in various ways: by hardware alone (for example, a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), or a field-programmable gate array (FPGA)), by software alone, or by a combination thereof, and the present disclosure is not limited to any one of them.
  • the acquiring unit 1001 is configured to acquire an image data set, the image data set including an image and accompanying text related to an unknown category in the image.
  • the generating unit 1002 is configured to use the unknown category acquisition model to generate the probability and/or distribution of the unknown category, including the probability that each pixel in the image comes from the unknown category, the probability that the unknown category exists in the image, and the partition probability after the image is subdivided into multiple regions.
  • an image data set usually contains some kind of accompanying text, such as user comments under social networking site images and image titles.
  • the accompanying text in the method described in the present disclosure takes an image caption as an example to show the use of the accompanying text for small sample image processing. It should be understood that the present disclosure may include other forms of image accompanying text, which is not limited here.
  • the unknown category acquisition model may include local branches, semi-global branches, and global branches.
  • the local branch may be configured to generate, based on the annotation information of known categories, the probability that each pixel in the image comes from the unknown category; the global branch may be configured to generate, based on the accompanying text, the probability that the unknown category exists in the image; and the semi-global branch may be configured to generate, based on the annotation information and the accompanying text, the partition probability after the image is subdivided into multiple regions.
  • the global branch may be based on the accompanying text, using a text semantic extraction model to generate the probability that the unknown category exists in the image.
  • the text semantic extraction model is a Bidirectional Encoder Representations from Transformers (BERT) model, where the probability that the unknown category exists in the image, generated using the BERT model, is expressed as

    P(c ∈ x) = Sigmoid(H_o(Φ(caption(x), EOS, description(c))))

    where H_o(·) represents a freely defined function whose output, before being processed by the sigmoid function, scores whether the unknown category appears in the image; Φ represents the BERT model; x represents the input image of the BERT model; caption(x) represents the accompanying text of the image; EOS is an end-of-sentence token in natural language processing; c represents an unknown category; and description(c) represents a keyword or text description of the unknown category c.
  • the local branch may use a first model to generate the probability that each pixel in the image comes from the unknown category, wherein the first model is obtained by training on the annotation information.
  • the annotation information includes the coordinates of pixels of known categories, and the first model can be trained in the following manner: selecting pixels of one known category among multiple known categories in an image of the image data set as one piece of verification data in the verification set; selecting pixels of the other known categories as training data in the training set; and training the first model based on the coordinates of the known-category pixels in the verification set and the training set.
  • first model M can be used to generate the probability that each pixel in the image comes from an unknown category:
  • the pixel-level first model M of the present disclosure samples the source pixels s of the known category from all the labeled pixels x′ of the known category and the unlabeled target pixel t.
  • e s represents the category of the source pixel s. Since the source pixel s is known to belong to the known category in the first model, e s ⁇ S, S represents the known category, and U represents the unknown category.
  • position(p) represents the two-dimensional coordinates of pixel p, and its size is [0, 1].
  • w e ⁇ R d is the word embedding related to category e (that is, the characteristic vector after passing through a model such as word2vec), Is the word embedding related to the category e s of the source pixel s, and w u is the word embedding related to the category u (u ⁇ U).
  • the spatial distribution of an unknown category u (u ∈ U) can be obtained by aggregating the predictions obtained from all labeled pixels.
  • the first model M can be trained with the annotation information of known categories. For example, in each iteration, pixels of one known category can be randomly selected to simulate pixels of an unknown category as one piece of validation data in the validation set, and pixels of the other known categories can be selected as one piece of training data in the training set. The first model M is trained based on the coordinates of the pixels of known categories in the validation set and the training set.
  • in this way, the probability that each pixel in the image comes from the unknown category can be generated. It should be appreciated that the above first model M is merely an example, and the present disclosure may also adopt other suitable first models, which is not limited here.
  • the semi-global branch may use a second model to generate the partition probability, where the second model is obtained through training with the accompanying text and the annotation information.
  • the partition probability may include a first probability distribution that each pixel in each of the multiple image subdivision regions generated after the image is subdivided into multiple regions comes from the unknown category, and a second probability distribution that the unknown category exists in each image subdivision region.
  • the second model may be trained in the following manner: subdividing the image into multiple regions along the vertical or horizontal direction; generating, based on the accompanying text, a first training probability distribution that the unknown category exists in each image subdivision region;
  • generating, based on the annotation information, a second training probability distribution that each pixel in each of the multiple image subdivision regions comes from the unknown category;
  • constructing a loss function from the first training probability distribution and the second training probability distribution; and training the second model with the loss function.
  • constructing the loss function from the first training probability distribution and the second training probability distribution includes: constructing the loss function based on the Euclidean distance between the first training probability distribution and the second training probability distribution.
  • the image processing device of the present disclosure uses the unknown category acquisition model to generate the probability and/or distribution of the unknown category, and uses the probability and/or distribution of the unknown category as training data to train an image segmentation network.
  • the unknown category in the image is thus annotated automatically, thereby saving substantial labor cost and time.
  • by maximizing the use of information in all collected data, the image processing device of the present disclosure improves the image processing model for the same annotation cost, or, for the same image processing model performance, reduces the annotation cost and accelerates the development cycle.
  • FIG. 11 is a schematic diagram of an image processing device 2000 according to an embodiment of the present disclosure. Since the image processing device of this embodiment has the same details as the method described above with reference to FIG. 1, a detailed description of the same content is omitted here for simplicity.
  • the image processing device 2000 includes a processor 210, a memory 220, and one or more computer program modules 221.
  • the processor 210 and the memory 220 are connected through a bus system 230.
  • one or more computer program modules 221 are stored in the memory 220.
  • one or more computer program modules 221 include instructions for executing the image processing method provided by any embodiment of the present disclosure.
  • instructions in one or more computer program modules 221 may be executed by the processor 210.
  • the bus system 230 may be a commonly used serial or parallel communication bus, etc., which is not limited in the embodiments of the present disclosure.
  • the processor 210 may be a central processing unit (CPU), a digital signal processor (DSP), a graphics processing unit (GPU), or another form of processing unit with data processing capabilities and/or instruction execution capabilities, and may be a general-purpose or dedicated processor.
  • the memory 220 may include one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • the volatile memory may include random access memory (RAM) and/or cache memory (cache), for example.
  • the non-volatile memory may include read-only memory (ROM), hard disk, flash memory, etc., for example.
  • One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 210 may run the program instructions to implement the functions (implemented by the processor 210) and/or other desired functions in the embodiments of the present disclosure, such as the image processing method.
  • the computer-readable storage medium may also store various application programs and various data, such as the element characteristics of the image data set, the first model, and various data used and/or generated by the application program.
  • for clarity and conciseness, the embodiments of the present disclosure do not present all constituent units of the image processing device 2000.
  • those skilled in the art may provide and configure other constituent units not shown according to specific needs, and the embodiments of the present disclosure do not limit this.
  • the image processing apparatus 1000 and the image processing apparatus 2000 can be used in various appropriate electronic devices.
  • FIG. 12 is a schematic diagram of a storage medium provided by at least one embodiment of the present disclosure.
  • as shown in FIG. 12, the storage medium 400 non-transitorily stores computer-readable instructions 401, and when the non-transitory computer-readable instructions are executed by a computer (including a processor), the image processing method provided by any embodiment of the present disclosure can be executed.
  • the storage medium may be any combination of one or more computer-readable storage media.
  • the computer when the program code is read by a computer, the computer can execute the program code stored in the computer storage medium, and execute, for example, the image processing method provided in any embodiment of the present disclosure.
  • the storage medium may include a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), flash memory, any combination of the foregoing storage media, or other suitable storage media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

An image processing method, apparatus, device, and computer-readable storage medium. The method includes: acquiring an image data set, where the image data set contains images and accompanying text related to an unknown category in the images (S101); and generating, by using an unknown category acquisition model, a probability and/or distribution of the unknown category (S102), where the probability and/or distribution of the unknown category includes the probability that each pixel in the image comes from the unknown category, the probability that the unknown category exists in the image, and a partition probability after the image is subdivided into multiple regions. The method can save substantial labor cost and time.

Description

Image processing method, apparatus, device, storage medium, and image segmentation method
Cross-reference to related applications
This patent application claims priority to Chinese Patent Application No. 202010438187.9, filed on May 21, 2020, the entire disclosure of which is incorporated herein by reference as part of this application.
Technical field
The present application relates to an image processing method, apparatus, device, computer-readable storage medium, and image segmentation method.
Background
Image segmentation is one of the core problems in computer vision. It aims to assign pixel-level semantic labels to an image. The input of an image segmentation model is generally an ordinary image or a video frame, and the output is a semantic label for each pixel (the set of label categories is usually specified in advance).
Summary
According to one aspect of the present disclosure, an image processing method is provided, including: acquiring an image data set, the image data set containing images and accompanying text related to an unknown category in the images; and generating, by using an unknown category acquisition model, a probability and/or distribution of the unknown category, the probability and/or distribution of the unknown category including the probability that each pixel in the image comes from the unknown category, the probability that the unknown category exists in the image, and a partition probability after the image is subdivided into multiple regions.
According to an example of the present disclosure, the unknown category acquisition model contains a local branch, a semi-global branch, and a global branch, where the local branch is configured to generate, based on annotation information of known categories, the probability that each pixel in the image comes from the unknown category; the global branch is configured to generate, based on the accompanying text, the probability that the unknown category exists in the image; and the semi-global branch is configured to generate, based on the annotation information and the accompanying text, the partition probability after the image is subdivided into multiple regions.
According to an example of the present disclosure, the global branch uses a text semantic extraction model to generate, based on the accompanying text, the probability that the unknown category exists in the image.
According to an example of the present disclosure, the text semantic extraction model is a Bidirectional Encoder Representations from Transformers (BERT) model, and the probability, generated with the BERT model, that the unknown category exists in the image is expressed as:
s_{x,c} = sigmoid(H_o(φ(caption(x); [EOS]; description(c))))
where H_o(·) denotes a freely defined function whose output is the probability, before the sigmoid function is applied, that the unknown category appears in the image; φ denotes the BERT model; caption(x) denotes the accompanying text of the image; EOS is the end-of-sentence marker in natural language processing; c denotes the unknown category; and description(c) denotes a keyword or textual description of the unknown category c.
According to an example of the present disclosure, the local branch uses a first model to generate the probability that each pixel in the image comes from the unknown category, where the first model is obtained through training with the annotation information.
According to an example of the present disclosure, the annotation information contains the coordinates of pixels of known categories, and the first model is trained as follows: selecting pixels of one known category among multiple known categories in one image of the image data set as one piece of validation data in a validation set; selecting pixels of the other categories among the multiple known categories as one piece of training data in a training set; and training the first model based on the coordinates of the pixels of known categories in the validation set and the training set.
According to an example of the present disclosure, the semi-global branch uses a second model to generate the partition probability, where the second model is obtained through training with the accompanying text and the annotation information.
According to an example of the present disclosure, the partition probability includes a first probability distribution that each pixel in each of the multiple image subdivision regions generated after the image is subdivided into multiple regions comes from the unknown category, and a second probability distribution that the unknown category exists in each image subdivision region.
According to an example of the present disclosure, the second model is trained as follows: subdividing the image into multiple regions along the vertical or horizontal direction; generating, based on the accompanying text, a first training probability distribution that the unknown category exists in each image subdivision region; generating, based on the annotation information, a second training probability distribution that each pixel in each of the multiple image subdivision regions comes from the unknown category; constructing a loss function from the first training probability distribution and the second training probability distribution; and training the second model with the loss function.
According to an example of the present disclosure, constructing the loss function from the first training probability distribution and the second training probability distribution includes: constructing the loss function based on the Euclidean distance between the first training probability distribution and the second training probability distribution.
According to an example of the present disclosure, the accompanying text includes user comments and/or image captions.
According to one aspect of the present disclosure, an image segmentation method is provided, including: acquiring a first image; and processing the first image with an image segmentation model to generate a segmented second image, where the image segmentation model is obtained by training an original image segmentation network with a first training set, the first training set containing the probability and/or distribution of the unknown category obtained with the above image processing method, and the second image contains multiple regions corresponding to different categories.
According to one aspect of the present disclosure, an image processing apparatus is provided, including: an acquisition unit configured to acquire an image data set, the image data set containing images and accompanying text related to an unknown category in the images; and a generation unit configured to generate, by using an unknown category acquisition model, a probability and/or distribution of the unknown category, the probability and/or distribution of the unknown category including the probability that each pixel in the image comes from the unknown category, the probability that the unknown category exists in the image, and the partition probability after the image is subdivided into multiple regions.
According to an example of the present disclosure, the unknown category acquisition model contains a local branch, a semi-global branch, and a global branch, where the local branch is configured to generate, based on annotation information of known categories, the probability that each pixel in the image comes from the unknown category; the global branch is configured to generate, based on the accompanying text, the probability that the unknown category exists in the image; and the semi-global branch is configured to generate, based on the annotation information and the accompanying text, the partition probability after the image is subdivided into multiple regions.
According to one aspect of the present disclosure, an image processing device is provided, including: a processor; and a memory storing computer-readable instructions, where an image processing method is executed when the computer-readable instructions are run by the processor, the method including: acquiring an image data set, the image data set containing images and accompanying text related to an unknown category in the images; and generating, by using an unknown category acquisition model, a probability and/or distribution of the unknown category, the probability and/or distribution of the unknown category including the probability that each pixel in the image comes from the unknown category, the probability that the unknown category exists in the image, and the partition probability after the image is subdivided into multiple regions.
According to one aspect of the present disclosure, a computer-readable storage medium for storing a computer-readable program is provided, the program causing a computer to execute the above image processing method.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and are intended to provide further explanation of the claimed technology.
Brief description of the drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from a more detailed description of the embodiments of the present disclosure in conjunction with the accompanying drawings. The drawings are intended to provide a further understanding of the embodiments of the present disclosure, constitute a part of the specification, serve together with the embodiments to explain the present disclosure, and do not limit the present disclosure. In the drawings, the same reference numerals generally denote the same components or steps.
FIG. 1 shows a flowchart of an image processing method according to an embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of examples of image accompanying text according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of an unknown category annotation method according to an embodiment of the present disclosure;
FIG. 4 shows a flowchart of the operation of training a first model according to an embodiment of the present disclosure;
FIG. 5 shows a flowchart of the operation of training a second model according to an embodiment of the present disclosure;
FIG. 6 shows a schematic diagram of the effect of a semi-global branch according to an embodiment of the present disclosure;
FIG. 7 shows a flowchart of an image segmentation method according to an embodiment of the present disclosure;
FIG. 8 shows a schematic diagram of segmented images generated by an image segmentation model according to an embodiment of the present disclosure;
FIG. 9 shows a schematic diagram of a low-shot image segmentation method according to an embodiment of the present disclosure;
FIG. 10 shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure;
FIG. 11 shows a block diagram of an image processing device according to an embodiment of the present disclosure; and
FIG. 12 shows a schematic diagram of a storage medium according to an embodiment of the present disclosure.
Detailed description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
The terms "first", "second", and similar words used in the present disclosure do not denote any order, quantity, or importance, but are merely used to distinguish different components. Likewise, words such as "include" or "comprise" mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connect" or "connected" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Up", "down", "left", "right", and the like are only used to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may change accordingly.
Flowcharts are used in the present application to illustrate the steps of the methods according to the embodiments of the present application. It should be understood that the preceding or following steps are not necessarily performed exactly in order; instead, the steps may be processed in reverse order or simultaneously. Meanwhile, other operations may be added to these processes, or one or more steps may be removed from them.
In standard image segmentation techniques, an image segmentation model is obtained by collecting a large number of training images in advance, performing pixel-level semantic annotation, and then obtaining the optimal parameters of the model through machine learning. Semantic annotation in the image segmentation task is very labor-intensive, which severely limits the scale of training data for this task. When an image segmentation model is deployed in a new application scenario, new unseen classes are usually encountered (also referred to as low-shot or zero-shot classes). Semantic annotations for these unseen classes are extremely rare and may be completely missing in some cases. The low-shot image segmentation task (also referred to as the unseen-class image segmentation task) aims to obtain, from low-shot (or zero-shot) data, an image segmentation model capable of handling new classes.
Existing image segmentation models are usually obtained through machine learning, rely heavily on pixel-level semantic annotation, and consume a great deal of labor. When an image segmentation model trained on a specific data set is used in a new application scenario, pixel-level semantic annotation has to be performed again for the new unseen classes in the scenario.
The present disclosure provides an image processing method that uses an unknown category acquisition model containing a local branch, a semi-global branch, and a global branch to generate the probability and/or distribution of an unknown category, and uses the probability and/or distribution of the unknown category as training data to train an image segmentation network. In this way, the image segmentation network can automatically annotate the unknown category in an image without pixel-level semantic annotation of that category being provided, thereby saving substantial labor cost and time.
The embodiments of the present disclosure and examples thereof are described in detail below with reference to the accompanying drawings.
At least one embodiment of the present disclosure provides an image processing method, an image processing apparatus, an image processing device, and a computer-readable storage medium. The image processing method provided by at least one embodiment of the present disclosure is described below in a non-limiting manner through several examples and embodiments. As described below, different features of these specific examples and embodiments may be combined with one another without conflicting, so as to obtain new examples and embodiments, which also fall within the protection scope of the present disclosure.
An image processing method according to an embodiment of the present disclosure is described below with reference to FIGS. 1-6. First, the image processing method according to an embodiment of the present disclosure is described with reference to FIG. 1. The method may be performed automatically by a computer or the like. For example, the image processing method may be implemented in software, hardware, firmware, or any combination thereof, and be loaded and executed by a processor in a device such as a mobile phone, a tablet computer, a laptop computer, a desktop computer, or a network server.
For example, the image processing method is applicable to a computing apparatus, which is any electronic device with computing capabilities, such as a mobile phone, a laptop computer, a tablet computer, a desktop computer, or a network server, that can load and execute the image processing method; the embodiments of the present disclosure do not limit this. For example, the computing apparatus may include a central processing unit (CPU) or a graphics processing unit (GPU), or other forms of processing units with data processing capabilities and/or instruction execution capabilities, as well as a storage unit, etc. An operating system and application programming interfaces (e.g., OpenGL (Open Graphics Library), Metal, etc.) are also installed on the computing apparatus, and the image processing method provided by the embodiments of the present disclosure is implemented by running code or instructions. For example, the computing apparatus may further include a display component, such as a liquid crystal display (LCD), an organic light emitting diode (OLED) display, a quantum dot light emitting diode (QLED) display, a projection component, or a VR head-mounted display device (e.g., a VR helmet or VR glasses); the embodiments of the present disclosure do not limit this. For example, the display component may display an object to be displayed.
As shown in FIG. 1, the image processing method includes the following steps S101-S102. The image processing described in the present disclosure may include image digitization, image coding, image enhancement, image restoration, image segmentation, image analysis, and the like, which is not limited here. In the following, the present disclosure takes image segmentation as an example for description.
In step S101, an image data set is acquired, the image data set containing images and accompanying text related to an unknown category in the images.
In step S102, a probability and/or distribution of the unknown category is generated by using an unknown category acquisition model, the probability and/or distribution of the unknown category including the probability that each pixel in the image comes from the unknown category, the probability that the unknown category exists in the image, and the partition probability after the image is subdivided into multiple regions. For step S101, for example, an image data set usually contains some kind of accompanying text, such as user comments and image captions under images on social networking sites. The accompanying text in the method described in the present disclosure takes the image caption as an example to show the use of accompanying text for low-shot image processing. It should be understood that the present disclosure may involve other forms of image accompanying text, which is not limited here.
For example, when the unknown category is "guitar", the image caption "a person in a black T-shirt is playing the guitar" is related to the unknown category "guitar"; the caption "a person in a black T-shirt is playing the piano" is unrelated to the unknown category "guitar"; and the caption "a person in a black T-shirt is playing a musical instrument" may be related to the unknown category "guitar".
FIG. 2 shows some examples of image captions. An image caption is usually a sentence describing the most essential semantic content of an image. When images containing certain unseen classes need to be processed, the image caption is useful in the following cases: 1) the caption directly contains a keyword of the unseen class; 2) the probability that the unseen class exists in the image can be implicitly inferred from the caption.
For step S102, for example, the unknown category acquisition model may contain a local branch, a semi-global branch, and a global branch. The local branch, the semi-global branch, and the global branch may correspond to different modules.
For example, the local branch may be configured to generate, based on annotation information of known categories, the probability that each pixel in the image comes from the unknown category; the global branch may be configured to generate, based on the accompanying text, the probability that the unknown category exists in the image; and the semi-global branch may be configured to generate, based on the annotation information and the accompanying text, the partition probability after the image is subdivided into multiple regions.
FIG. 3 is a schematic diagram of an unknown category annotation method according to an embodiment of the present disclosure. As shown in FIG. 3, the image processing method of the present disclosure reuses the existing annotation information 31 of known categories and, at the same time, uses the accompanying text 32 of the image, and uses an image processing model containing a local branch 33, a semi-global branch 35, and a global branch 37 to generate the probabilities of the unknown category at different levels (e.g., pixel level, image subdivision region level, and whole image). For example, as shown in FIG. 3, the local branch 33 generates, based on the annotation information 31 of known categories, the probability that each pixel in the image comes from the unknown category (pixel-level probability 34); the global branch 37 generates, based on the accompanying text 32, the probability that the unknown category exists in the image (image-level probability 38); and the semi-global branch 35 generates, based on the annotation information 31 and the accompanying text 32, the partition probability 36 after the image is subdivided into multiple regions. A minimal interface sketch of these three branches is given below.
An unknown category annotation method using the image processing model containing the local branch 33, the semi-global branch 35, and the global branch 37 according to embodiments of the present disclosure is described in detail below with reference to FIGS. 4-5.
First, the operation of the global branch (the global branch 37 in FIG. 3) according to an embodiment of the present disclosure will be described.
For example, the global branch may use a text semantic extraction model to generate, based on the accompanying text, the probability that the unknown category exists in the image.
For example, a context-sensitive pre-trained text semantic extraction model, such as Bidirectional Encoder Representations from Transformers (BERT), may be used to process the following contextual entailment question (CEQ) over the accompanying text:
CEQ(x, c): caption(x); [EOS]; description(c).
In the above CEQ, x denotes a specific image, caption(x) denotes the textual caption of the image, EOS is the end-of-sentence marker in natural language processing, c denotes the unknown category, and description(c) denotes a keyword or textual description of the unknown category c.
The training process of the BERT model includes tasks related to context-based entailment relations between sentences. Therefore, after the above CEQ is fed into a deep network model such as BERT, its high-level output contains a judgment of the entailment relation.
For example, a pair of premise and hypothesis sentences may be classified into three categories: contradiction, neutral, and entailment. For example, "a soccer game with multiple men playing" entails "some men are playing a sport" and contradicts "no men are moving in the image". The goal of the above CEQ is to predict the relation between the premise and the hypothesis, which may be either entailment or contradiction. A judgment of strong entailment indicates that the unknown category c is semantically consistent with the image caption.
In addition, the judgment of the above entailment relation may be controlled by introducing parameters. For example, in the BERT model, the feature vectors output by a high-level neural network layer may be average- or max-pooled into a single feature vector, and the final entailment probability may be obtained through an additional parameterized network layer (such as a fully connected layer).
For example, the range of the CEQ may be relaxed to [0, 1], and the relation between the premise and the hypothesis may be predicted by converting it into confidence-modulated binary classification. This can be achieved by attaching a fully connected head (denoted as H_o(·)) on top of the backbone of the BERT model. Let s_{x,c} be the probability that the unknown category c appears in image x, computed as:
s_{x,c} = sigmoid(H_o(φ(caption(x); [EOS]; description(c))))   (1)
where H_o(·) denotes a freely defined function, which is not limited here, whose output is the probability (before sigmoid()) that the specific category appears in the image, and φ denotes the BERT model. The output of the activation function sigmoid() lies in the interval [0, 1] and serves as the probability output; x denotes the input image of the BERT model. It should be appreciated that the above activation function sigmoid() is merely an example, and activation functions such as softmax and tanh may also be used, which is not limited here.
For example, the head H_o and φ may be optimized on the basis of the known categories S with a binary cross-entropy loss, as follows:
L_o = Σ_x Σ_{c∈S} −[ I(c∈y(x))·log(s_{x,c}) + (1 − I(c∈y(x)))·log(1 − s_{x,c}) ]   (2)
where y(x) is the label of image x and S denotes the known categories. The indicator function I(c∈y(x)) returns 1 if the category c appears in the specific image x, and 0 otherwise.
For example, in the process of training the BERT model with the above loss function (2), one known category may be randomly treated as an unknown category to serve as one piece of validation data in the validation set, while the other known categories serve as training data in the training set; the BERT model is trained based on the unknown category in the validation set (simulated from a known category) and the known categories in the training set. With the above model, the probability that the unknown category exists in the image can be generated.
In addition, a BERT-based neural network model can be obtained by training the neural network with the loss function of Equation (2), and the probability that the unknown category appears in the image can be obtained through this neural network model. It should be appreciated that the above BERT model is merely an example, and the present disclosure may also adopt other suitable text semantic extraction models, which is not limited here. A hedged sketch of this branch is given below.
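By way of non-limiting illustration, the following Python sketch outlines one possible realization of Equations (1) and (2) with a Hugging Face BERT backbone. The use of BERT's [SEP] token in place of the EOS marker, the single linear layer for H_o, and the use of the [CLS] vector are assumptions of this sketch, not requirements of the present disclosure.

```python
# Hedged sketch of the global branch (Equations (1)-(2)).
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

class GlobalBranch(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.h_o = nn.Linear(self.bert.config.hidden_size, 1)  # head H_o

    def forward(self, captions, descriptions):
        # Contextual entailment question: caption(x); [EOS]; description(c),
        # encoded here as a BERT sentence pair separated by [SEP].
        batch = tokenizer(captions, descriptions, padding=True,
                          truncation=True, return_tensors="pt")
        cls_vec = self.bert(**batch).last_hidden_state[:, 0]  # [CLS] vector
        return self.h_o(cls_vec).squeeze(-1)  # logit; sigmoid gives s_{x,c}

# Binary cross-entropy over seen classes (Equation (2)); the target indicates
# whether category c appears in the label map y(x) of image x.
model = GlobalBranch()
logits = model(["a person in a black T-shirt plays the guitar"], ["guitar"])
loss = nn.BCEWithLogitsLoss()(logits, torch.tensor([1.0]))
```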
Through the global branch of the present disclosure, the probability that an unknown category exists in the image can be generated without annotations of the unknown category being provided, thereby saving substantial labor cost and time.
Having described the operation of the global branch according to an embodiment of the present disclosure, the present disclosure will next describe the operation of the local branch (the local branch 33 in FIG. 3) according to an embodiment of the present disclosure.
For example, the local branch may use a first model to generate the probability that each pixel in the image comes from the unknown category, where the first model is obtained through training with the annotation information.
For example, semantic correlations may exist between different semantic categories, such as "beach" and "sea", or "blue sky" and "white clouds". For the annotation of unknown categories, all useful information should be used to overcome the lack of annotations. The first model proposed in the present disclosure may be implemented, for example, as a multilayer perceptron network, which can be obtained through training with the annotation information.
For example, the first model is described in detail as follows. (1) The training set contains a certain number of known categories. Most of these categories have sufficient pixel-level semantic annotations, so a standard machine learning model (such as an encoder-decoder network based on convolution and pooling operations) can be used to obtain a high-quality image processing model. In other words, for a given image, a high-confidence probability of the known categories can be provided for each pixel in it. (2) By adopting a word embedding technique (such as word2vec), the keyword of each category can be converted into a feature vector. (3) The first model can be trained with the annotation information of known categories to generate the probability that each pixel in the image comes from the unknown category.
FIG. 4 is a flowchart of an operation 200 of training the first model according to an embodiment of the present disclosure. The operation of training the first model includes the following steps S201-S203.
In step S201, pixels of one known category among multiple known categories in one image of the image data set are selected as one piece of validation data in a validation set.
In step S202, pixels of the other categories among the multiple known categories are selected as one piece of training data in a training set.
In step S203, the first model is trained based on the coordinates of the pixels of known categories in the validation set and the training set. A sketch of this episode construction is given below.
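By way of non-limiting illustration, the following Python sketch shows one possible realization of the leave-one-class-out episode construction of steps S201-S203; the dictionary-based annotation format is an assumption of this sketch.

```python
# Hedged sketch of steps S201-S203: one known class is held out per episode
# to simulate an unknown class. Annotations are assumed to be a dict mapping
# class name -> list of (row, col) pixel coordinates.
import random

def make_episode(annotations: dict):
    """Pick one seen class to act as the simulated 'unseen' validation class."""
    classes = list(annotations)
    held_out = random.choice(classes)             # simulated unseen class (S201)
    val_set = {held_out: annotations[held_out]}   # validation data
    train_set = {c: pts for c, pts in annotations.items() if c != held_out}
    return train_set, val_set                     # used to train model M (S203)
```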
For example, the annotation information contains the coordinates of pixels of known categories. The probability that each pixel in the image comes from the unknown category can be generated by the following first model M:
Given an image x, the pixel-level first model M of the present disclosure samples a source pixel s of a known category from all labeled pixels x′ of known categories, together with an unlabeled target pixel t. e_s denotes the category of the source pixel s; since the source pixel s is known to belong to a known category in the first model, e_s ∈ S, where S denotes the known categories and U denotes the unknown categories. The probability that the category of the unlabeled target pixel t belongs to the unknown category (e_t = u ∈ U) is then given by Equation (3), which is reproduced only as an image in the original document; it defines the output of the model M in terms of position(·) and the word embeddings defined below.
Here, position(p) denotes the two-dimensional coordinates of pixel p, normalized to [0, 1]. w_e ∈ R^d is the word embedding associated with category e (i.e., the feature vector obtained from a model such as word2vec), w_{e_s} is the word embedding associated with the category e_s of the source pixel s, and w_u is the word embedding associated with category u (u ∈ U).
Further, the spatial distribution of an unknown category u (u ∈ U) can be obtained by aggregating the predictions obtained from all labeled pixels, as given by Equation (4) (also reproduced only as an image in the original document), where |x′| is the number of labeled pixels, which serves as a rescaling factor. In this way, the pixel-level annotations of known categories can be used to generate the spatial distribution of certain unknown categories.
For example, the first model M can be trained with the annotation information of known categories. For example, in each iteration, pixels of one known category may be randomly selected to simulate pixels of an unknown category as one piece of validation data in the validation set, and pixels of the other known categories may be selected as one piece of training data in the training set; the first model M is trained based on the coordinates of the pixels of known categories in the validation set and the training set.
With the above first model M, the probability that each pixel in the image comes from the unknown category can be generated. It should be appreciated that the above first model M is merely an example, and the present disclosure may also adopt other suitable first models, which is not limited here.
Through the local branch of the present disclosure, which can be trained with the annotation information of known categories, the spatial distribution of an unknown category can be generated without annotations of the unknown category being provided, thereby saving substantial labor cost and time. A hedged sketch of such a pixel-level model follows.
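Since Equations (3) and (4) are reproduced only as images in the source text, the following Python sketch is necessarily a hedged approximation: it assumes that the model M is a multilayer perceptron over the relative position of the source and target pixels and the word embeddings w_{e_s} and w_u, and that the aggregation of Equation (4) is an average over labeled source pixels consistent with the 1/|x′| rescaling described above.

```python
# Hedged sketch of the pixel-level first model M; the exact functional form
# in Equations (3)-(4) is not reproduced in the source, so this MLP over
# (position(t) - position(s), w_{e_s}, w_u) is an illustrative assumption.
import torch
import torch.nn as nn

class PixelModelM(nn.Module):
    def __init__(self, embed_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 + 2 * embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, pos_t, pos_s, w_es, w_u):
        # pos_* are (…, 2) coordinates in [0, 1]^2; w_es and w_u are the
        # word embeddings of the source class and the unseen class.
        x = torch.cat([pos_t - pos_s, w_es, w_u], dim=-1)
        return self.mlp(x).squeeze(-1)

def unseen_spatial_map(model, pos_t, labeled):
    """Aggregate predictions over all labeled source pixels x' (Equation (4));
    labeled is a list of (pos_s, w_es, w_u) tuples, so the mean realizes the
    1/|x'| rescaling."""
    preds = [model(pos_t, pos_s, w_es, w_u) for pos_s, w_es, w_u in labeled]
    return torch.stack(preds).mean(dim=0)
```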
Having described the operations of the global branch and the local branch according to embodiments of the present disclosure, the present disclosure will next describe the operation of the semi-global branch (the semi-global branch 35 in FIG. 3) according to an embodiment of the present disclosure.
The spatial arrangement of different objects is crucial for image processing. For example, at least two cues can be used to guess the position of an object in an image. The first cue is the structural arrangement between objects: for example, a "person" is usually observed in front of a "desk", while a "giraffe" is rarely observed in front of a "desk". Second, some objects or concepts tend to have concentrated spatial distributions; for example, "sky" is more often seen in the top region of an image.
As described above, the contextual entailment in the pre-trained text semantic extraction model of the global branch takes the accompanying text of the image (which contains global semantic information) as input, while the pixel-level first model of the local branch takes the pixel-level annotations of known categories (which contain local category information) as input. To allow the two kinds of information, in different modalities and at different scales, to complement each other, the present disclosure proposes to jointly train the global branch and the local branch with a consistency loss.
As described above, the semi-global branch is configured to generate, based on the annotation information and the accompanying text, the partition probability after the image is subdivided into multiple regions.
For example, the semi-global branch may use a second model to generate the partition probability, where the second model is obtained through training with the accompanying text and the annotation information.
For example, the partition probability includes a first probability distribution that each pixel in each of the multiple image subdivision regions generated after the image is subdivided into multiple regions comes from the unknown category, and a second probability distribution that the unknown category exists in each image subdivision region.
FIG. 5 is a flowchart of an operation 300 of training the second model according to an embodiment of the present disclosure. The operation of training the second model includes the following steps S301-S305.
In step S301, the image is subdivided into multiple regions along the vertical or horizontal direction.
In step S302, a first training probability distribution that the unknown category exists in each image subdivision region is generated based on the accompanying text.
In step S303, a second training probability distribution that each pixel in each of the multiple image subdivision regions comes from the unknown category is generated based on the annotation information.
In step S304, a loss function is constructed from the first training probability distribution and the second training probability distribution.
In step S305, the second model is trained with the loss function.
For step S302, the first training probability distribution can be generated based on the following model.
For example, the present disclosure can generate a category-specific spatial distribution for an image from its caption, assuming that the complex context in the caption can roughly indicate the position of an object. The implementation of this idea is still based on a customization of the BERT model. In most cases, an image and its vertically flipped version can be described by the same caption, which may complicate the prediction of the horizontal position of an object. Therefore, preferably, the model of the present disclosure focuses only on vertically localizing certain objects in the image. In particular, every image is divided into vertical regions of equal size. It should be understood that the image may also be subdivided into multiple regions of unequal size, which is not limited here.
For example, for an image x, suppose the image x is subdivided into K equally spaced regions along the vertical direction. Let v^k_{x,c} (rendered as an image in the original document) denote the number of image pixels classified as the unknown category c in the k-th region (k = 1…K); the distribution V_{x,c} of the unknown category c over all subdivision regions can then be obtained (Equation (5), also rendered as an image in the original), where each component of V_{x,c} is the normalized v^k_{x,c}.
In addition, another head H_s(·) can be attached to the backbone of the BERT model, with a K-output softmax placed at the end of the BERT model, so that the BERT model can be designed to estimate the spatial distribution of an unknown category c in image x (i.e., the distribution over the subdivision regions obtained by processing the image's accompanying text with the BERT model), also called the first training probability distribution:
V̂_{x,c} = softmax(H_s(φ(caption(x); [EOS]; description(c))))   (6)
where H_s(·) denotes a freely defined function, which is not limited here. It should be appreciated that the softmax activation function is merely an example, and activation functions such as sigmoid and tanh may also be used, which is not limited here. A sketch of such a K-way spatial head follows.
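By way of non-limiting illustration, the following Python sketch shows one possible K-output spatial head H_s of Equation (6); attaching it to the [CLS] vector of the BERT backbone sketched above is an assumption of this sketch.

```python
# Hedged sketch of the spatial head H_s (Equation (6)): a K-way softmax head
# estimating in which vertical strip the class is likely to appear.
import torch.nn as nn

K = 5  # number of vertical strips; the value is an illustrative assumption

class SpatialHead(nn.Module):
    def __init__(self, hidden_size: int, k: int = K):
        super().__init__()
        self.h_s = nn.Linear(hidden_size, k)

    def forward(self, cls_vector):
        # cls_vector: BERT [CLS] embedding of the CEQ for (image, class)
        return self.h_s(cls_vector).softmax(dim=-1)  # estimated V̂_{x,c}
```

For example, under the assumptions of the earlier sketch, this head could be instantiated as SpatialHead(model.bert.config.hidden_size) and fed the same [CLS] vector used by H_o.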
In addition, the BERT model can be trained with the following loss function L. For example, the BERT model is fine-tuned with a unified optimization objective L that pursues both image-specific category visual evidence and spatial distribution:
L = L_o + L_s   (7)
where, during training, H_s(·) is iteratively optimized by minimizing the distribution difference (via the constructed loss function L_s) between the corresponding pairs V_{x,c} and V̂_{x,c} for all known categories. For example, the loss function L_s can be realized through an information-entropy objective (Equation (8), rendered as an image in the original document). Here, H_o(·) and H_s(·), governed by L_o + L_s, complement each other. As above, v^k_{x,c} is the number of image pixels in the k-th region (k = 1…K) of image x classified as the unknown category c, and its normalized values form V_{x,c}, the spatial distribution of the unknown category c over the K regions of image x.
It should be appreciated that the above model for generating, based on the accompanying text, the first training probability distribution that the unknown category exists in each image subdivision region is not limited thereto; other suitable models may be used to generate the first training probability distribution, which is not limited here.
For step S303, the second training probability distribution can be generated based on the following model.
For example, for an unknown category u (u ∈ U) (which may also be denoted c as above), the spatial distribution of the unknown category u given by the first model M is inferred through Equation (4). Then, the pixels in each vertical subdivision region can be averaged (for example, a K-output softmax function can be placed at the end of the first model) to obtain the second training probability distribution (Equation (9), rendered as an image in the original document), whose k-th component represents the spatial distribution of the unknown category u in the k-th vertical subdivision region strip(k), k = 1…K. A sketch of this strip-averaging step follows.
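By way of non-limiting illustration, the following Python sketch shows the strip-averaging step described for Equation (9); normalizing by the sum (rather than a softmax) is an assumption of this sketch.

```python
# Hedged sketch of Equation (9): average the pixel-level unseen-class
# probabilities within each of K vertical strips, then normalize over strips.
import torch

def strip_distribution(pixel_prob: torch.Tensor, k: int = 5) -> torch.Tensor:
    """pixel_prob: (H, W) probability map from model M; returns (k,)."""
    h = pixel_prob.shape[0]
    strips = torch.stack([
        pixel_prob[i * h // k:(i + 1) * h // k].mean() for i in range(k)
    ])
    return strips / strips.sum()  # normalized distribution over the k strips
```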
For step S304, for example, the following loss function can be constructed from the L2 distance (Euclidean distance) between the above first training probability distribution (Equation (6)) and the second training probability distribution (Equation (9)); it should be appreciated that in the present disclosure both c and u (u ∈ U) denote the unknown category. The loss itself is given by Equation (10), which is rendered as an image in the original document.
Finally, in step S305, the constructed second model is trained with the above loss function.
It should be appreciated that the above model for generating, based on the annotation information, the second training probability distribution that each pixel in each of the multiple image subdivision regions comes from the unknown category is not limited thereto; other suitable models may be used to generate the second training probability distribution, which is not limited here. A sketch of the consistency loss follows.
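By way of non-limiting illustration, the following Python sketch shows the Euclidean consistency loss of Equation (10) between the two strip distributions.

```python
# Hedged sketch of Equation (10): the L2 (Euclidean) distance between the
# caption-derived strip distribution (Equation (6)) and the annotation-derived
# strip distribution (Equation (9)) for the same image and class.
import torch

def consistency_loss(v_caption: torch.Tensor, v_pixels: torch.Tensor):
    # both inputs: (K,) probability distributions over the vertical strips
    return torch.dist(v_caption, v_pixels, p=2)
```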
FIG. 6 is a schematic diagram of the effect of the semi-global branch according to an embodiment of the present disclosure.
FIG. 6 shows the spatial distributions of different categories in the images obtained after all images are divided into vertical regions of equal size according to the above second model. It can be seen that, for the same category "frisbee", the second model of the present disclosure obtains different results for different image captions.
As shown in FIG. 6, the two images on the left of FIG. 6 are divided into 5 regions along the vertical direction, and the distribution diagrams on the right of FIG. 6 represent the corresponding spatial distributions after each image is subdivided into 5 regions. It can be seen that, for the same category "frisbee", the spatial distribution corresponding to the upper-left image of FIG. 6 (upper right of FIG. 6) shows that the frisbee is more likely to be in the lower region, while the spatial distribution corresponding to the lower-left image of FIG. 6 (lower right of FIG. 6) shows that the frisbee is more likely to be in the upper region.
It is readily understood that the first model and the second model according to the embodiments of the present disclosure may adopt different neural network structures, including but not limited to convolutional neural networks, recurrent neural networks (RNNs), and the like. The convolutional neural networks include but are not limited to U-Net, ResNet, DenseNet, and the like.
The above has described in detail the generation of the probability and/or distribution of an unknown category using the unknown category acquisition model containing the local branch, the semi-global branch, and the global branch. With this method, the probability that each image contains the unknown category can be obtained at the pixel level, the image subdivision region level, and the whole-image level.
Further, the above probability information at different levels can be used as a training set, and a deep network such as U-Net can be adopted as the model body to construct the optimization objective function of an image segmentation model for unknown categories, so that image segmentation is performed by training the image segmentation model, thereby obtaining segmented images.
It should be appreciated that the neural network models in the present disclosure may include various neural network models, such as but not limited to: convolutional neural networks (CNNs) (including GoogLeNet, AlexNet, VGG networks, etc.), regions with convolutional neural networks (R-CNN), region proposal networks (RPNs), recurrent neural networks (RNNs), stack-based deep neural networks (S-DNNs), deep belief networks (DBNs), restricted Boltzmann machines (RBMs), fully convolutional networks, long short-term memory (LSTM) networks, and classification networks. In addition, a neural network model performing one task may include sub-neural networks, and these sub-neural networks may include heterogeneous neural networks and may be implemented with heterogeneous neural network models.
FIG. 7 shows a flowchart of an image segmentation method according to an embodiment of the present disclosure. As shown in FIG. 7, the image segmentation method includes the following steps S401-S402.
In step S401, a first image is acquired.
In step S402, the first image is processed with an image segmentation model to generate a segmented second image.
For example, the first image is the input image of the image segmentation model.
For example, the image segmentation model may be obtained by training an original image segmentation network with a first training set, the first training set containing the probability and/or distribution of the unknown category obtained with the image processing method shown in FIG. 1 above, where the second image contains multiple regions corresponding to different categories.
For example, the image segmentation model of the present disclosure may be a convolutional neural network, a recurrent neural network (RNN), or the like, and may be trained by constructing a loss function L:
L = L_SEG + λ·L_RS   (11)
where L is the loss function of the image segmentation model and λ is a weight factor used to balance the loss function L_SEG for known categories and the loss function L_RS for unknown categories. For example, the loss function L_SEG for known categories can be obtained through currently known techniques and is not described in detail here. A sketch of this combination follows.
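By way of non-limiting illustration, the following Python sketch shows the weighted combination of Equation (11); the default value of λ is an assumption of this sketch, as the present disclosure does not fix it here.

```python
# Hedged sketch of Equation (11): seen-class segmentation loss plus the
# unseen-class ranking loss, balanced by the weight factor λ.
def total_loss(l_seg, l_rs, lam: float = 0.1):
    """l_seg: loss over known categories; l_rs: ranking loss over unknown
    categories; lam: balancing weight λ (illustrative default)."""
    return l_seg + lam * l_rs
```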
The loss function L_RS for unknown categories can be constructed, for example, based on the probabilities of the unknown category obtained from the above semi-global branch and global branch.
For example, the present disclosure may adopt a pair-wise ranking loss to exploit the probability information of unknown categories. Given an image x ∈ X, suppose the penultimate layer of a typical network of this kind generates a feature map f = ψ(x), where ψ(·) encapsulates all involved neural operations. Let f ∈ R^{h×w×d}, where h×w defines the spatial resolution and d is the length of the extracted features; predictions in the image segmentation task are performed in a pixel-wise manner. For an image x, since the ground-truth label map y is accessible and, of course, only contains pixel-level annotations of the known categories S, it is assumed that unknown categories only appear in the unlabeled part. For pixels (i, j) in the feature map, Y can be denoted as the set of unlabeled pixel positions (given by an equation reproduced only as an image in the original document).
Given a pair of images x1, x2, the CNN model ψ(·) can be used to obtain encoded feature maps f_1, f_2, and the caption annotations r_1, r_2 can be used to generate the appearance probabilities s_{1,e}, s_{2,e} of a specific category e through the unknown category acquisition model of the present disclosure. If s_{1,e} > s_{2,e} for an unknown category e_u, image x1 can be considered more likely than image x2 to contain the category e_u; that is, the unlabeled part Y1 of x1 is more likely than the unlabeled part Y2 of x2 to contain the unknown category e_u (u ∈ U). The ranking loss can therefore be written as an equation (reproduced only as an image in the original document) in which I(s_{1,e}, s_{2,e}) acts as an indicator: I(s_{1,e}, s_{2,e}) = 1 if s_{1,e} > s_{2,e}, and −1 otherwise. w_e is the fixed word embedding (such as from a word2vec model) associated with category e ∈ S ∪ U, S denotes the known categories, U denotes the unknown categories, and e_u denotes the category of u (u ∈ U).
As described above, the spatial distribution of a category (i.e., the partition probability after the image is subdivided into multiple regions) can also be generated from the caption. Intuitively, such information can be used to prune the regions where the category appears. By treating the spatial distribution as a weight for each divided region, the ranking loss can be refined accordingly (the refined equation is also reproduced only as an image in the original document), where k ∈ (1, 2, …, N) is the index of the regions divided along the vertical direction and the predicted spatial distribution of category e_u (i.e., the partition probability obtained through the above semi-global branch) supplies the per-region weights. A hedged sketch of such a ranking loss follows.
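Since the ranking-loss equations are reproduced only as images in the source text, the following Python sketch is a hedged approximation of a pair-wise ranking loss with per-strip spatial weights: the margin form and the scoring of unlabeled-pixel features by an inner product with the class word embedding are assumptions of this sketch.

```python
# Hedged sketch of a pair-wise ranking loss with spatial weighting. The image
# judged more likely to contain unseen class e_u (larger s_{x,e_u} from the
# global branch) should score higher on its unlabeled pixels.
import torch

def ranking_loss(f1, f2, w_e, s1: float, s2: float,
                 strip_w1, strip_w2, margin: float = 0.2):
    """f1, f2: (K, N, d) unlabeled-pixel features grouped into K strips;
    w_e: (d,) class word embedding; s1, s2: appearance probabilities;
    strip_w1, strip_w2: (K,) predicted spatial distributions (weights)."""
    score1 = (strip_w1[:, None] * (f1 @ w_e)).mean()
    score2 = (strip_w2[:, None] * (f2 @ w_e)).mean()
    sign = 1.0 if s1 > s2 else -1.0  # indicator I(s_{1,e}, s_{2,e})
    return torch.relu(margin - sign * (score1 - score2))  # hinge, assumed
```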
Alternatively, for example, the loss function for unknown categories may be constructed based on the probabilities of the unknown category obtained from the above local branch, semi-global branch, and global branch, which is not limited here.
For example, the above image segmentation model can be trained on the server side. In the deployment phase, the trained model only needs to be deployed to the client for use. The data sets required for training the neural network model only need to be stored and used on the server side, without being deployed on the client.
It is readily understood that the neural network models according to the embodiments of the present disclosure may adopt different network structures, including but not limited to convolutional neural networks, recurrent neural networks (RNNs), and the like. The convolutional neural networks include but are not limited to U-Net, ResNet, DenseNet, and the like.
FIG. 8 schematically illustrates segmented images generated by the image segmentation model according to an embodiment of the present disclosure.
As shown in FIG. 8, the input images are the five images in the first row of FIG. 8, each containing different categories (for example, the first image contains categories such as dog, frisbee, and grass). The ground-truth images are segmented images obtained by image segmentation with manual annotation, and the segmented images contain multiple color-coded regions corresponding to different categories. It can be seen that, compared with other methods (e.g., SPNet), the segmented images generated by the image segmentation model of the present disclosure (the last row of FIG. 8) are closer to the ground-truth images and contain less noise.
FIG. 9 is a schematic diagram of a low-shot image segmentation method according to an embodiment of the present disclosure. As shown in FIG. 9, the present disclosure generates the probability and/or distribution 51 of an unknown category by using the unknown category acquisition model. The probability and/or distribution of the unknown category includes the probability that each pixel in the image comes from the unknown category, generated based on the annotation information 53 of known categories; the probability that the unknown category exists in the image, generated based on the accompanying text (contained in the image data set 55); and the partition probability after the image is subdivided into multiple regions, generated based on the annotation information 53 and the accompanying text (contained in the image data set 55). In the present disclosure, the unknown category 54 is not annotated. By using the probability and/or distribution of the unknown category as training data to train an image segmentation network, an image segmentation model 52 can be obtained, and the image segmentation model 52 can be used to segment input images.
The present disclosure generates the probability and/or distribution of an unknown category by using the unknown category acquisition model containing the local branch, the semi-global branch, and the global branch, and uses the probability and/or distribution of the unknown category as training data to train an image segmentation network. In this way, the unknown category in an image can be automatically annotated without pixel-level semantic annotation of the unknown category being provided, reducing the annotation cost and accelerating the development cycle, thereby saving substantial labor cost and time.
Specifically, the present disclosure generates the probability and/or distribution of an unknown category by using the unknown category acquisition model, and uses the probability and/or distribution of the unknown category as training data to train an image segmentation network, so that the unknown category in an image can be automatically annotated without pixel-level semantic annotation of the unknown category being provided, thereby saving substantial labor cost and time. Further, by maximizing the use of information in all collected data, the present disclosure improves the image processing model for the same annotation cost, or, for the same image processing model performance, reduces the annotation cost and accelerates the development cycle.
The image processing method according to embodiments of the present invention has been described above with reference to the accompanying drawings. An image processing apparatus according to an embodiment of the present disclosure will be described below.
FIG. 10 is a functional block diagram illustrating an image processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 10, the image processing apparatus 1000 according to an embodiment of the present disclosure includes an acquisition unit 1001 and a generation unit 1002. These modules may respectively perform the steps of the image processing method according to the embodiments of the present disclosure described above with reference to FIGS. 1 to 9. Those skilled in the art will understand that these unit modules may be implemented in various ways by hardware alone, by software alone, or by a combination thereof, and the present disclosure is not limited to any one of them. For example, these units may be implemented by a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), a field-programmable gate array (FPGA), or other forms of processing units with data processing capabilities and/or instruction execution capabilities, together with corresponding computer instructions.
For example, the acquisition unit 1001 is configured to acquire an image data set, the image data set containing images and accompanying text related to an unknown category in the images.
For example, the generation unit 1002 is configured to generate, by using an unknown category acquisition model, a probability and/or distribution of the unknown category, the probability and/or distribution of the unknown category including the probability that each pixel in the image comes from the unknown category, the probability that the unknown category exists in the image, and the partition probability after the image is subdivided into multiple regions.
For example, an image data set usually contains some kind of accompanying text, such as user comments and image captions under images on social networking sites. The accompanying text in the method described in the present disclosure takes the image caption as an example to show the use of accompanying text for low-shot image processing. It should be understood that the present disclosure may involve other forms of image accompanying text, which is not limited here.
For example, the unknown category acquisition model may contain a local branch, a semi-global branch, and a global branch. For example, the local branch may be configured to generate, based on annotation information of known categories, the probability that each pixel in the image comes from the unknown category; the global branch may be configured to generate, based on the accompanying text, the probability that the unknown category exists in the image; and the semi-global branch may be configured to generate, based on the annotation information and the accompanying text, the partition probability after the image is subdivided into multiple regions.
For example, the global branch may use a text semantic extraction model to generate, based on the accompanying text, the probability that the unknown category exists in the image.
For example, the text semantic extraction model is a Bidirectional Encoder Representations from Transformers (BERT) model, and the probability, generated with the BERT model, that the unknown category exists in the image is expressed as:
s_{x,c} = sigmoid(H_o(φ(caption(x); [EOS]; description(c))))   (18)
where H_o(·) denotes a freely defined function whose output is the probability, before the sigmoid function is applied, that the unknown category appears in the image; φ denotes the BERT model; x denotes the input image of the BERT model; caption(x) denotes the accompanying text of the image; EOS is the end-of-sentence marker in natural language processing; c denotes the unknown category; and description(c) denotes a keyword or textual description of the unknown category c.
For example, the local branch may use a first model to generate the probability that each pixel in the image comes from the unknown category, where the first model is obtained through training with the annotation information.
For example, the annotation information contains the coordinates of pixels of known categories, and the first model may be trained as follows: selecting pixels of one known category among multiple known categories in one image of the image data set as one piece of validation data in a validation set; selecting pixels of the other categories among the multiple known categories as one piece of training data in a training set; and training the first model based on the coordinates of the pixels of known categories in the validation set and the training set.
For example, the probability that each pixel in the image comes from the unknown category can be generated by the following first model M:
Given an image x, the pixel-level first model M of the present disclosure samples a source pixel s of a known category from all labeled pixels x′ of known categories, together with an unlabeled target pixel t. e_s denotes the category of the source pixel s; since the source pixel s is known to belong to a known category in the first model, e_s ∈ S, where S denotes the known categories and U denotes the unknown categories. The probability that the category of the unlabeled target pixel t belongs to the unknown category (e_t = u ∈ U) is then given by the same equation as in the method embodiment above (Equation (3), reproduced only as an image in the original document).
Here, position(p) denotes the two-dimensional coordinates of pixel p, normalized to [0, 1]. w_e ∈ R^d is the word embedding associated with category e (i.e., the feature vector obtained from a model such as word2vec), w_{e_s} is the word embedding associated with the category e_s of the source pixel s, and w_u is the word embedding associated with category u (u ∈ U).
Further, the spatial distribution of an unknown category u (u ∈ U) can be obtained by aggregating the predictions obtained from all labeled pixels, as in Equation (4) above (reproduced only as an image in the original document), where |x′| is the number of labeled pixels, which serves as a rescaling factor. Indeed, in this way, the pixel-level annotations of known categories can be used to generate the spatial distribution of certain unknown categories.
For example, the first model M can be trained with the annotation information of known categories. For example, in each iteration, pixels of one known category may be randomly selected to simulate pixels of an unknown category as one piece of validation data in the validation set, and pixels of the other known categories may be selected as one piece of training data in the training set; the first model M is trained based on the coordinates of the pixels of known categories in the validation set and the training set.
With the above first model M, the probability that each pixel in the image comes from the unknown category can be generated. It should be appreciated that the above first model M is merely an example, and the present disclosure may also adopt other suitable first models, which is not limited here.
For example, the semi-global branch may use a second model to generate the partition probability, where the second model is obtained through training with the accompanying text and the annotation information.
For example, the partition probability may include a first probability distribution that each pixel in each of the multiple image subdivision regions generated after the image is subdivided into multiple regions comes from the unknown category, and a second probability distribution that the unknown category exists in each image subdivision region.
For example, the second model may be trained as follows: subdividing the image into multiple regions along the vertical or horizontal direction; generating, based on the accompanying text, a first training probability distribution that the unknown category exists in each image subdivision region; generating, based on the annotation information, a second training probability distribution that each pixel in each of the multiple image subdivision regions comes from the unknown category; constructing a loss function from the first training probability distribution and the second training probability distribution; and training the second model with the loss function.
For example, constructing the loss function from the first training probability distribution and the second training probability distribution includes: constructing the loss function based on the Euclidean distance between the first training probability distribution and the second training probability distribution.
The image processing apparatus of the present disclosure generates the probability and/or distribution of an unknown category by using the unknown category acquisition model, and uses the probability and/or distribution of the unknown category as training data to train an image segmentation network, so that the unknown category in an image can be automatically annotated without pixel-level semantic annotation of the unknown category being provided, thereby saving substantial labor cost and time. Further, by maximizing the use of information in all collected data, the image processing apparatus of the present disclosure improves the image processing model for the same annotation cost, or, for the same image processing model performance, reduces the annotation cost and accelerates the development cycle.
An image processing device according to an embodiment of the present disclosure will be described below with reference to FIG. 11. FIG. 11 is a schematic diagram of an image processing device 2000 according to an embodiment of the present disclosure. Since the image processing device of this embodiment has the same details as the method described above with reference to FIG. 1, a detailed description of the same content is omitted here for simplicity.
As shown in FIG. 11, the image processing device 2000 includes a processor 210, a memory 220, and one or more computer program modules 221.
For example, the processor 210 and the memory 220 are connected through a bus system 230. For example, the one or more computer program modules 221 are stored in the memory 220. For example, the one or more computer program modules 221 include instructions for executing the image processing method provided by any embodiment of the present disclosure. For example, the instructions in the one or more computer program modules 221 may be executed by the processor 210. For example, the bus system 230 may be a commonly used serial or parallel communication bus, etc., which is not limited in the embodiments of the present disclosure.
For example, the processor 210 may be a central processing unit (CPU), a digital signal processor (DSP), a graphics processing unit (GPU), or another form of processing unit with data processing capabilities and/or instruction execution capabilities; it may be a general-purpose or dedicated processor and may control other components in the image processing device 2000 to perform desired functions.
The memory 220 may include one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, or flash memory. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 210 may run the program instructions to implement the functions (implemented by the processor 210) and/or other desired functions in the embodiments of the present disclosure, such as the image processing method. The computer-readable storage medium may also store various application programs and various data, such as the element features of the image data set, the first model, and various data used and/or generated by the application programs.
It should be noted that, for clarity and conciseness, not all constituent units of the image processing device 2000 are presented in the embodiments of the present disclosure. To realize the necessary functions of the image processing device 2000, those skilled in the art may provide and configure other constituent units not shown according to specific needs, which is not limited in the embodiments of the present disclosure.
For the technical effects of the image processing apparatus 1000 and the image processing device 2000 in different embodiments, reference may be made to the technical effects of the image processing method provided in the embodiments of the present disclosure, which are not repeated here.
The image processing apparatus 1000 and the image processing device 2000 can be used in various appropriate electronic devices.
At least one embodiment of the present disclosure further provides a computer-readable storage medium for storing a computer-readable program. FIG. 12 is a schematic diagram of a storage medium provided by at least one embodiment of the present disclosure. For example, as shown in FIG. 12, the storage medium 400 non-transitorily stores computer-readable instructions 401, and when the non-transitory computer-readable instructions are executed by a computer (including a processor), the image processing method provided by any embodiment of the present disclosure can be executed.
For example, the storage medium may be any combination of one or more computer-readable storage media. For example, when the program code is read by a computer, the computer can execute the program code stored in the computer storage medium to perform, for example, the image processing method provided by any embodiment of the present disclosure.
For example, the storage medium may include a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), flash memory, any combination of the foregoing storage media, or other suitable storage media.
Those skilled in the art will understand that various aspects of the present application can be illustrated and described in terms of several patentable classes or situations, including any new and useful process, machine, product, or composition of matter, or any new and useful improvement thereof. Accordingly, various aspects of the present application may be executed entirely by hardware, entirely by software (including firmware, resident software, microcode, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as a "data block", "module", "engine", "unit", "component", or "system". In addition, various aspects of the present application may be embodied as a computer product located in one or more computer-readable media, the product including computer-readable program code.
Specific words are used in this application to describe the embodiments of the application. For example, "one embodiment", "an embodiment", and/or "some embodiments" mean a certain feature, structure, or characteristic related to at least one embodiment of the present application. Therefore, it should be emphasized and noted that "an embodiment" or "one embodiment" or "an alternative embodiment" mentioned two or more times in different places in this specification does not necessarily refer to the same embodiment. In addition, certain features, structures, or characteristics in one or more embodiments of the present application may be combined as appropriate.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meanings as commonly understood by those of ordinary skill in the art to which the present disclosure belongs. It should also be understood that terms such as those defined in ordinary dictionaries should be interpreted as having meanings consistent with their meanings in the context of the relevant art, and should not be interpreted in an idealized or extremely formalized sense, unless expressly so defined herein.
The above is a description of the present disclosure and should not be considered a limitation thereof. Although several exemplary embodiments of the present disclosure have been described, those skilled in the art will readily appreciate that many modifications can be made to the exemplary embodiments without departing from the novel teachings and advantages of the present disclosure. Therefore, all such modifications are intended to be included within the scope of the present disclosure as defined by the claims. It should be understood that the above is a description of the present disclosure and should not be considered limited to the specific embodiments disclosed, and modifications to the disclosed embodiments and other embodiments are intended to be included within the scope of the appended claims. The present disclosure is defined by the claims and their equivalents.

Claims (20)

  1. An image processing method, comprising:
    acquiring an image data set, the image data set containing images and accompanying text related to an unknown category in the images; and
    generating, based on the image data set and by using an unknown category acquisition model, a probability and/or distribution of the unknown category, the probability and/or distribution of the unknown category comprising the probability that each pixel in the image comes from the unknown category, the probability that the unknown category exists in the image, and a partition probability after the image is subdivided into multiple regions.
  2. The method according to claim 1, wherein
    the unknown category acquisition model contains a local branch, a semi-global branch, and a global branch,
    wherein the local branch is configured to generate, based on annotation information of known categories, the probability that each pixel in the image comes from the unknown category; the global branch is configured to generate, based on the accompanying text, the probability that the unknown category exists in the image; and the semi-global branch is configured to generate, based on the annotation information and the accompanying text, the partition probability after the image is subdivided into multiple regions.
  3. The method according to claim 2, wherein
    the global branch uses a text semantic extraction model to generate, based on the accompanying text, the probability that the unknown category exists in the image.
  4. The method according to claim 3, wherein the text semantic extraction model is a Bidirectional Encoder Representations from Transformers (BERT) model, and the probability, generated with the BERT model, that the unknown category exists in the image is expressed as:
    s_{x,c} = sigmoid(H_o(φ(caption(x); [EOS]; description(c))))
    where H_o(·) denotes a freely defined function whose output is the probability, before the sigmoid function is applied, that the unknown category appears in the image; φ denotes the BERT model; x denotes the input image of the BERT model; caption(x) denotes the accompanying text of the image; EOS is the end-of-sentence marker in natural language processing; c denotes the unknown category; and description(c) denotes a keyword or textual description of the unknown category c.
  5. The method according to claim 2, wherein
    the local branch uses a first model to generate the probability that each pixel in the image comes from the unknown category, wherein the first model is obtained through training with the annotation information.
  6. The method according to claim 5, wherein the annotation information contains the coordinates of pixels of known categories, and the first model is trained by:
    selecting pixels of one known category among multiple known categories in one image of the image data set as one piece of validation data in a validation set;
    selecting pixels of the other categories among the multiple known categories as one piece of training data in a training set; and
    training the first model based on the coordinates of the pixels of known categories in the validation set and the training set.
  7. The method according to claim 2, wherein the semi-global branch uses a second model to generate the partition probability, and the second model is obtained through training with the accompanying text and the annotation information.
  8. The method according to claim 7, wherein the partition probability comprises a first probability distribution that each pixel in each of the multiple image subdivision regions generated after the image is subdivided into multiple regions comes from the unknown category, and a second probability distribution that the unknown category exists in each image subdivision region.
  9. The method according to claim 8, wherein the second model is trained by:
    subdividing the image into multiple regions along the vertical or horizontal direction;
    generating, based on the accompanying text, a first training probability distribution that the unknown category exists in each image subdivision region;
    generating, based on the annotation information, a second training probability distribution that each pixel in each of the multiple image subdivision regions comes from the unknown category;
    constructing a loss function from the first training probability distribution and the second training probability distribution; and
    training the second model with the loss function.
  10. The method according to claim 9, wherein constructing the loss function from the first training probability distribution and the second training probability distribution comprises:
    constructing the loss function based on the Euclidean distance between the first training probability distribution and the second training probability distribution.
  11. The method according to claim 1, wherein the accompanying text comprises user comments and/or image captions.
  12. An image segmentation method, comprising:
    acquiring a first image; and
    processing the first image with an image segmentation model to generate a segmented second image,
    wherein the image segmentation model is obtained by training an original image segmentation network with a first training set, the first training set containing the probability and/or distribution of the unknown category obtained with the method according to claim 1, and the second image contains multiple regions corresponding to different categories.
  13. An image processing apparatus, comprising:
    an acquisition unit configured to acquire an image data set, the image data set containing images and accompanying text related to an unknown category in the images; and
    a generation unit configured to generate, based on the image data set and by using an unknown category acquisition model, a probability and/or distribution of the unknown category, the probability and/or distribution of the unknown category comprising the probability that each pixel in the image comes from the unknown category, the probability that the unknown category exists in the image, and a partition probability after the image is subdivided into multiple regions.
  14. The apparatus according to claim 13, wherein
    the unknown category acquisition model contains a local branch, a semi-global branch, and a global branch,
    wherein the local branch is configured to generate, based on annotation information of known categories, the probability that each pixel in the image comes from the unknown category; the global branch is configured to generate, based on the accompanying text, the probability that the unknown category exists in the image; and the semi-global branch is configured to generate, based on the annotation information and the accompanying text, the partition probability after the image is subdivided into multiple regions.
  15. The apparatus according to claim 14, wherein
    the global branch uses a text semantic extraction model to generate, based on the accompanying text, the probability that the unknown category exists in the image.
  16. The apparatus according to claim 14, wherein
    the local branch uses a first model to generate the probability that each pixel in the image comes from the unknown category, wherein the first model is obtained through training with the annotation information.
  17. The apparatus according to claim 16, wherein the annotation information contains the coordinates of pixels of known categories, and the first model is trained by:
    selecting pixels of one known category among multiple known categories in one image of the image data set as one piece of validation data in a validation set;
    selecting pixels of the other categories among the multiple known categories as one piece of training data in a training set; and
    training the first model based on the coordinates of the pixels of known categories in the validation set and the training set.
  18. The apparatus according to claim 14, wherein
    the semi-global branch uses a second model to generate the partition probability, and the second model is obtained through training with the accompanying text and the annotation information.
  19. An image processing device, comprising:
    a processor; and
    a memory storing computer-readable instructions,
    wherein an image processing method is executed when the computer-readable instructions are run by the processor, the method comprising:
    acquiring an image data set, the image data set containing images and accompanying text related to an unknown category in the images; and
    generating, by using an unknown category acquisition model, a probability and/or distribution of the unknown category, the probability and/or distribution of the unknown category comprising the probability that each pixel in the image comes from the unknown category, the probability that the unknown category exists in the image, and a partition probability after the image is subdivided into multiple regions.
  20. A computer-readable storage medium for storing a computer-readable program, the program causing a computer to execute the image processing method according to claim 1.
PCT/CN2021/087579 2020-05-21 2021-04-15 Image processing method, apparatus, device, storage medium, and image segmentation method WO2021233031A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/754,158 US12039766B2 (en) 2020-05-21 2021-04-15 Image processing method, apparatus, and computer product for image segmentation using unseen class obtaining model

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010438187.9 2020-05-21
CN202010438187.9A CN111612010B (zh) 2020-05-21 2020-05-21 图像处理方法、装置、设备以及计算机可读存储介质

Publications (1)

Publication Number Publication Date
WO2021233031A1 true WO2021233031A1 (zh) 2021-11-25

Family

ID=72195904

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/087579 WO2021233031A1 (zh) 2020-05-21 2021-04-15 图像处理方法、装置、设备、存储介质以及图像分割方法

Country Status (3)

Country Link
US (1) US12039766B2 (zh)
CN (1) CN111612010B (zh)
WO (1) WO2021233031A1 (zh)


Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111612010B (zh) 2020-05-21 2024-07-16 京东方科技集团股份有限公司 Image processing method, apparatus, device, and computer-readable storage medium
CN112330685B (zh) * 2020-12-28 2021-04-06 北京达佳互联信息技术有限公司 Image segmentation model training, image segmentation method, apparatus, and electronic device
US11948358B2 (en) * 2021-11-16 2024-04-02 Adobe Inc. Self-supervised hierarchical event representation learning
US20230410541A1 (en) * 2022-06-18 2023-12-21 Kyocera Document Solutions Inc. Segmentation of page stream documents for bidirectional encoder representational transformers
CN116269285B (zh) * 2022-11-28 2024-05-28 电子科技大学 Non-contact normalized heart rate variability estimation system
CN116758359B (zh) * 2023-08-16 2024-08-06 腾讯科技(深圳)有限公司 Image recognition method and apparatus, and electronic device
CN117115565B (zh) * 2023-10-19 2024-07-23 南方科技大学 Image classification method and apparatus based on autonomous perception, and intelligent terminal

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105005794A (zh) * 2015-07-21 2015-10-28 太原理工大学 Image pixel semantic annotation method fusing multi-granularity context information
CN108229478A (zh) * 2017-06-30 2018-06-29 深圳市商汤科技有限公司 Image semantic segmentation and training method and apparatus, electronic device, storage medium, and program
CN108229519A (zh) * 2017-02-17 2018-06-29 北京市商汤科技开发有限公司 Image classification method, apparatus, and system
US20190087964A1 (en) * 2017-09-20 2019-03-21 Beihang University Method and apparatus for parsing and processing three-dimensional cad model
CN111612010A (zh) * 2020-05-21 2020-09-01 京东方科技集团股份有限公司 Image processing method, apparatus, device, and computer-readable storage medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5953442A (en) * 1997-07-24 1999-09-14 Litton Systems, Inc. Fingerprint classification via spatial frequency components
AUPP568698A0 (en) * 1998-09-03 1998-10-01 Canon Kabushiki Kaisha Region-based image compositing
JP5370267B2 (ja) * 2010-05-27 2013-12-18 株式会社デンソーアイティーラボラトリ Image processing system
US8503801B2 (en) * 2010-09-21 2013-08-06 Adobe Systems Incorporated System and method for classifying the blur state of digital image pixels
US11507800B2 (en) * 2018-03-06 2022-11-22 Adobe Inc. Semantic class localization digital environment
CN109376786A (zh) * 2018-10-31 2019-02-22 中国科学院深圳先进技术研究院 Image classification method, apparatus, terminal device, and readable storage medium
CN110059734B (zh) * 2019-04-02 2021-10-26 唯思科技(北京)有限公司 Training method for an object recognition classification model, object recognition method, apparatus, robot, and medium
CN110837836B (zh) * 2019-11-05 2022-09-02 中国科学技术大学 Semi-supervised semantic segmentation method based on confidence maximization
CN111311613B (zh) * 2020-03-03 2021-09-07 推想医疗科技股份有限公司 Image segmentation model training method, image segmentation method, and apparatus
CN113805824B (zh) * 2020-06-16 2024-02-09 京东方科技集团股份有限公司 Electronic apparatus and method for displaying an image on a display device
CN111932555A (zh) * 2020-07-31 2020-11-13 商汤集团有限公司 Image processing method and apparatus, and computer-readable storage medium
CN115797632B (zh) * 2022-12-01 2024-02-09 北京科技大学 Image segmentation method based on multi-task learning
CN117078714A (zh) * 2023-06-14 2023-11-17 北京百度网讯科技有限公司 Image segmentation model training method, apparatus, device, and storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113936141A (zh) * 2021-12-17 2022-01-14 深圳佑驾创新科技有限公司 Image semantic segmentation method and computer-readable storage medium
CN113936141B (zh) 2021-12-17 2022-02-22 深圳佑驾创新科技有限公司 Image semantic segmentation method and computer-readable storage medium

Also Published As

Publication number Publication date
US12039766B2 (en) 2024-07-16
CN111612010A (zh) 2020-09-01
CN111612010B (zh) 2024-07-16
US20220292805A1 (en) 2022-09-15

Similar Documents

Publication Publication Date Title
WO2021233031A1 (zh) Image processing method, apparatus, device, storage medium, and image segmentation method
US11373390B2 (en) Generating scene graphs from digital images using external knowledge and image reconstruction
US10586350B2 (en) Optimizations for dynamic object instance detection, segmentation, and structure mapping
EP3926531B1 (en) Method and system for visio-linguistic understanding using contextual language model reasoners
US20190279074A1 (en) Semantic Class Localization Digital Environment
CN109791600A (zh) Method for converting landscape video to portrait mobile layout
CN110378410B (zh) Multi-label scene classification method and apparatus, and electronic device
WO2021212601A1 (zh) Image-based assisted writing method, apparatus, medium, and device
JP2024526065A (ja) Method and apparatus for recognizing text
US11030726B1 (en) Image cropping with lossless resolution for generating enhanced image databases
US10755171B1 (en) Hiding and detecting information using neural networks
CN113869138A (zh) Multi-scale object detection method, apparatus, and computer-readable storage medium
US20230127525A1 (en) Generating digital assets utilizing a content aware machine-learning model
CN117033609B (zh) Text visual question answering method, apparatus, computer device, and storage medium
CN117216194B (zh) Knowledge question answering method and apparatus, device, and medium for the cultural relics and museum field
CN113779225B (zh) Entity linking model training method, entity linking method, and apparatus
US10699458B2 (en) Image editor for merging images with generative adversarial networks
WO2022222854A1 (zh) Data processing method and related device
KR102401113B1 (ko) Artificial neural network apparatus and method for automatic design generation using compensability information and UX-bit
CN117671426A (zh) Promptable segmentation model pre-training method and system based on concept distillation and CLIP
CN111445545B (zh) Method and apparatus for converting text to a texture map, storage medium, and electronic device
US10957017B1 (en) Synthetic image detector
CN115660069A (zh) Semi-supervised satellite image semantic segmentation network construction method, apparatus, and electronic device
Orhei Urban landmark detection using computer vision
Dibari et al. Semantic segmentation of multimodal point clouds from the railway context

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21809867

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21809867

Country of ref document: EP

Kind code of ref document: A1


32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29.06.2023)
