WO2021233031A1 - Image processing method, apparatus, device, storage medium, and image segmentation method - Google Patents
Image processing method, apparatus, device, storage medium, and image segmentation method
- Publication number: WO2021233031A1
- PCT application: PCT/CN2021/087579
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- probability
- model
- category
- unknown category
- Prior art date
Classifications
- G06F18/2415—Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/047—Probabilistic or stochastic networks
- G06N3/048—Activation functions
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation by performing operations on regions, e.g. growing, shrinking or watersheds
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/764—Recognition or understanding using classification, e.g. of video objects
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V10/82—Recognition or understanding using neural networks
- G06V10/84—Recognition or understanding using probabilistic graphical models from image or video features, e.g. Markov models or Bayesian networks
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
- G06V30/19147—Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V30/274—Syntactic or semantic context, e.g. balancing
- G06V2201/10—Recognition assisted with metadata
Definitions
- This application relates to an image processing method, device, equipment, computer readable storage medium, and image segmentation method.
- Image segmentation technology is one of the core issues in the field of computer vision. This technology aims to perform pixel-level semantic annotation on images.
- the input of the image segmentation model is generally an ordinary image or video frame, and the output is the semantic label of each pixel (the label category is usually specified in advance).
- an image processing method including: acquiring an image data set, the image data set containing an image and accompanying text related to an unknown category in the image; and using an unknown category acquisition model to generate the probability and/or distribution of the unknown category, the probability and/or distribution of the unknown category including the probability that each pixel in the image comes from the unknown category, the probability that the unknown category exists in the image, and the partition probability after the image is subdivided into multiple regions.
- the unknown category acquisition model includes a local branch, a semi-global branch, and a global branch, wherein the local branch is configured to generate, based on the annotation information of the known category, the probability that each pixel in the image comes from the unknown category; the global branch is configured to generate, based on the accompanying text, the probability that the unknown category exists in the image; and the semi-global branch is configured to generate, based on the annotation information and the accompanying text, the partition probability after the image is subdivided into multiple regions.
- the global branch is based on the accompanying text and uses a text semantic extraction model to generate the probability that the unknown category exists in the image.
- the text semantic extraction model is a Bidirectional Encoder Representations from Transformers (BERT) model, wherein the probability that the unknown category is present in the image, generated by the BERT model, is expressed as Sigmoid(H_o(Φ(CEQ(x, c)))), where CEQ(x, c) is the contextual entailment question formed by concatenating caption(x), EOS, and description(c), and:
- H_o(·) represents a freely defined function, and its output is the probability that an unknown category appears in the image before being processed by the sigmoid function;
- Φ represents the BERT model;
- caption(x) represents the accompanying text of the image;
- EOS is the end-of-sentence token in natural language processing;
- c represents an unknown category; and
- description(c) represents a keyword or text description of the unknown category c.
- the local branch uses a first model to generate the probability that each pixel in the image comes from the unknown category, wherein the first model is obtained by training with the annotation information.
- the annotation information includes the coordinates of pixels of a known category
- the first model is trained in the following manner: selecting pixels of one known category among the multiple known categories of an image in the image data set as one piece of verification data in the verification set; selecting pixels of other categories among the multiple known categories as one piece of training data in the training set; and training the first model based on the coordinates of the pixels of the known categories in the verification set and the training set.
- the semi-global branch uses a second model to generate the partition probability, and the second model is obtained through training of the accompanying text and the annotation information.
- the partition probability includes a first probability distribution that each pixel in each of the plurality of image subdivision areas generated after the image is subdivided into a plurality of areas comes from the unknown category, and a second probability distribution that the unknown category exists in each image subdivision area.
- the second model is trained in the following manner: the image is subdivided into a plurality of regions along the vertical direction or the horizontal direction; based on the accompanying text, a first training probability distribution in which the unknown category exists in each image subdivision area is generated; based on the annotation information, a second training probability distribution in which each pixel in each of the plurality of image subdivision areas comes from the unknown category is generated; a loss function is constructed according to the first training probability distribution and the second training probability distribution; and the second model is trained through the loss function.
- the constructing of a loss function according to the first training probability distribution and the second training probability distribution includes: constructing the loss function for image processing based on the Euclidean distance between the first training probability distribution and the second training probability distribution.
- the accompanying text includes user comments and/or image titles.
- an image segmentation method including: acquiring a first image; and processing the first image using an image segmentation model to generate a segmented second image, wherein the image segmentation model is obtained by training an original image segmentation network using a first training set, the first training set includes the probability and/or distribution of the unknown category obtained by the above image processing method, and the second image includes multiple regions corresponding to different categories.
- an image processing device including: an acquisition unit for acquiring an image data set, the image data set including an image and accompanying text related to an unknown category in the image; and a generating unit for using the unknown category acquisition model to obtain the probability and/or distribution of the unknown category, the probability and/or distribution of the unknown category including the probability that each pixel in the image comes from the unknown category, the probability that the unknown category exists in the image, and the partition probability after the image is subdivided into multiple regions.
- the unknown category acquisition model includes a local branch, a semi-global branch, and a global branch, wherein the local branch is configured to generate, based on the annotation information of the known category, the probability that each pixel in the image comes from the unknown category; the global branch is configured to generate, based on the accompanying text, the probability that the unknown category exists in the image; and the semi-global branch is configured to generate, based on the annotation information and the accompanying text, the partition probability after the image is subdivided into multiple regions.
- an image processing device including: a processor; and a memory in which computer-readable instructions are stored, wherein when the computer-readable instructions are executed by the processor, an image processing method is executed, the method including: acquiring an image data set, the image data set containing an image and accompanying text related to an unknown category in the image; and using an unknown category acquisition model to generate the probability and/or distribution of the unknown category, the probability and/or distribution of the unknown category including the probability that each pixel in the image comes from the unknown category, the probability that the unknown category exists in the image, and the partition probability after the image is subdivided into multiple regions.
- a computer-readable storage medium for storing a computer-readable program, which causes a computer to execute the above-mentioned image processing method.
- FIG. 1 shows a flowchart of an image processing method according to an embodiment of the present disclosure.
- FIG. 2 shows a schematic diagram of an example of image accompanying text according to an embodiment of the present disclosure.
- FIG. 3 shows a schematic diagram of an unknown category labeling method according to an embodiment of the present disclosure.
- FIG. 4 shows a flowchart of the operation of training a first model according to an embodiment of the present disclosure.
- FIG. 5 shows a flowchart of the operation of training a second model according to an embodiment of the present disclosure.
- FIG. 6 shows a schematic diagram of the effect of a semi-global branch according to an embodiment of the present disclosure.
- FIG. 7 shows a flowchart of an image segmentation method according to an embodiment of the present disclosure.
- FIG. 8 shows a schematic diagram of a segmented image generated by an image segmentation model according to an embodiment of the present disclosure.
- FIG. 9 shows a schematic diagram of a small sample image segmentation method according to an embodiment of the present disclosure.
- FIG. 10 shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure.
- FIG. 11 shows a block diagram of an image processing device according to an embodiment of the present disclosure.
- FIG. 12 shows a schematic diagram of a storage medium according to an embodiment of the present disclosure.
- “First”, “second”, and similar words used in the present disclosure do not indicate any order, quantity, or importance, but are only used to distinguish different components.
- “Including”, “comprising”, and other similar words mean that the element or item appearing before the word covers the elements or items listed after the word and their equivalents, and does not exclude other elements or items.
- Similar words such as “connected” or “coupled” are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. “Up”, “down”, “left”, “right”, etc. are only used to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may also change accordingly.
- the image segmentation model is obtained by collecting a large number of training images in advance, and performing semantic annotation at the pixel level, and then obtaining the optimal parameters of the model by means of machine learning.
- Semantic annotation in the image segmentation task is very labor intensive and severely restricts the scale of training data for this task.
- When an image segmentation model is deployed to a new application scenario, it usually encounters new unknown categories (also called few-shot or zero-shot categories). Semantic annotations of these unknown categories are extremely rare and may be completely missing in some cases.
- the small sample image segmentation task (also called the unknown category image segmentation task) aims to obtain an image segmentation model that can handle new categories from small-sample (or zero-sample) data.
- the present disclosure provides an image processing method that uses an unknown category acquisition model including a local branch, a semi-global branch, and a global branch to generate the probability and/or distribution of the unknown category, and uses the probability and/or distribution of the unknown category as training data to train the image segmentation network, so that the image segmentation network can automatically label the unknown category in the image without pixel-level semantic annotation of the unknown category being provided, thereby saving a lot of labor costs and time.
- At least one embodiment of the present disclosure provides an image processing method, an image processing apparatus, an image processing device, and a computer-readable storage medium.
- the following is a non-limiting description of the image processing method provided according to at least one embodiment of the present disclosure through several examples and embodiments; as described below, different features of these specific examples and embodiments can be combined with each other as long as they do not conflict, and the new examples and embodiments so obtained also fall within the protection scope of the present disclosure.
- Hereinafter, an image processing method according to an embodiment of the present disclosure will be described with reference to FIGS. 1-6.
- This method can be automatically completed by a computer or the like.
- the image processing method can be implemented in software, hardware, firmware, or any combination thereof, and is loaded and executed by a processor in a device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, and a network server.
- the image processing method is suitable for a computing device, including any electronic device with computing functions, such as a mobile phone, a laptop, a tablet, a desktop computer, a network server, etc., which can load and execute the image processing method; the embodiments of the present disclosure do not limit this.
- the computing device may include a central processing unit (CPU), a graphics processing unit (GPU), or other forms of processing units with data processing capabilities and/or instruction execution capabilities, as well as storage units, etc.; the computing device is also installed with an operating system, an application programming interface (for example, OpenGL (Open Graphics Library), Metal, etc.), and implements the image processing method provided by the embodiments of the present disclosure by running code or instructions.
- the computing device may also include a display component, such as a liquid crystal display (LCD) screen, an organic light emitting diode (OLED) display screen, a quantum dot light emitting diode (QLED) display screen, a projection component, or a VR head-mounted display device (for example, a VR helmet or VR glasses), which is not limited in the embodiments of the present disclosure.
- the display part can display the object to be displayed.
- the image processing method includes the following steps S101 to S102.
- the image processing described in the present disclosure may include image digitization, image encoding, image enhancement, image restoration, image segmentation, image analysis, etc., which are not limited here.
- the present disclosure takes image segmentation as an example for description.
- In step S101, an image data set is obtained, the image data set including an image and accompanying text related to an unknown category in the image.
- In step S102, an unknown category acquisition model is used to generate the probability and/or distribution of the unknown category.
- the probability and/or distribution of the unknown category includes the probability that each pixel in the image comes from the unknown category, the probability that the unknown category exists in the image, and the partition probability after the image is subdivided into multiple regions.
- the image data set usually contains some kind of accompanying text, such as user comments under an image on a social networking site, the image title, and so on.
- the accompanying text in the method described in the present disclosure takes an image caption as an example to show the use of the accompanying text for small sample image processing. It should be understood that the present disclosure may include other forms of image accompanying text, which is not limited here.
- the image title "The person wearing black short sleeves is playing guitar” is related to the unknown category "Guitar”
- the image title "The person wearing black short sleeves is playing the piano” is related to the unknown category " "Guitar” has nothing to do.
- the image title "A man in black short sleeves is playing a musical instrument” may be related to the unknown category "Guitar”.
- Figure 2 shows some examples of image titles.
- the image title is usually a sentence describing the most critical semantic content in the image.
- the image title is useful in the following situations: 1) the title directly contains keywords of the unknown category; 2) the probability that the unknown category exists in the image can be implicitly inferred from the title.
- the unknown category acquisition model may include local branches, semi-global branches, and global branches.
- Local branches, semi-global branches, and global branches may correspond to different modules.
- the local branch may be configured to generate the probability that each pixel in the image comes from the unknown category based on the annotation information of the known category
- the global branch may be configured to generate the probability that the unknown category exists in the image based on the accompanying text.
- the semi-global branch may be configured to generate a partition probability after the image is subdivided into multiple regions based on the annotation information and the accompanying text.
- Fig. 3 is a schematic diagram of an unknown category labeling method according to an embodiment of the present disclosure.
- the image processing method of the present disclosure reuses the existing annotation information 31 of the known categories and, at the same time, uses the accompanying text 32 of the image; the image is processed using a model that includes the local branch 33, the semi-global branch 35, and the global branch 37, to generate the probability of the existence of the unknown category at different levels (e.g., pixel level, image subdivision area level, and image global level).
- the local branch 33 generates, based on the annotation information 31 of the known categories, the probability that each pixel in the image comes from the unknown category (the pixel-level probability 34);
- the global branch 37 generates, based on the accompanying text 32, the probability that the unknown category exists in the image (the image global probability 38);
- the semi-global branch 35 generates, based on the annotation information 31 and the accompanying text 32, the partition probability 36 after the image is subdivided into multiple regions.
- the global branch may be based on the accompanying text, using a text semantic extraction model to generate the probability that the unknown category exists in the image.
- a context-sensitive pre-trained text semantic extraction model, such as Bidirectional Encoder Representations from Transformers (BERT), can be used to deal with a contextual entailment question (CEQ) built from the accompanying text, formed by concatenating caption(x), EOS, and description(c), where:
- x represents a specific image;
- caption(x) represents the text caption of the image;
- EOS is the end-of-sentence token in natural language processing;
- c represents an unknown category; and
- description(c) represents the keyword or text description of the unknown category c.
- the training process of the BERT model includes tasks related to context-based entailment relations between sentences; therefore, after the above-mentioned CEQ is sent to a deep network model such as BERT, its high-level output includes a judgment of the entailment relationship.
- the relationship between a pair of premise and hypothesis sentences can be divided into three categories: contradiction, neutral, and entailment.
- For example, "a football match involving many men" entails "some men are participating in a sport" and contradicts "no man is moving in the image".
- the goal of the above-mentioned CEQ is to predict the relationship between the premise and the hypothesis, which can be an entailment relationship or a contradiction; if the judgment is a strong entailment, it means that the unknown category c is consistent with the semantics of the image title.
- the judgment of the above-mentioned entailment relationship can be refined by introducing parameters.
- the output range of the CEQ judgment can be widened to [0, 1], and the relationship between the premise and the hypothesis can be predicted by converting it into a confidence-modulated binary classification.
- This can be achieved by attaching a fully connected head (denoted as H_o(·)) to the backbone of the BERT model.
- H_o(·) represents a freely defined function, which is not limited here; its output is the probability (before the sigmoid function) that a specific category appears in the image, and Φ represents the BERT model.
- the output of the activation function Sigmoid() lies in the interval [0, 1] and serves as the probability output; x represents the input image of the BERT model. It should be realized that the above activation function Sigmoid() is only an example, and activation functions such as softmax, tanh, etc. can also be used; there is no limitation here.
- the binary cross-entropy loss can be used to optimize the head H_o and the backbone Φ based on the known categories S, as shown below:
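The loss equation itself is not reproduced in this text; a standard binary cross-entropy objective consistent with the surrounding description would take the following form, where the indicator y_{x,c} and the score notation s_{x,c} are assumptions introduced for illustration:

```latex
L_o = -\sum_{x}\sum_{c \in S}\Big[\, y_{x,c}\log s_{x,c} + (1-y_{x,c})\log\big(1-s_{x,c}\big) \Big],
\qquad s_{x,c} = \mathrm{Sigmoid}\!\big(H_o(\Phi(\mathrm{CEQ}(x,c)))\big),
```

where y_{x,c} ∈ {0, 1} indicates whether the known category c ∈ S appears in image x.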
- a known category can be randomly simulated as an unknown category to serve as one piece of verification data in the verification set, and the other known categories can be used as training data in the training set.
- the BERT model is trained based on the unknown category (simulated from a known category) in the verification set and the known categories in the training set.
- the neural network can be trained through the loss function of equation (2) to obtain a neural network model based on BERT, and the probability of an unknown category appearing in the image can be obtained through the neural network model.
- The BERT model is only an example, and other suitable text semantic extraction models may also be used in the present disclosure, which is not limited here.
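To make the global branch concrete, the following is a minimal sketch (not the patent's implementation) of a BERT backbone Φ with an attached fully connected head H_o, built on the Hugging Face transformers library; the class name, the pooling choice, and the checkpoint are assumptions:

```python
import torch
from transformers import BertModel, BertTokenizer

class GlobalBranch(torch.nn.Module):
    """Sketch of the global branch: Sigmoid(H_o(Phi(CEQ(x, c))))."""

    def __init__(self, backbone_name: str = "bert-base-uncased"):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(backbone_name)
        self.backbone = BertModel.from_pretrained(backbone_name)            # Phi
        self.head_o = torch.nn.Linear(self.backbone.config.hidden_size, 1)  # H_o

    def forward(self, caption: str, description: str) -> torch.Tensor:
        # CEQ(x, c): caption(x) as premise, description(c) as hypothesis.
        # The tokenizer's [SEP] separator plays the role of the EOS token.
        enc = self.tokenizer(caption, description, return_tensors="pt")
        cls = self.backbone(**enc).pooler_output          # [CLS] representation
        return torch.sigmoid(self.head_o(cls)).squeeze()  # value in [0, 1]
```

For example, calling the (trained) module with the caption "The person wearing black short sleeves is playing guitar" and the description "guitar" would yield a scalar in [0, 1], read as the probability that the unknown category "guitar" is present in the image.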
- Hereinafter, the operation of the local branch (the local branch 33 in FIG. 3) according to the embodiment of the present disclosure will be described.
- the local branch may use a first model to generate the probability that each pixel in the image comes from the unknown category, wherein the first model is obtained by training with the annotation information.
- the first model proposed in the present disclosure may be implemented as a multilayer perceptron network, for example, which may be obtained by training on label information.
- the specific description of the first model is as follows: (1) The training set contains a certain number of known categories. Most of these categories have sufficient pixel-level semantic annotations, and standard machine learning models (such as encoding-decoding networks based on convolution and pooling operations, etc.) can be used to obtain high-quality image processing models. In other words, for a given image, each pixel can be provided with a high-confidence probability of a known category. (2) By using word embedding technology (such as word2vec), the keywords of each category can be feature vectorized. (3) The first model can be trained using the label information of the known category to generate the probability that each pixel in the image comes from the unknown category.
- FIG. 4 is a flowchart of an operation 200 of training a first model according to an embodiment of the present disclosure.
- the operation of training the first model includes the following steps S201 to S203.
- In step S201, pixels of one known category among a plurality of known categories in an image in the image data set are selected as one piece of verification data in the verification set.
- In step S202, pixels of other categories among the multiple known categories are selected as one piece of training data in the training set.
- In step S203, the first model is trained based on the coordinates of the pixels of the known categories in the verification set and the training set.
- the annotation information includes the coordinates of pixels of a known category.
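As an illustration of this leave-one-category-out scheme, the following sketch assumes a data layout mapping each known category to its labeled pixel coordinates; the function and variable names are hypothetical:

```python
import random

def split_known_categories(pixels_by_category: dict):
    """Hold out one known category per iteration to simulate an unknown one."""
    held_out = random.choice(list(pixels_by_category))  # simulated "unknown" category
    verification_set = {held_out: pixels_by_category[held_out]}
    training_set = {c: p for c, p in pixels_by_category.items() if c != held_out}
    return training_set, verification_set
```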
- the probability that each pixel in the image comes from an unknown category can be generated by the following first model M, which takes as input the coordinates of a source pixel s and a target pixel t together with the word embeddings of their categories, i.e., M(position(s), position(t), w_{e_s}, w_u):
- the pixel-level first model M of the present disclosure samples a source pixel s of a known category from all the labeled pixels x′ of the known categories, together with an unlabeled target pixel t.
- e_s represents the category of the source pixel s; since the source pixel s is known to belong to a known category in the first model, e_s ∈ S, where S represents the known categories and U represents the unknown categories.
- position(p) represents the two-dimensional coordinates of pixel p, normalized to the range [0, 1].
- w_e ∈ R^d is the word embedding related to category e (that is, the feature vector obtained through a model such as word2vec); w_{e_s} is the word embedding related to the category e_s of the source pixel s, and w_u is the word embedding related to the category u (u ∈ U).
- the spatial distribution of the unknown category u (u ∈ U) can be obtained by integrating the prediction results obtained from all the labeled pixels, for example by averaging the outputs of M over all labeled source pixels s.
- the first model M can be trained with the annotation information of known categories. For example, in each iteration, a pixel of a known category can be randomly selected to simulate a pixel of an unknown category as one piece of verification data in the verification set, and pixels of other known categories can be selected as one piece of training data in the training set. The first model M is then trained based on the coordinates of the pixels of the known categories in the verification set and the training set.
- the probability that each pixel in the image comes from the unknown category can be generated. It should be realized that the above-mentioned first model M is only an example, and the present disclosure may also adopt other suitable first models, which are not limited here.
- In this way, training can be performed using the annotation information of known categories, and the spatial distribution of unknown categories can be generated without providing unknown category labels, thereby saving a lot of labor costs and time.
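A minimal sketch of such a pixel-level model follows, assuming a multilayer perceptron over normalized pixel coordinates and word2vec-style embeddings; the network sizes and names are assumptions, not the patent's exact architecture:

```python
import torch

class FirstModelM(torch.nn.Module):
    """Scores whether target pixel t belongs to unknown category u,
    given a labeled source pixel s of known category e_s."""

    def __init__(self, embed_dim: int = 300, hidden: int = 128):
        super().__init__()
        # Inputs: 2-D coords of s and t (normalized to [0, 1]) plus the
        # word embeddings w_{e_s} and w_u of the two category keywords.
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(4 + 2 * embed_dim, hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden, 1),
        )

    def forward(self, pos_s, pos_t, w_es, w_u):
        z = torch.cat([pos_s, pos_t, w_es, w_u], dim=-1)
        # Probability that target pixel t comes from unknown category u.
        return torch.sigmoid(self.mlp(z)).squeeze(-1)
```

The spatial distribution of u over the whole image can then be approximated by averaging these scores over all labeled source pixels, in line with the integration described above.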
- Hereinafter, the operation of the semi-global branch (the semi-global branch 35 in FIG. 3) according to the embodiment of the present disclosure will be described.
- the spatial arrangement of different objects is very important for image processing. For example, at least two hints can be used to guess the position of an object in the image.
- the first hint is the structural arrangement between objects. For example, “people” are usually observed at the “desk”, and “giraffes” are rarely observed at the “desk”.
- the second hint is that certain objects or concepts often have a concentrated spatial distribution; for example, the "sky" is often seen in the top area of the image.
- the contextual entailment model (the pre-trained text semantic extraction model) in the global branch takes the accompanying text of the image (which contains global semantic information) as input, while the pixel-level first model in the local branch takes the pixel-level annotations of the known categories (which contain local category information) as input.
- the present disclosure therefore proposes to use a consistency loss to jointly train the global branch and the local branch.
- the semi-global branch is configured to generate a partition probability after the image is subdivided into multiple regions based on the annotation information and the accompanying text.
- the semi-global branch may use a second model to generate the partition probability, and the second model is obtained through training of the accompanying text and the annotation information.
- the partition probability includes a first probability distribution that each pixel in each image subdivision area comes from the unknown category, and a second probability distribution that the unknown category exists in each image subdivision area.
- FIG. 5 is a flowchart of an operation 300 of training a second model according to an embodiment of the present disclosure.
- the operation of training the second model includes the following steps S301 to S305.
- In step S301, the image is subdivided into a plurality of regions along the vertical direction or the horizontal direction.
- In step S302, based on the accompanying text, a first training probability distribution in which the unknown category exists in each image subdivision area is generated.
- In step S303, based on the annotation information, a second training probability distribution in which each pixel in each of the plurality of image subdivision areas comes from the unknown category is generated.
- In step S304, a loss function is constructed according to the first training probability distribution and the second training probability distribution.
- In step S305, the second model is trained through the loss function.
- the first training probability distribution can be generated based on the following model.
- the present disclosure can generate image category-specific spatial distribution from image titles. Assume that the complex context in the title can roughly tell the location of the object. The realization of this idea is still based on the customization of the BERT model. In most cases, the image and its vertically flipped version can be described with the same title, but this may complicate the prediction of the horizontal position of the object. Therefore, preferably, the model of the present disclosure only focuses on vertically positioning certain objects in the image. In particular, all images will be divided into vertical regions of equal length. It should be understood that the image can also be subdivided into multiple regions of unequal sizes, and there is no limitation here.
- Another head H_s(·) can be attached to the backbone of the BERT model, with a K-output softmax placed at its end, so that the BERT model can be designed to estimate the spatial distribution of a certain unknown category c in the image x (that is, the distribution over the subdivision areas, obtained by processing the image accompanying text through the BERT model), which is also called the first training probability distribution: Softmax(H_s(Φ(CEQ(x, c)))).
- H_s(·) represents a freely defined function, and there is no restriction here.
- the Softmax activation function is only an example; activation functions such as sigmoid, tanh, etc. can also be used, and there is no limitation here.
- the BERT model can be trained through the following loss function L.
- For example, the loss function L_s can be realized through an information-entropy objective.
- H_o(·) and H_s(·), controlled by L_o + L_s, are complementary to each other.
- the number of image pixels classified as the unknown category c in the k-th (k = 1, ..., K) region of image x appears in this objective.
- It should be noted that the model for generating the first training probability distribution of the unknown category existing in each image subdivision area is not limited to this; other suitable models can also be used to generate the first training probability distribution, and there is no restriction here.
- a second training probability distribution can be generated based on the following model.
- In step S304, for example, the loss function can be constructed from the L2 distance (Euclidean distance) between the first training probability distribution (Equation (6)) and the second training probability distribution (Equation (9)); it should be recognized that in this disclosure both c and u (u ∈ U) represent unknown categories, so the two distributions refer to the same unknown category.
- In step S305, the constructed second model is trained through the aforementioned loss function.
- the above-mentioned model for generating, based on the annotation information, the second training probability distribution of each pixel in each of the plurality of image subdivision areas coming from the unknown category is not limited to this; other suitable models may be used to generate the second training probability distribution, and there is no restriction here.
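The following sketch summarizes steps S301-S305 for one image and one unknown category, assuming K equal vertical regions and assumed tensor shapes; it is an illustration, not the patent's exact loss:

```python
import torch
import torch.nn.functional as F

def semi_global_consistency_loss(text_logits: torch.Tensor,
                                 pixel_probs: torch.Tensor,
                                 K: int) -> torch.Tensor:
    # First training probability distribution (step S302): where the
    # accompanying text places the unknown category across the K regions;
    # `text_logits` stands for the K outputs of the head H_s on BERT.
    p_text = F.softmax(text_logits, dim=-1)                       # shape (K,)

    # Second training probability distribution (steps S301 and S303):
    # aggregate the (H, W) pixel-level probability map of the category
    # over K equal vertical regions and normalize it.
    H = pixel_probs.shape[0]
    region_mass = torch.stack([pixel_probs[k * H // K:(k + 1) * H // K].sum()
                               for k in range(K)])
    p_pixel = region_mass / region_mass.sum().clamp_min(1e-8)

    # Steps S304 and S305: the Euclidean (L2) distance between the two
    # distributions is the loss used to train the second model.
    return torch.norm(p_text - p_pixel, p=2)
```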
- Fig. 6 is a schematic diagram of the effect of the semi-global branch according to an embodiment of the present disclosure.
- Fig. 6 shows the spatial distribution of different categories in the obtained image after all the images are divided into vertical regions of equal length according to the above-mentioned second model. It can be seen that for the same category of Frisbee, the second model of the present disclosure can obtain different results according to different image titles.
- the two images on the left side of Fig. 6 are divided into 5 regions in the vertical direction, and the distribution map on the right side of Fig. 6 shows the corresponding spatial distribution after each image is subdivided into 5 regions.
- The first model and the second model according to the embodiments of the present disclosure may adopt different neural network structures, including but not limited to convolutional neural networks, recurrent neural networks (RNN), and the like.
- the convolutional neural network includes but is not limited to U-Net neural network, ResNet, DenseNet, etc.
- through the unknown category acquisition model including the local branch, the semi-global branch, and the global branch, the probability and/or distribution of the unknown category can be obtained for each image, including pixel-level, image-subdivision-area-level, and global probabilities.
- the above-mentioned probability information at different levels can be used as the training set, and by using a deep network such as U-Net as the main body of the model, the optimization objective function of the image segmentation model for the unknown category can be constructed, so that images can be segmented by the trained image segmentation model, thereby obtaining segmented images.
- the neural network model in the present disclosure may include various neural network models, such as but not limited to: convolutional neural networks (CNN) (including GoogLeNet, AlexNet, VGG networks, etc.), regions with convolutional neural networks (R-CNN), region proposal networks (RPN), recurrent neural networks (RNN), stack-based deep neural networks (S-DNN), deep belief networks (DBN), restricted Boltzmann machines (RBM), fully convolutional networks, long short-term memory (LSTM) networks, and classification networks.
- the neural network model for performing a task may include one or more sub-neural networks.
- Fig. 7 shows a flowchart of an image segmentation method according to an embodiment of the present disclosure. As shown in FIG. 7, the image segmentation method includes the following steps S401 to S402.
- In step S401, a first image is acquired.
- In step S402, the first image is processed using an image segmentation model to generate a segmented second image.
- the first image is the input image of the image segmentation model.
- the image segmentation model may be obtained by training the original image segmentation network using the first training set, the first training set containing the probability and/or distribution of the unknown category obtained by the image processing method shown in FIG. 1, where The second image includes multiple regions corresponding to different categories.
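As a usage illustration of steps S401-S402 (the names and shapes are assumptions), segmentation reduces to a per-pixel arg-max over the trained model's class scores:

```python
import torch

def segment(seg_model: torch.nn.Module, first_image: torch.Tensor) -> torch.Tensor:
    """Apply a trained segmentation model to a (C, H, W) image tensor."""
    with torch.no_grad():
        logits = seg_model(first_image.unsqueeze(0))   # (1, num_classes, H, W)
    # The segmented second image: one category index per pixel, which
    # induces the multiple regions corresponding to different categories.
    return logits.argmax(dim=1).squeeze(0)             # (H, W)
```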
- the image segmentation model of the present disclosure may be a convolutional neural network, a recurrent neural network (RNN), etc., and can be trained by constructing a loss function L of the form L = L_SEG + λ·L_RS, where:
- L is the loss function of the image segmentation model;
- λ is a weighting factor used to balance the loss function L_SEG of the known categories and the loss function L_RS of the unknown categories.
- the loss function L SEG of the known category can be obtained by the currently known technology, which will not be described in detail here.
- the loss function L_RS of the unknown categories can be constructed, for example, based on the probabilities of the unknown categories obtained by the above-mentioned semi-global branch and global branch.
- the present disclosure may use pair-wise ranking loss to utilize probability information of unknown categories.
- f ∈ R^{h×w×d} denotes the extracted feature map, where h×w defines the spatial resolution and d is the length of the extracted feature.
- the prediction in the image segmentation task is performed in a pixel-by-pixel manner.
- since the ground-truth label map y can be accessed, and the ground-truth label map of course only contains the pixel-level annotations of the known categories S, it is assumed that the unknown categories will only appear in the unlabeled part.
- Y can be expressed as the collection of unlabeled pixel positions, i.e., the set of pixel positions whose labels in y do not belong to the known categories S.
- Given a pair of images x_1 and x_2, a CNN model can be used to obtain encoded feature maps f_1 and f_2, and the title annotations r_1, r_2 can be used to generate the occurrence probabilities s_{1,e}, s_{2,e} of specific categories through the unknown category acquisition model of the present disclosure. If s_{1,e_u} > s_{2,e_u}, the image x_1 can be considered more likely to contain the category e_u than the image x_2; in other words, the unlabeled part Y_1 of x_1 is more likely to contain the unknown category e_u (u ∈ U) than the unlabeled part Y_2 of x_2. Therefore, the ranking loss can be written as:
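The ranking-loss equation is not reproduced in this text; the following is a hedged sketch of a pair-wise hinge ranking loss consistent with the description, in which the mean pooling over the unlabeled parts Y_1, Y_2 and the margin value are assumptions:

```python
import torch

def pairwise_ranking_loss(resp_y1: torch.Tensor,
                          resp_y2: torch.Tensor,
                          margin: float = 0.2) -> torch.Tensor:
    """resp_yi: model responses for unknown category e_u over the
    unlabeled part Y_i of image x_i (assuming s_{1,e_u} > s_{2,e_u})."""
    a1 = resp_y1.mean()  # pooled response over Y1
    a2 = resp_y2.mean()  # pooled response over Y2
    # Hinge: penalize whenever a1 fails to exceed a2 by the margin.
    return torch.clamp(margin - (a1 - a2), min=0.0)
```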
- the spatial distribution of a certain category (that is, the partition probability after the image is subdivided into multiple regions) can also be generated from the title.
- this type of information can be used to trim the area where the category appears.
- k ∈ {1, 2, ..., N} is the index of the regions divided along the vertical direction, and the predicted spatial distribution of the category e_u is the partition probability obtained by the above-mentioned semi-global branch.
- the loss function of the unknown category can be constructed based on the probability of the unknown category obtained by the above-mentioned local branch, semi-global branch and global branch, and there is no limitation here.
- the above-mentioned image segmentation model can be trained on the server side.
- the trained model needs to be deployed to the client before it can be used.
- the data set required for the training of the neural network model only needs to be stored and used on the server side, and does not need to be deployed on the client side.
- the neural network model according to the embodiments of the present disclosure can adopt different network structures, including but not limited to convolutional neural network, recurrent neural network (RNN), and the like.
- the convolutional neural network includes but is not limited to U-Net neural network, ResNet, DenseNet, etc.
- FIG. 8 schematically depicts a schematic diagram of a segmented image generated by an image segmentation model according to an embodiment of the present disclosure.
- the input image is the five pictures in the first row of Fig. 8, and each picture contains different categories (for example, for the first picture, it contains categories such as dog, frisbee, grass, etc.).
- a true value (ground-truth) image is a segmented image obtained after image segmentation using manual labels; the segmented image contains regions represented by multiple colors corresponding to different categories. It can be seen that, compared with other methods (for example, SPNet), the segmented image generated by the image segmentation model of the present disclosure (the last row of FIG. 8) is closer to the true value image and contains less noise.
- Fig. 9 is a schematic diagram of a small sample image segmentation method according to an embodiment of the present disclosure.
- the present disclosure uses the unknown category acquisition model to generate the probability and/or distribution 51 of the unknown category.
- the probability and/or distribution of the unknown category includes the probability that each pixel in the image comes from the unknown category, generated based on the annotation information 53 of the known categories; the probability that the unknown category exists in the image, generated based on the accompanying text (contained in the image data set 55); and the partition probability after the image is subdivided into a plurality of regions, generated based on the annotation information 53 and the accompanying text (contained in the image data set 55).
- the unknown category 54 is not labeled.
- an image segmentation model 52 can be obtained, and the image segmentation model 52 can be used to segment the input image.
- the present disclosure uses the unknown category acquisition model including local branches, semi-global branches, and global branches to generate the probability and/or distribution of the unknown category, and uses the probability and/or distribution of the unknown category as training data to train the image segmentation network, so that the unknown category in the image is automatically labeled, which reduces the cost of annotation and speeds up the development cycle, thereby saving a lot of labor costs and time.
- the present disclosure uses the unknown category acquisition model to generate the probability and/or distribution of the unknown category, and uses the probability and/or distribution of the unknown category as training data to train the image segmentation network, so that the unknown categories in the image can be automatically labeled even when no pixel-level annotations of the unknown categories are provided, thereby saving a lot of labor costs and time.
- by maximizing the information in all collected data, the present disclosure improves the effect of the image processing model for the same labeling cost, or, for the same image processing model effect, reduces the labeling cost and accelerates the development cycle.
- FIG. 10 is a functional block diagram illustrating an image processing apparatus according to an embodiment of the present disclosure.
- the image processing apparatus 1000 according to an embodiment of the present disclosure includes an acquiring unit 1001 and a generating unit 1002.
- the above-mentioned modules can respectively execute the steps of the image processing method according to the embodiment of the present disclosure as described above with reference to FIGS. 1 to 9.
- these unit modules can be implemented in various ways by hardware alone, by software alone, or by a combination thereof, and the present disclosure is not limited to any one of them.
- For example, these units may be implemented by a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), a field programmable gate array (FPGA), or other forms of processing units.
- the acquiring unit 1001 is configured to acquire an image data set, the image data set including an image and accompanying text related to an unknown category in the image.
- the generating unit 1002 is configured to use the unknown category acquisition model to generate the probability and/or distribution of the unknown category, which includes the probability that each pixel in the image comes from the unknown category, the probability that the unknown category exists in the image, and the partition probability after the image is subdivided into multiple regions.
- an image data set usually contains some kind of accompanying text, such as user comments and image titles under social networking site images.
- the accompanying text in the method described in the present disclosure takes an image caption as an example to show the use of the accompanying text for small sample image processing. It should be understood that the present disclosure may include other forms of image accompanying text, which is not limited here.
- the unknown category acquisition model may include local branches, semi-global branches, and global branches.
- the local branch may be configured to generate the probability that each pixel in the image comes from the unknown category based on the annotation information of the known category
- the global branch may be configured to generate the probability that the unknown category exists in the image based on the accompanying text.
- the semi-global branch may be configured to generate a partition probability after the image is subdivided into multiple regions based on the annotation information and the accompanying text.
- Based on the accompanying text, the global branch may use a text semantic extraction model to generate the probability that the unknown category exists in the image.
- In some embodiments, the text semantic extraction model is a Bidirectional Encoder Representations from Transformers (BERT) model, and the probability that the unknown category exists in the image is generated with the BERT model as:

  s_{x,c} = sigmoid(H_o(φ(caption(x); [EOS]; description(c))))

- H_o(·) represents a freely defined function whose output is the score, before the sigmoid function is applied, that the unknown category appears in the image.
- φ represents the BERT model.
- x represents the input image of the BERT model.
- caption(x) represents the accompanying text of the image.
- [EOS] is the end-of-sentence separator used in natural language processing.
- c represents the unknown category.
- description(c) represents a keyword or textual description of the unknown category c.
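To make the global branch concrete, here is a minimal sketch of such a scoring function built on the HuggingFace transformers library. The checkpoint name, the linear head standing in for H_o, and the way caption and category description are paired are illustrative assumptions, not the disclosure's exact implementation; the tokenizer's [SEP] token plays the role of the sentence separator written [EOS] above, and the head would have to be trained before its scores are meaningful:

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-uncased")

# Stand-in for H_o: a freely defined head mapping the pooled BERT
# feature to a single pre-sigmoid score.
H_o = nn.Linear(bert.config.hidden_size, 1)

def unknown_category_score(caption: str, description: str) -> float:
    """Score how likely the category described by `description` is to
    appear in the image whose accompanying text is `caption`."""
    inputs = tokenizer(caption, description, return_tensors="pt", truncation=True)
    with torch.no_grad():
        pooled = bert(**inputs).pooler_output  # shape (1, hidden_size)
    return torch.sigmoid(H_o(pooled)).item()

print(unknown_category_score(
    "two giraffes grazing near the trees",
    "giraffe: a tall, long-necked African animal"))
```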
- The local branch may use a first model to generate the probability that each pixel in the image comes from the unknown category, where the first model is obtained by training on the annotation information.
- The annotation information includes the coordinates of pixels of known categories.
- The first model can be trained in the following manner: selecting the pixels of one known category, among the multiple known categories in an image of the image data set, as one piece of validation data in the validation set; selecting the pixels of the other known categories as one piece of training data in the training set; and training the first model based on the coordinates of the known-category pixels in the validation set and the training set.
- A first model M can be used to generate the probability that each pixel in the image comes from an unknown category.
- The pixel-level first model M of the present disclosure samples a source pixel s of a known category from all labeled pixels x′ of the known categories, together with an unlabeled target pixel t, and predicts from them the probability that t belongs to the unknown category.
- e_s denotes the category of the source pixel s; since the source pixel s is known to belong to a known category in the first model, e_s ∈ S, where S denotes the set of known categories and U denotes the set of unknown categories.
- position(p) denotes the two-dimensional coordinates of pixel p, normalized to the range [0, 1].
- w_e ∈ R^d is the word embedding associated with category e (that is, the feature vector produced by a model such as word2vec); w_{e_s} is the word embedding associated with the category e_s of the source pixel s, and w_u is the word embedding associated with an unknown category u (u ∈ U).
- The spatial distribution of an unknown category u (u ∈ U) can then be obtained by aggregating the predictions obtained from all labeled pixels.
- The first model M can be trained with the labeling information of the known categories. For example, in each iteration, the pixels of one randomly selected known category can simulate pixels of an unknown category and serve as one piece of validation data in the validation set, while pixels of the other known categories serve as training data in the training set; the first model M is then trained based on the coordinates of the known-category pixels in the validation set and the training set.
- In this way, the probability that each pixel in the image comes from the unknown category can be generated. It should be noted that the above first model M is only an example; the present disclosure may also adopt other suitable first models, which are not limited here.
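As a sketch of what such a first model could look like, the following assumes a small multilayer perceptron that maps the source and target coordinates and the two word embeddings to a probability, combined with the leave-one-category-out split described above. The architecture, the embedding size, and the category names are illustrative assumptions, not the disclosure's exact design:

```python
import torch
from torch import nn

EMBED_DIM = 300  # assumed word-embedding size, e.g. word2vec vectors

# Assumed form of M: input = position(s), position(t), w_{e_s}, w_u;
# output = probability that target pixel t belongs to unknown category u.
first_model = nn.Sequential(
    nn.Linear(2 + 2 + EMBED_DIM + EMBED_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
    nn.Sigmoid(),
)

def predict(pos_s, pos_t, w_es, w_u):
    """Probability that t (coordinates normalized to [0, 1]) comes from
    the category embedded as w_u, given a labeled source pixel s."""
    return first_model(torch.cat([pos_s, pos_t, w_es, w_u], dim=-1))

# Leave-one-category-out protocol: in each round, one known category
# simulates the unknown category and supplies the validation data.
known_categories = ["person", "car", "tree", "road"]  # illustrative
for held_out in known_categories:
    train_categories = [c for c in known_categories if c != held_out]
    validation_categories = [held_out]
    # ... sample (source, target) pixel pairs from each split and
    # fit first_model on the training pairs ...

# Example call with random stand-ins for coordinates and embeddings:
p = predict(torch.rand(2), torch.rand(2),
            torch.randn(EMBED_DIM), torch.randn(EMBED_DIM))
```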
- The semi-global branch may use a second model to generate the partition probability; the second model is obtained by training on the accompanying text and the annotation information.
- The partition probability may include a first probability distribution that each pixel in each of the multiple image subdivision regions, generated after the image is subdivided into multiple regions, comes from the unknown category, and a second probability distribution that the unknown category exists in each image subdivision region.
- The second model may be trained in the following manner: subdividing the image into multiple regions along the vertical or horizontal direction; based on the accompanying text, generating a first training probability distribution that the unknown category exists in each image subdivision region; based on the annotation information, generating a second training probability distribution that each pixel in each of the multiple image subdivision regions comes from the unknown category; constructing a loss function from the first training probability distribution and the second training probability distribution; and training the second model through the loss function.
- Constructing the loss function from the first training probability distribution and the second training probability distribution includes constructing the loss function based on the Euclidean distance between the first training probability distribution and the second training probability distribution.
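Written out, if P_text denotes the first training probability distribution (obtained from the accompanying text) and P_pixel the second (obtained from the annotation information), each over the R subdivision regions, one natural reading of this loss is the following sketch (the symbols P_text and P_pixel are illustrative names, not the disclosure's notation):

```latex
\mathcal{L} \;=\; \bigl\lVert P_{\mathrm{text}} - P_{\mathrm{pixel}} \bigr\rVert_2
\;=\; \sqrt{\sum_{r=1}^{R} \bigl(P_{\mathrm{text}}(r) - P_{\mathrm{pixel}}(r)\bigr)^{2}}
```

Minimizing this Euclidean distance pushes the text-level evidence and the pixel-level evidence toward agreeing on where in the image the unknown category lies.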
- Like the method, the image processing device of the present disclosure uses the unknown category acquisition model to process images and generate the probability and/or distribution of the unknown category, and uses that probability and/or distribution as training data to train the image segmentation network, so that the unknown categories in the image are marked automatically even when no labeled pixels of the unknown category are provided, saving substantial labor cost and time.
- By maximizing the use of the information in all collected data, the image processing device improves the image processing model at the same labeling cost, or, for the same model performance, reduces the labeling cost and shortens the development cycle.
- FIG. 11 is a schematic diagram of an image processing apparatus 2000 according to an embodiment of the present disclosure. Since the image processing device of this embodiment has the same details as the method described above with reference to FIG. 1, a detailed description of the same content is omitted here for the sake of simplicity.
- The image processing device 2000 includes a processor 210, a memory 220, and one or more computer program modules 221.
- The processor 210 and the memory 220 are connected through a bus system 230.
- One or more computer program modules 221 are stored in the memory 220.
- The one or more computer program modules 221 include instructions for executing the image processing method provided by any embodiment of the present disclosure.
- The instructions in the one or more computer program modules 221 may be executed by the processor 210.
- The bus system 230 may be a commonly used serial or parallel communication bus, etc., which is not limited in the embodiments of the present disclosure.
- The processor 210 may be a central processing unit (CPU), a digital signal processor (DSP), a graphics processing unit (GPU), or another form of processing unit with data processing capabilities and/or instruction execution capabilities, and may be a general-purpose processor.
- The memory 220 may include one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
- The volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache).
- The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, or flash memory.
- One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 210 may run the program instructions to implement the functions of the embodiments of the present disclosure and/or other desired functions, for example, the image processing method.
- The computer-readable storage medium may also store various application programs and various data, such as the element characteristics of the image data set, the first model, and various data used and/or generated by the application programs.
- For clarity and conciseness, not all constituent units of the image processing device 2000 are described in the embodiments of the present disclosure.
- Those skilled in the art may provide and configure other, unshown constituent units according to specific needs, which is not limited by the embodiments of the present disclosure.
- The image processing apparatus 1000 and the image processing device 2000 can be used in various appropriate electronic devices.
- FIG. 12 is a schematic diagram of a storage medium provided by at least one embodiment of the present disclosure.
- As shown in FIG. 12, the storage medium 400 non-transitorily stores computer-readable instructions 401; when the computer-readable instructions are executed by a computer (including a processor), the image processing method provided by any embodiment of the present disclosure can be executed.
- The storage medium may be any combination of one or more computer-readable storage media.
- For example, when the program code is read by a computer, the computer can execute the program code stored in the computer storage medium to perform, for example, the image processing method provided in any embodiment of the present disclosure.
- For example, the storage medium may include a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), or flash memory, or any combination of the foregoing, and may also be another suitable storage medium.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Image Analysis (AREA)
Claims (20)
- 1. An image processing method, comprising: acquiring an image data set, the image data set comprising an image and accompanying text related to an unknown category in the image; and, based on the image data set, using an unknown category acquisition model to generate a probability and/or distribution of the unknown category, the probability and/or distribution of the unknown category comprising a probability that each pixel in the image comes from the unknown category, a probability that the unknown category exists in the image, and a partition probability after the image is subdivided into multiple regions.
- 2. The method according to claim 1, wherein the unknown category acquisition model comprises a local branch, a semi-global branch, and a global branch, wherein the local branch is configured to generate, based on annotation information of known categories, the probability that each pixel in the image comes from the unknown category; the global branch is configured to generate, based on the accompanying text, the probability that the unknown category exists in the image; and the semi-global branch is configured to generate, based on the annotation information and the accompanying text, the partition probability after the image is subdivided into multiple regions.
- 3. The method according to claim 2, wherein the global branch uses a text semantic extraction model to generate, based on the accompanying text, the probability that the unknown category exists in the image.
- 4. The method according to claim 3, wherein the text semantic extraction model is a Bidirectional Encoder Representations from Transformers (BERT) model, and the probability that the unknown category exists in the image is generated with the BERT model as: s_{x,c} = sigmoid(H_o(φ(caption(x); [EOS]; description(c)))), where H_o(·) represents a freely defined function whose output is the score, before the sigmoid function is applied, that the unknown category appears in the image, φ represents the BERT model, x represents the input image of the BERT model, caption(x) represents the accompanying text of the image, [EOS] is the end-of-sentence separator in natural language processing, c represents the unknown category, and description(c) represents a keyword or textual description of the unknown category c.
- 5. The method according to claim 2, wherein the local branch uses a first model to generate the probability that each pixel in the image comes from the unknown category, the first model being obtained by training on the annotation information.
- 6. The method according to claim 5, wherein the annotation information comprises coordinates of pixels of known categories, and the first model is trained by: selecting pixels of one known category among multiple known categories in an image of the image data set as one piece of validation data in a validation set; selecting pixels of the other categories among the multiple known categories as one piece of training data in a training set; and training the first model based on the coordinates of the pixels of the known categories in the validation set and the training set.
- 7. The method according to claim 2, wherein the semi-global branch uses a second model to generate the partition probability, the second model being obtained by training on the accompanying text and the annotation information.
- 8. The method according to claim 7, wherein the partition probability comprises a first probability distribution that each pixel in each of the multiple image subdivision regions, generated after the image is subdivided into multiple regions, comes from the unknown category, and a second probability distribution that the unknown category exists in each image subdivision region.
- 9. The method according to claim 8, wherein the second model is trained by: subdividing the image into multiple regions along a vertical or horizontal direction; based on the accompanying text, generating a first training probability distribution that the unknown category exists in each image subdivision region; based on the annotation information, generating a second training probability distribution that each pixel in each of the multiple image subdivision regions comes from the unknown category; constructing a loss function from the first training probability distribution and the second training probability distribution; and training the second model through the loss function.
- 10. The method according to claim 9, wherein constructing the loss function from the first training probability distribution and the second training probability distribution comprises: constructing the loss function based on a Euclidean distance between the first training probability distribution and the second training probability distribution.
- 11. The method according to claim 1, wherein the accompanying text comprises user comments and/or image captions.
- 12. An image segmentation method, comprising: acquiring a first image; and processing the first image with an image segmentation model to generate a segmented second image, wherein the image segmentation model is obtained by training an original image segmentation network with a first training set, the first training set comprising the probability and/or distribution of the unknown category obtained with the method according to claim 1, and wherein the second image comprises multiple regions corresponding to different categories.
- 13. An image processing apparatus, comprising: an acquiring unit configured to acquire an image data set, the image data set comprising an image and accompanying text related to an unknown category in the image; and a generating unit configured to, based on the image data set, use an unknown category acquisition model to generate a probability and/or distribution of the unknown category, the probability and/or distribution of the unknown category comprising a probability that each pixel in the image comes from the unknown category, a probability that the unknown category exists in the image, and a partition probability after the image is subdivided into multiple regions.
- 14. The apparatus according to claim 13, wherein the unknown category acquisition model comprises a local branch, a semi-global branch, and a global branch, wherein the local branch is configured to generate, based on annotation information of known categories, the probability that each pixel in the image comes from the unknown category; the global branch is configured to generate, based on the accompanying text, the probability that the unknown category exists in the image; and the semi-global branch is configured to generate, based on the annotation information and the accompanying text, the partition probability after the image is subdivided into multiple regions.
- 15. The apparatus according to claim 14, wherein the global branch uses a text semantic extraction model to generate, based on the accompanying text, the probability that the unknown category exists in the image.
- 16. The apparatus according to claim 14, wherein the local branch uses a first model to generate the probability that each pixel in the image comes from the unknown category, the first model being obtained by training on the annotation information.
- 17. The apparatus according to claim 16, wherein the annotation information comprises coordinates of pixels of known categories, and the first model is trained by: selecting pixels of one known category among multiple known categories in an image of the image data set as one piece of validation data in a validation set; selecting pixels of the other categories among the multiple known categories as one piece of training data in a training set; and training the first model based on the coordinates of the pixels of the known categories in the validation set and the training set.
- 18. The apparatus according to claim 14, wherein the semi-global branch uses a second model to generate the partition probability, the second model being obtained by training on the accompanying text and the annotation information.
- 19. An image processing device, comprising: a processor; and a memory storing computer-readable instructions, wherein, when the computer-readable instructions are executed by the processor, an image processing method is performed, the method comprising: acquiring an image data set, the image data set comprising an image and accompanying text related to an unknown category in the image; and using an unknown category acquisition model to generate a probability and/or distribution of the unknown category, the probability and/or distribution of the unknown category comprising a probability that each pixel in the image comes from the unknown category, a probability that the unknown category exists in the image, and a partition probability after the image is subdivided into multiple regions.
- 20. A computer-readable storage medium storing a computer-readable program, the program causing a computer to execute the image processing method according to claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/754,158 US12039766B2 (en) | 2020-05-21 | 2021-04-15 | Image processing method, apparatus, and computer product for image segmentation using unseen class obtaining model |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010438187.9 | 2020-05-21 | ||
- CN202010438187.9A CN111612010B (zh) | 2020-05-21 | 2020-05-21 | Image processing method, apparatus, and device, and computer-readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021233031A1 true WO2021233031A1 (zh) | 2021-11-25 |
Family
ID=72195904
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
- PCT/CN2021/087579 WO2021233031A1 (zh) | 2020-05-21 | 2021-04-15 | Image processing method and apparatus, device, storage medium, and image segmentation method |
Country Status (3)
Country | Link |
---|---|
US (1) | US12039766B2 (zh) |
CN (1) | CN111612010B (zh) |
WO (1) | WO2021233031A1 (zh) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN111612010B (zh) | 2020-05-21 | 2024-07-16 | 京东方科技集团股份有限公司 | Image processing method, apparatus, and device, and computer-readable storage medium |
- CN112330685B (zh) * | 2020-12-28 | 2021-04-06 | 北京达佳互联信息技术有限公司 | Image segmentation model training, image segmentation method and apparatus, and electronic device |
US11948358B2 (en) * | 2021-11-16 | 2024-04-02 | Adobe Inc. | Self-supervised hierarchical event representation learning |
US20230410541A1 (en) * | 2022-06-18 | 2023-12-21 | Kyocera Document Solutions Inc. | Segmentation of page stream documents for bidirectional encoder representational transformers |
- CN116269285B (zh) * | 2022-11-28 | 2024-05-28 | 电子科技大学 | Non-contact normalized heart rate variability estimation system |
- CN116758359B (zh) * | 2023-08-16 | 2024-08-06 | 腾讯科技(深圳)有限公司 | Image recognition method and apparatus, and electronic device |
- CN117115565B (zh) * | 2023-10-19 | 2024-07-23 | 南方科技大学 | Autonomous-perception-based image classification method and apparatus, and intelligent terminal |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5953442A (en) * | 1997-07-24 | 1999-09-14 | Litton Systems, Inc. | Fingerprint classification via spatial frequency components |
AUPP568698A0 (en) * | 1998-09-03 | 1998-10-01 | Canon Kabushiki Kaisha | Region-based image compositing |
- JP5370267B2 (ja) * | 2010-05-27 | 2013-12-18 | 株式会社デンソーアイティーラボラトリ | Image processing system |
US8503801B2 (en) * | 2010-09-21 | 2013-08-06 | Adobe Systems Incorporated | System and method for classifying the blur state of digital image pixels |
US11507800B2 (en) * | 2018-03-06 | 2022-11-22 | Adobe Inc. | Semantic class localization digital environment |
- CN109376786A (zh) * | 2018-10-31 | 2019-02-22 | 中国科学院深圳先进技术研究院 | Image classification method and apparatus, terminal device, and readable storage medium |
- CN110059734B (zh) * | 2019-04-02 | 2021-10-26 | 唯思科技(北京)有限公司 | Training method for a target recognition and classification model, and object recognition method, apparatus, robot, and medium |
- CN110837836B (zh) * | 2019-11-05 | 2022-09-02 | 中国科学技术大学 | Semi-supervised semantic segmentation method based on confidence maximization |
- CN111311613B (zh) * | 2020-03-03 | 2021-09-07 | 推想医疗科技股份有限公司 | Image segmentation model training method, and image segmentation method and apparatus |
- CN113805824B (zh) * | 2020-06-16 | 2024-02-09 | 京东方科技集团股份有限公司 | Electronic apparatus and method for displaying an image on a display device |
- CN111932555A (zh) * | 2020-07-31 | 2020-11-13 | 商汤集团有限公司 | Image processing method and apparatus, and computer-readable storage medium |
- CN115797632B (zh) * | 2022-12-01 | 2024-02-09 | 北京科技大学 | Image segmentation method based on multi-task learning |
- CN117078714A (zh) * | 2023-06-14 | 2023-11-17 | 北京百度网讯科技有限公司 | Image segmentation model training method, apparatus, device, and storage medium |
- 2020-05-21: CN application CN202010438187.9A, patent CN111612010B (zh), active
- 2021-04-15: US application US17/754,158, patent US12039766B2 (en), active
- 2021-04-15: WO application PCT/CN2021/087579, publication WO2021233031A1 (zh), application filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN105005794A (zh) * | 2015-07-21 | 2015-10-28 | 太原理工大学 | Image pixel semantic annotation method fusing multi-granularity context information |
- CN108229519A (zh) * | 2017-02-17 | 2018-06-29 | 北京市商汤科技开发有限公司 | Image classification method, apparatus, and system |
- CN108229478A (zh) * | 2017-06-30 | 2018-06-29 | 深圳市商汤科技有限公司 | Image semantic segmentation and training method and apparatus, electronic device, storage medium, and program |
US20190087964A1 (en) * | 2017-09-20 | 2019-03-21 | Beihang University | Method and apparatus for parsing and processing three-dimensional cad model |
- CN111612010A (zh) * | 2020-05-21 | 2020-09-01 | 京东方科技集团股份有限公司 | Image processing method, apparatus, and device, and computer-readable storage medium |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN113936141A (zh) * | 2021-12-17 | 2022-01-14 | 深圳佑驾创新科技有限公司 | Image semantic segmentation method and computer-readable storage medium |
- CN113936141B (zh) * | 2021-12-17 | 2022-02-22 | 深圳佑驾创新科技有限公司 | Image semantic segmentation method and computer-readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
US12039766B2 (en) | 2024-07-16 |
CN111612010A (zh) | 2020-09-01 |
CN111612010B (zh) | 2024-07-16 |
US20220292805A1 (en) | 2022-09-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21809867 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21809867 Country of ref document: EP Kind code of ref document: A1 |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21809867 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29.06.2023) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21809867 Country of ref document: EP Kind code of ref document: A1 |