CN117671688A - Segmentation recognition and text description method and system based on hintable segmentation model - Google Patents

Segmentation recognition and text description method and system based on hintable segmentation model

Info

Publication number
CN117671688A
CN117671688A CN202311676811.9A CN202311676811A
Authority
CN
China
Prior art keywords
segmentation
hintable
image
model
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311676811.9A
Other languages
Chinese (zh)
Inventor
王鑫龙
潘汀
唐路路
黄铁军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhiyuan Artificial Intelligence Research Institute
Original Assignee
Beijing Zhiyuan Artificial Intelligence Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhiyuan Artificial Intelligence Research Institute filed Critical Beijing Zhiyuan Artificial Intelligence Research Institute
Priority to CN202311676811.9A priority Critical patent/CN117671688A/en
Publication of CN117671688A publication Critical patent/CN117671688A/en
Pending legal-status Critical Current

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Character Input (AREA)

Abstract

The invention discloses a segmentation recognition and text description method based on a hintable segmentation model, which comprises the following steps: acquiring an image target; establishing a hintable segmentation model ProTo, wherein the hintable segmentation model ProTo fuses the language capability of CLIP into the hintable segmentation task under the SAM architecture so as to simultaneously carry out segmentation recognition and text description on a target, and comprises an image encoder, a hint encoder and an image decoder; the image encoder and the image decoder provide regional semantic information based on visual cues by simulating CLIP; the image decoder is used for providing regional visual representations based on visual prompts; and carrying out segmentation recognition and text description on the image target based on the hintable segmentation model to obtain hintable segmentation, concept prediction and hintable text description. Corresponding systems and electronic devices are also disclosed. Model pre-training effectively utilizes CLIP through concept distillation, and a universal image marking tool with position-sensing capability is realized, promoting regional visual understanding.

Description

Segmentation recognition and text description method and system based on hintable segmentation model
Technical Field
The invention relates to the technical field of artificial intelligence for automatic generation of image and text content (AIGC), and in particular to a segmentation recognition and text description method and system based on a hintable segmentation model.
Background
Recent research has focused on building a unified framework that achieves region-level visual-language alignment through a large number of region-text pairs or multi-modal datasets, reaching significant performance on both open-semantic and interactive segmentation benchmarks.
However, the currently available region-text data is significantly limited in size compared to large-scale segmentation image datasets (e.g., SA-1B). Exhaustive semantic labeling of each instance presents significant challenges, especially when an object is attributable to multiple categories; traditional visual-language alignment methods rely on image-text pairs, which limits their fine-grained region understanding capability. Existing datasets such as LVIS tend to assign a single semantic label to each object. Thus, supervised learning on these artificially labeled datasets may limit the zero-shot transfer capability of the model due to their limited size, fixed categories, and occasionally ambiguous text annotations.
For region-level visual characterization based on visual cues, SAM is an advanced hintable segmentation model, but it lacks semantic labels in its output.
Distilling semantic knowledge into the SAM architecture from a pre-trained visual-language big model (e.g., CLIP) provides a viable path toward zero-shot regional visual understanding. Recent research (e.g., SAM-CLIP and RegionSpot) aims to combine the exhaustive segmentation capability of SAM with the open-vocabulary classification capability of CLIP. However, such integration methods typically require aligned visual-language training data and cannot serve multiple tasks with a unified prompt under a unified architecture. For example, SAM-CLIP retrains a visual encoder with part of the original SAM and CLIP data. While it retains the original advantages of CLIP and SAM, it cannot complete multiple tasks simultaneously from a single prompt (e.g., a point or box). On the other hand, RegionSpot trains an adapter on an object detection dataset to realize unified prompting, so that the mask token of SAM can interact with the CLIP features of the image at the mask. Nonetheless, RegionSpot still requires two separate models to be executed to implement the multiple tasks.
Therefore, it is necessary to build a universal, unified hintable segmentation model that simultaneously completes segmentation, recognition and text description, while simplifying the task pipeline and reducing its cost.
Disclosure of Invention
In order to solve the problems existing in the prior art, the invention provides a segmentation recognition and text description method and system based on a hintable segmentation model, which simulate the visual encoder (image encoder) of CLIP in the hintable segmentation task under the SAM architecture, wherein the mask decoder generates a semantic token for each predicted mask; a visual embedding is then predicted from the semantic token and used to align the distributions over a concept vocabulary between SAM and CLIP. Unlike the general segmentation model SAM, which only performs localized segmentation given visual cues, the object of the invention is to construct a general regional characterization and to realize general recognition and localization through visual cues. To achieve strong region characterization in practical applications, model pre-training employs a large number of segmentation masks (e.g., SA-1B masks) and the semantic prior knowledge of a large CLIP model with 5 billion parameters. On the one hand, a hintable model capable of simultaneously segmenting, identifying and describing any object is provided, and the model depends entirely on visual cues; on the other hand, pre-training of the proposed unified model is carried out by a concept distillation method so as to effectively utilize CLIP; thirdly, a hintable image marking paradigm is provided, and a universal image marking tool with position-sensing capability is realized to facilitate regional visual understanding.
In one aspect, the present invention provides a segmentation recognition and text description method based on a hintable segmentation model, including:
s1, acquiring an image target;
s2, establishing a hintable segmentation model ProTo, wherein the hintable segmentation model ProTo fuses the language capability of CLIP into the hintable segmentation task under the SAM architecture so as to simultaneously carry out segmentation recognition and text description on a target, and comprises an image encoder, a hint encoder and an image decoder; the image encoder and the image decoder provide regional semantic information based on visual cues by simulating CLIP; the image decoder is used for providing regional visual representations based on visual prompts;
and S3, carrying out segmentation recognition and text description on the image target based on the hintable segmentation model so as to obtain hintable segmentation, concept prediction and hintable text description.
Preferably, the S2 includes:
s21, acquiring an image segmentation dataset and a predicted segmentation mask for the hintable segmentation model pre-training;
s22, upgrading the mask decoder in the SAM architecture into a general image decoder to obtain a model primary architecture, and adding a semantic token alongside the mask token of each predicted segmentation mask; the semantic token is used to learn semantic prior knowledge from a predefined concept space, the semantic prior knowledge coming from a CLIP big model with 5 billion parameters;
s23, pre-training the model primary framework in the image segmentation data set, and constructing a simulation CLIP visual encoder and a general image decoder which are integrated in the model primary framework based on the model pre-training;
s24, carrying out joint optimization on segmentation loss based on mask marks and concept distillation loss based on semantic marks based on two subtasks of hintable segmentation and vocabulary concept prediction to obtain a pre-training hintable segmentation model with region recognition and positioning capabilities;
s25, performing fine tuning training on the pre-training hintable segmentation model on the regional text description task.
Preferably, the analog CLIP visual encoder and the universal image decoder form an integrated structure; the image segmentation dataset is from SA-1B; the hint encoder takes as input points, boxes and sketches from the SA-1B data source; in the two subtasks, the mask token of a predicted segmentation mask, after passing through the image decoder, predicts the mask and computes a score through mask embedding, yielding the hintable segmentation, while the region concept is predicted through the text description, the visual embedding and the text embeddings of the CLIP text encoder; the analog CLIP visual encoder is used for preprocessing the SA-1B image segmentation data; the image decoder generates 9 tokens in total, the 9 tokens comprising 4 semantic tokens, 4 mask tokens and 1 IoU token.
Preferably, the pre-training is performed using a two-stage sampling strategy, with at most 9 prompt points; in the first stage, boxes or points are sampled with equal probability from the ground-truth mask corresponding to the image segmentation data; in the second stage, 1 to 8 points are uniformly sampled from the error region between the prediction mask and the ground-truth mask corresponding to the image segmentation data.
Preferably, the obtaining of the hintable segmentation of S3 includes using a sketch or a mask as a first hint; the obtaining of the hintable segmentation includes: sampling the image segmentation data with a non-interactive sampling method applied with 50% probability; the non-interactive sampling method with 50% probability comprises:
uniformly acquiring 1 to 9 points from the ground-truth mask during sampling;
selecting 9 points at linearly spaced positions from the flattened two-dimensional coordinates of the ground-truth mask or sketch during inference, to ensure determinism;
during supervised learning of the segmentation mask, adopting a linear combination of Focal loss and Dice loss, and optimizing the loss function with a 20:1 ratio.
Preferably, said obtaining concept prediction of said S3 comprises predicting regional concepts using semantic tags, thereby enhancing semantic understanding capabilities of the model; the obtaining a concept prediction includes:
based on semantic marks, 1024-dimensional visual embedding is obtained through a 3-layer multilayer perceptron;
projecting the 1024-dimensional visual embedding further into 2560-dimensional distributed logits;
optimizing the KL divergence loss between the predicted and target distributions, which mitigates performance degradation caused by similar concepts and prevents the model from learning amplified annotation bias from CLIP.
Preferably, the obtaining of the hintable text description of S3 includes using the output semantic token [S] of the image decoder as a hint to generate a regional text description; comprising the following steps:
establishing a generative visual language model through causal language modeling, comprising:
(1) Converting the visual signal into a causal sequence of fixed length using a causal Transformer architecture, thereby tokenizing the visual signal for causal language modeling;
(2) Placing the semantic token at the leading position of the causal sequence, followed by a [BOS] token;
(3) Supervising the next-token prediction using cross entropy loss;
(4) Encoding positions of the integrated multimodal sequence with rotation-based (rotary) position embeddings;
establishing a visual encoder, determining semantic tags modeled together with the hintable segmentation, matching the semantic tags with text-embedded dimensions based on linear projection and the visual encoder;
establishing a text decoder, wherein the text decoder has 25 million parameters and a lightweight structure and is used for executing translation from mask to text; the process of mask-to-text translation includes: predicting the tokenized region description following the Llama approach, based on byte-pair encoding and a vocabulary comprising 32,000 tokens; performing text decoding with an 8-layer standard Transformer with an embedding dimension of 512, to suit short descriptions;
performing text reasoning and generating regional text description; the generating the region text description includes:
(1) Iteratively generating up to 40 tokens for each mask;
(2) Caching a plurality of key value pairs based on autoregressive standard practice and generating a plurality of outputs corresponding to each hintable segmentation;
(3) Based on a routing policy between visual embedding and text embedding, a final output is selected from a plurality of outputs corresponding to each hint segmentation as a region text description.
A second aspect of the present invention provides a segmentation recognition and text description system based on a hintable segmentation model, which enhances the regional-level semantic understanding capability of the hintable segmentation model ProTo by aligning vision and language in the hintable segmentation model SAM; the system comprises:
the image target acquisition module is used for acquiring an image target;
the hintable segmentation model building module, which is used for building a hintable segmentation model ProTo, wherein the hintable segmentation model ProTo fuses the language capability of CLIP into the hintable segmentation task under the SAM architecture so as to simultaneously carry out segmentation recognition and text description on a target, and the hintable segmentation model comprises an image encoder, a hint encoder and an image decoder; the image encoder and the image decoder provide regional semantic information based on visual cues by simulating CLIP; the image decoder is used for providing regional visual representations based on visual prompts;
and the segmentation recognition and text description module is used for carrying out segmentation recognition and text description on the image target based on the hintable segmentation model so as to obtain hintable segmentation, concept prediction and hintable text description.
A third aspect of the invention provides an electronic device comprising a processor and a memory, the memory storing a plurality of instructions, the processor being for reading the instructions and performing the method according to the first aspect.
A fourth aspect of the invention provides a computer readable storage medium storing a plurality of instructions readable by a processor and for performing the method of the first aspect.
The method, the system and the electronic equipment provided by the invention have the following beneficial effects:
1. segmentation, recognition and description can be performed simultaneously by learning from large-scale, more readily available segmentation mask data, rather than relying on expensive region-text pairs;
2. by fusing the language capabilities of CLIP into the architecture of SAM through large-scale image segmentation data, a form of exhaustive "concept distillation" is obtained that is distinct from conventional feature alignment; this approach offers several significant advantages:
(1) By using the knowledge base of CLIP, specific annotation bias can be avoided;
(2) Using concept distillation rather than feature alignment avoids imposing strict feature-similarity measurements between features of different architectures;
(3) Reverse visual-language alignment with CLIP is performed without affecting the original geometric hint space of SAM.
(4) Notably, by integrating CLIP in the mask decoder, the model obtains new functions based on segmentation results, such as object recognition and text description.
3. By introducing the concept of "hintable marking" (ProTo), which tokenizes the region specified by a given visual hint (e.g., a point, box, or sketch), the model can serve as a generic, location-aware image tagging tool that can understand and decode regional-level visual context, serving a wider range of visual-language tasks.
Drawings
FIG. 1 is a flow chart of a segmentation recognition and text description method based on a hintable segmentation model in accordance with a preferred embodiment of the present invention.
FIG. 2 is a schematic diagram of a model architecture of a hinted segmentation model ProTo according to a preferred embodiment of the present invention;
FIG. 3 is a diagram of the hintable text generation principle of the preferred embodiment of the present invention;
FIG. 4 is a schematic diagram of a segmentation recognition and text description system based on a hintable segmentation model in accordance with a preferred embodiment of the present invention;
FIGS. 5 (a) -5 (e) are visual results of a preferred embodiment of the present invention;
fig. 6 is a schematic structural diagram of an embodiment of an electronic device according to the present invention.
Detailed Description
In order to better understand the above technical solutions, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.
The method provided by the invention can be implemented in a terminal environment, and the terminal can comprise one or more of the following components: processor, memory and display screen. Wherein the memory stores at least one instruction that is loaded and executed by the processor to implement the method described in the embodiments below.
The processor may include one or more processing cores. The processor connects various parts within the overall terminal using various interfaces and lines, and performs various functions of the terminal and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory and invoking data stored in the memory.
The Memory may include random access Memory (Random Access Memory, RAM) or Read-Only Memory (ROM). The memory may be used to store instructions, programs, code, sets of codes, or instructions.
The display screen is used for displaying a user interface of each application program.
In addition, it will be appreciated by those skilled in the art that the structure of the terminal described above is not limiting and that the terminal may include more or fewer components, or may combine certain components, or a different arrangement of components. For example, the terminal further includes components such as a radio frequency circuit, an input unit, a sensor, an audio circuit, a power supply, and the like, which are not described herein.
Example 1
As shown in fig. 1, in one aspect, the present invention provides a segmentation recognition and text description method based on a hintable segmentation model, which enhances the regional semantic understanding capability of the hintable segmentation model ProTo by aligning the vision and language in the hintable segmentation model SAM; comprising the following steps:
s1, acquiring an image target;
s2, establishing a hintable segmentation model ProTo, wherein the hintable segmentation model ProTo fuses the language capability of CLIP into the hintable segmentation task under the SAM architecture so as to simultaneously carry out segmentation recognition and text description on a target, and comprises an image encoder, a hint encoder and an image decoder; the image encoder and the image decoder provide regional semantic information based on visual cues by simulating CLIP; the image decoder is used for providing regional visual representations based on visual prompts;
and S3, carrying out segmentation recognition and text description on the image target based on the hintable segmentation model so as to obtain hintable segmentation, concept prediction and hintable text description.
As a preferred embodiment, the hintable segmentation model expands the capabilities of the model, including new capabilities such as object recognition and text description, while maintaining the original capabilities of the SAM (particularly segmentation capabilities).
As a preferred embodiment, the analog CLIP visual encoder (image encoder) and the general image decoder are integrated (all-in-one).
As shown in fig. 2, as a preferred embodiment, the S2 includes:
s21, acquiring an image segmentation dataset and a predicted segmentation mask for the hintable segmentation model pre-training;
in this embodiment, the image segmentation dataset is from SA-1B.
S22, replacing the mask decoder in the SAM architecture with a general image decoder to obtain a model primary architecture, and adding a semantic token alongside the mask token of each predicted segmentation mask; the semantic token is used to learn semantic prior knowledge from a predefined concept space, the semantic prior knowledge coming from a CLIP big model with 5 billion parameters;
s23, pre-training the model primary framework in the image segmentation data set, and constructing an integrated analog CLIP visual encoder (image encoder) and a general image decoder in the model primary framework based on the model pre-training;
s24, carrying out joint optimization on segmentation loss based on mask marks and concept distillation loss based on semantic marks based on two subtasks of hintable segmentation and vocabulary concept prediction to obtain a pre-training hintable segmentation model with region recognition and positioning capabilities;
as shown in FIG. 2, the hint encoder obtains input for the points, boxes, and sketches of the SA-1B data source. The two subtasks comprise a mask mark [ M ] of a predicted segmentation mask, a calculated score and other modes of predicting the mask through mask embedding after passing through an image decoder, and obtaining a hintable segmentation; the prediction area concept is embedded by text description (semantic tags S), visual embedding and CLIP text encoder.
In this embodiment, unlike previous approaches that rely on high-cost region-text data, the present invention uses CLIP and image segmentation data from SA-1B to align the segmentation mask with language. Since SA-1B is a class-agnostic dataset, the present invention uses off-the-shelf CLIP embeddings in an artificially designed concept space and aligns the predicted distribution of SAM with the concept vocabulary distribution projected from CLIP, thereby achieving alignment of segmentation masks with language.
S25, performing fine tuning training on the pre-training hintable segmentation model on the regional text description task.
As a preferred embodiment, the analog CLIP visual encoder (image encoder) is used for preprocessing the image segmentation data of SA-1B.
In this embodiment, the preprocessing differs from previous prompting methods such as SAM and SEEM in that the text prompting mode is omitted. This is because text cues may be ambiguous compared to point-based and box-based visual cues, particularly for the small-area masks that account for 70% of SA-1B. The prior art typically uses a pre-trained region proposal network to determine candidate boxes, and then uses CLIP to directly extract the image embeddings within the candidate boxes. In contrast, the SA-1B dataset already provides a high-quality mask for each object in the image. This allows direct computation of the image embedding of the mask area based on these ground-truth masks, avoiding dataset-specific annotation bias or candidate-box prediction errors. Specifically, the present invention calculates image embeddings from mask image crops using a high-performance, open-source CLIP model, EVA-CLIP with 5B parameters, and stores them locally.
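As an illustration of this preprocessing step, the following is a minimal sketch (not the exact implementation of the invention) of how an image embedding could be computed from a ground-truth mask crop and cached offline. The `clip_model` and `preprocess` objects stand for any CLIP-style encoder such as EVA-CLIP; their loading interface and the crop handling details are assumptions.

```python
import numpy as np
import torch
from PIL import Image

def mask_crop_embedding(image, mask, clip_model, preprocess, device="cuda"):
    """Sketch: compute a CLIP image embedding for one ground-truth SA-1B mask.

    image: HxWx3 uint8 array; mask: HxW boolean array.
    clip_model/preprocess: assumed CLIP-style encoder exposing encode_image.
    """
    ys, xs = np.where(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    crop = image[y0:y1, x0:x1].copy()
    crop[~mask[y0:y1, x0:x1]] = 0                      # blank pixels outside the mask
    pixels = preprocess(Image.fromarray(crop)).unsqueeze(0).to(device)
    with torch.no_grad():
        emb = clip_model.encode_image(pixels)          # (1, D) visual embedding
    emb = emb / emb.norm(dim=-1, keepdim=True)         # L2-normalize, as CLIP does
    return emb.squeeze(0).cpu()                        # cached locally for training
```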
As a preferred embodiment, the image decoder co-generates 9 tokens, the 9 tokens including 4 semantic tokens, 4 mask tokens and 1 IoU token.
In this embodiment, the image decoder is associated with hintable segmentation. The mask decoder in SAM uses a lightweight architecture derived from Mask2Former and can respond explicitly to specific input cues. Thus, hintable segmentation can be regarded as a necessary premise for mining the semantic capabilities of the model. Following SAM, the ProTo model of the present invention predicts four masks per hint, but a routing policy ultimately selects one to resolve the ambiguity. Thus, the image decoder generates 9 tokens in total: 4 semantic tokens, 4 mask tokens and 1 IoU token.
As a preferred embodiment, in order to improve training efficiency on the large-scale SA-1B dataset, the pre-training is performed using a two-stage sampling strategy with a maximum of 9 hint points. In the first stage, boxes or points are sampled with equal probability from the ground-truth mask corresponding to the image segmentation data; in the second stage, 1 to 8 points are uniformly sampled from the error region between the prediction mask and the ground-truth mask corresponding to the image segmentation data.
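To make the sampling procedure concrete, the following is a minimal sketch of the two-stage prompt sampling strategy under the stated constraints (at most 9 prompts, a box or point in stage one, 1 to 8 error-region points in stage two). Function and variable names are illustrative assumptions, not the invention's actual code.

```python
import numpy as np

def sample_prompts(gt_mask, pred_mask=None, rng=np.random):
    """Sketch of two-stage prompt sampling (<= 9 prompts per mask)."""
    prompts = []
    ys, xs = np.where(gt_mask)
    if rng.rand() < 0.5:                               # stage 1: box prompt
        prompts.append(("box", (xs.min(), ys.min(), xs.max(), ys.max())))
    else:                                              # stage 1: point prompt
        i = rng.randint(len(xs))
        prompts.append(("point", (xs[i], ys[i])))

    if pred_mask is not None:                          # stage 2: 1-8 error points
        err_ys, err_xs = np.where(pred_mask != gt_mask)
        if len(err_xs) > 0:
            k = min(rng.randint(1, 9), len(err_xs))
            idx = rng.choice(len(err_xs), size=k, replace=False)
            prompts += [("point", (err_xs[j], err_ys[j])) for j in idx]
    return prompts[:9]
```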
As a preferred embodiment, the obtaining of the hintable segmentation of S3 comprises employing a sketch or mask as a first hint, an aspect not explored in the original SAM; the obtaining of the hintable segmentation comprises: sampling the image segmentation data with a non-interactive sampling method applied with 50% probability; the non-interactive sampling method with 50% probability comprises:
uniformly acquiring 1 to 9 points from the ground-truth mask during sampling;
selecting 9 points at linearly spaced positions from the flattened two-dimensional coordinates of the ground-truth mask or sketch during inference, to ensure determinism;
during supervised learning of the segmentation mask, adopting a linear combination of Focal loss and Dice loss and optimizing the loss function with a 20:1 ratio, following the strategy of the original SAM.
In this embodiment:
(1) Focal loss was proposed to solve the class-imbalance problem in one-stage object detection: when performing object detection on an image, roughly 10,000-100,000 candidate positions are evaluated, but only a few of them contain objects, so training is inefficient and most negative samples lie in background regions. Focal loss introduces modulating parameters into the cross entropy so that, at initialization, the output distribution of the last layer roughly matches the distribution of positive and negative samples.
(2) The Dice Loss is a Loss function for the image segmentation task. It calculates a score based on the ratio of the overlapping area between the target segmented image and the model output result. It is more suitable for handling difficult-to-divide objects than the cross entropy loss function. The formula of the Dice Loss is as follows:
DiceLoss=1-(2*|X&Y|)/(|X|+|Y|);
wherein X represents the prediction of the hintable segmentation model, Y represents the target segmented image corresponding to the image segmentation data, the symbol & represents a pixel-by-pixel bitwise AND, |X| represents the sum of all pixel values in X, and |Y| represents the sum of all pixel values in Y.
Dice Loss performs well in image segmentation tasks with class imbalance and with small but numerous targets. The cross entropy loss function ignores the similarity between the predicted value and the target value and is not sensitive enough to extreme pixel values, whereas Dice Loss is a similarity-based criterion that handles unbalanced pixel values well by focusing on the overlap between prediction and target.
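As a concrete illustration of the 20:1 combination described above, the following is a minimal sketch of the supervised segmentation loss. The Focal-loss hyper-parameters (alpha, gamma) are common defaults and are assumptions, not values taken from the invention.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(logits, target, alpha=0.25, gamma=2.0, eps=1e-6):
    """Sketch: Focal loss + Dice loss combined at a 20:1 ratio.

    logits, target: (N, H, W) float tensors with target values in {0, 1}.
    """
    prob = logits.sigmoid()
    # Focal loss: cross entropy re-weighted to down-weight easy examples.
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = prob * target + (1 - prob) * (1 - target)
    a_t = alpha * target + (1 - alpha) * (1 - target)
    focal = (a_t * (1 - p_t) ** gamma * ce).mean()
    # Dice loss: 1 - 2|X & Y| / (|X| + |Y|), computed on soft predictions.
    inter = (prob * target).sum(dim=(1, 2))
    dice = 1 - (2 * inter + eps) / (prob.sum(dim=(1, 2)) + target.sum(dim=(1, 2)) + eps)
    return 20.0 * focal + 1.0 * dice.mean()            # 20:1 linear combination
```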
As a preferred embodiment, said obtaining concept predictions of S3 includes predicting regional concepts using semantic tags, thereby enhancing semantic understanding capabilities of the model.
In this embodiment, the obtaining the concept prediction includes:
based on semantic tags, 1024-dimensional visual embedding is obtained through a 3-layer MLP (multi-layer perceptron);
projecting the 1024-dimensional visual embedding further into 2560-dimensional distributed logits;
the KL divergence loss between the predicted distribution and the target distribution is optimized, and performance degradation caused by similar concepts is relieved. For example, bulldog (bulldog) is a form of dog (dog), and therefore, should not deviate too far in the characterization space from concepts related to dogs or cats, etc.; the image-text distribution provides more information and prevents models from learning from CLIP to enlarged annotation bias.
As a preferred embodiment, the obtaining of the hintable text description of S3 replaces traditional, manually customized prediction tasks with next-token prediction in the style of large language models, and releases the potential of the hintable semantic token through a text generation paradigm.
The obtaining of the hintable text description comprises using the output semantic token [S] of the image decoder as a hint and generating a regional text description; it comprises the following steps:
1. a generative visual language model is established through causal language modeling to promote the development of a visual basic model. Based on the generated visual language model, semantic tags from the image decoder are directly used to hint causal text generation. Specifically, the creating the generative visual language model through causal language modeling includes:
(1) Converting the visual signal into a causal sequence of fixed length using a causal Transformer architecture, thereby tokenizing the visual signal for causal language modeling;
(2) Placing the semantic token at the leading position of the causal sequence, followed by a [BOS] token;
(3) Supervising the next-token prediction using cross entropy loss;
(4) Encoding positions of the integrated multimodal sequence with rotation-based (rotary) position embeddings.
Unlike the prior art, which loosely couples three frozen models, the model of the present invention handles this task end-to-end, with the text generation architecture shown in fig. 3. Much previous work fine-tunes pre-trained models with pseudo open-semantic classifiers on open-semantic datasets. However, this lags far behind the context-dialogue-based fine-tuning used in natural language processing, which can encode unlimited human knowledge from the open world. Unlike the synchronous operation of Emu and ASM, the invention converts visual signals into causal sequences of fixed length by introducing an intermediate Transformer, essentially tokenizing the visual signals for causal modeling; it places the semantic token at the leading position of the sequence, followed by a [BOS] token, supervises the next-token prediction using cross entropy loss, and encodes the positions of the integrated multi-modal sequence with rotary embeddings.
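The following is a minimal sketch of this training step: the semantic token is projected into the text embedding space, placed ahead of [BOS] and the caption tokens, and the decoder is supervised with cross entropy on next-token prediction. `proj` and `text_decoder` are assumed modules (a linear projection and an 8-layer Transformer with rotary position embeddings); the padding and BOS token ids are also assumptions.

```python
import torch
import torch.nn.functional as F

def caption_loss(semantic_token, caption_ids, proj, text_decoder,
                 bos_id=1, pad_id=0):
    """Sketch: causal language modeling prompted by the semantic token.

    semantic_token: (N, C) decoder output; caption_ids: (N, T) target tokens.
    """
    tok_embed = text_decoder.token_embedding             # assumed nn.Embedding
    prefix = proj(semantic_token).unsqueeze(1)            # (N, 1, 512) visual prompt
    bos = tok_embed(torch.full_like(caption_ids[:, :1], bos_id))
    words = tok_embed(caption_ids)                         # (N, T, 512)
    seq = torch.cat([prefix, bos, words[:, :-1]], dim=1)   # teacher forcing
    logits = text_decoder(seq)                             # (N, T + 1, vocab)
    # positions after [BOS] are supervised to predict the caption tokens
    return F.cross_entropy(
        logits[:, 1:].reshape(-1, logits.size(-1)),
        caption_ids.reshape(-1),
        ignore_index=pad_id,
    )
```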
2. Establishing a visual encoder, determining semantic tags modeled together with the hintable segmentation, matching the semantic tags with text-embedded dimensions based on linear projection and the visual encoder; this encoder is simpler than previous methods, it parameterizes the region of interest characteristics; and visually encoding the segmented regions to achieve a more comprehensive visual understanding.
3. Establishing a text decoder, wherein the text decoder has 25 million parameters and a lightweight structure and is used for executing translation from mask to text; the process of mask-to-text translation includes: predicting the tokenized region description following the Llama approach, based on byte-pair encoding and a vocabulary comprising 32,000 tokens; performing text decoding with an 8-layer standard Transformer with an embedding dimension of 512, to suit short descriptions (maximum context length 40);
4. performing text reasoning and generating regional text description; the generating the region text description includes:
(1) Iteratively generating up to 40 tokens for each mask;
(2) In order to accelerate the calculation of the attention module, caching a plurality of key value pairs based on the autoregressive standard practice and generating a plurality of outputs corresponding to each hintable segmentation;
(3) Based on a routing policy between visual embedding and text embedding, a final output is selected from a plurality of outputs corresponding to each hint segmentation as a region text description.
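To illustrate how these three inference steps could fit together, the following is a minimal sketch of caption generation with routing between visual and text embeddings. The `step` interface of the decoder (which would hold the cached key-value pairs internally), the encoder used to embed generated captions, and the token ids are all assumptions made purely for illustration.

```python
import torch

@torch.no_grad()
def generate_region_caption(semantic_tokens, visual_embed, proj,
                            text_decoder, text_encoder,
                            max_tokens=40, bos_id=1, eos_id=2):
    """Sketch: per-candidate caption generation plus routing of the final output.

    semantic_tokens: list of candidate tokens for one prompt; visual_embed: (D,).
    """
    captions, scores = [], []
    for tok in semantic_tokens:                         # one candidate per mask
        prefix = proj(tok).unsqueeze(0).unsqueeze(1)    # (1, 1, 512) visual prompt
        ids, cache = [bos_id], None
        for _ in range(max_tokens):                     # iterative decoding, KV cached
            logits, cache = text_decoder.step(prefix, ids, cache=cache)
            next_id = int(logits[0, -1].argmax())
            if next_id == eos_id:
                break
            ids.append(next_id)
        caption = ids[1:]
        captions.append(caption)
        text_embed = text_encoder(caption)              # (D,) embedding of the caption
        scores.append(torch.dot(visual_embed, text_embed))
    best = int(torch.stack(scores).argmax())            # routing policy
    return captions[best]
```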
Example two
As shown in fig. 4, the present embodiment provides a segmentation recognition and text description system based on a hintable segmentation model, which enhances the regional semantic understanding capability of the hintable segmentation model ProTo by aligning the vision and language in the hintable segmentation model SAM, comprising:
an image target acquisition module 101 for acquiring an image target;
a hintable segmentation model building module 102, configured to build a hintable segmentation model ProTo, where the hintable segmentation model ProTo fuses the language capability of CLIP into the hintable segmentation task under the SAM architecture so as to perform segmentation recognition and text description on a target simultaneously, and the hintable segmentation model includes an image encoder, a hint encoder, and an image decoder; the image encoder and the image decoder provide regional semantic information based on visual cues by simulating CLIP; the image decoder is used for providing regional visual representations based on visual prompts;
the segmentation recognition and text description module 103 is configured to perform segmentation recognition and text description on the image target based on the hintable segmentation model so as to obtain hintable segmentation, concept prediction and hintable text description.
The visualization results are shown in figs. 5 (a) -5 (e): by simply clicking, drawing a box or sketching on an image, the model can simultaneously perform object segmentation, recognition and text description on the selected region.
Experimental and visualization results show that the proposed model exhibits strong performance in zero-shot classification and recognition (e.g., 58.8 AP on LVIS) while maintaining competitive segmentation performance (SAM: 42.1 AP vs. 42.0 AP for the present invention). Notably, the present invention sets a new benchmark on the region text description task of the Visual Genome dataset with a CIDEr score of 150.7. The research shows that the model can serve as a universal regional image marking tool, can encode regional semantic context for wider visual-language tasks, and that the tokenized regional features can be directly used for prompting causal language modeling, thereby providing a new perspective for visual-language research.
The invention also provides a memory storing a plurality of instructions for implementing the method according to the first embodiment.
As shown in fig. 6, the present invention further provides an electronic device, including a processor 301 and a memory 302 connected to the processor 301, where the memory 302 stores a plurality of instructions, and the instructions may be loaded and executed by the processor, so that the processor can execute the method according to the first embodiment.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A segmentation recognition and text description method based on a hintable segmentation model, comprising:
s1, acquiring an image target;
s2, establishing a hintable segmentation model ProTo, wherein the hintable segmentation model ProTo fuses the language capability of CLIP into the hintable segmentation task under the SAM architecture so as to simultaneously carry out segmentation recognition and text description on a target, and comprises an image encoder, a hint encoder and an image decoder; the image encoder and the image decoder provide regional semantic information based on visual cues by simulating CLIP; the image decoder is used for providing regional visual representations based on visual prompts;
and S3, carrying out segmentation recognition and text description on the image target based on the hintable segmentation model so as to obtain hintable segmentation, concept prediction and hintable text description.
2. The segmentation recognition and text description method based on the hintable segmentation model of claim 1, wherein S2 comprises:
s21, acquiring an image segmentation dataset and a predicted segmentation mask for the hintable segmentation model pre-training;
s22, replacing the mask decoder in the SAM architecture with a general image decoder to obtain a model primary architecture, and adding a semantic token alongside the mask token of each predicted segmentation mask; the semantic token is used to learn semantic prior knowledge from a predefined concept space, the semantic prior knowledge coming from a CLIP big model with 5 billion parameters;
s23, pre-training the model primary framework in the image segmentation data set, and constructing a simulation CLIP visual encoder and a general image decoder which are integrated in the model primary framework based on the model pre-training;
s24, carrying out joint optimization on segmentation loss based on mask marks and concept distillation loss based on semantic marks based on two subtasks of hintable segmentation and vocabulary concept prediction to obtain a pre-training hintable segmentation model with region recognition and positioning capabilities;
s25, performing fine tuning training on the pre-training hintable segmentation model on the regional text description task.
3. The segmentation recognition and text description method based on the hintable segmentation model as set forth in claim 2, wherein the simulated CLIP visual encoder and the generic image decoder form an integrated structure; the image segmentation dataset is from SA-1B; the hint encoder takes as input points, boxes and sketches from the SA-1B data source; in the two subtasks, the mask token of a predicted segmentation mask, after passing through the image decoder, predicts the mask and computes a score through mask embedding, yielding the hintable segmentation, while the region concept is predicted through the text description, the visual embedding and the text embeddings of the CLIP text encoder; the analog CLIP visual encoder is used for preprocessing the SA-1B image segmentation data; the image decoder generates 9 tokens in total, the 9 tokens comprising 4 semantic tokens, 4 mask tokens and 1 IoU token.
4. A segmentation recognition and text description method based on a hintable segmentation model according to claim 3, wherein the pre-training is performed using a two-stage sampling strategy with at most 9 hint points; in the first stage, boxes or points are sampled with equal probability from the ground-truth mask corresponding to the image segmentation data; in the second stage, 1 to 8 points are uniformly sampled from the error region between the prediction mask and the ground-truth mask corresponding to the image segmentation data.
5. The method of claim 4, wherein the obtaining of the hintable segmentation of S3 comprises using a sketch or a mask as a first hint, the obtaining of the hintable segmentation comprising: sampling the image segmentation data with a non-interactive sampling method applied with 50% probability; the non-interactive sampling method with 50% probability comprises:
uniformly acquiring 1 to 9 points from the ground-truth mask during sampling;
selecting 9 points at linearly spaced positions from the flattened two-dimensional coordinates of the ground-truth mask or sketch during inference, to ensure determinism;
during supervised learning of the segmentation mask, adopting a linear combination of Focal loss and Dice loss and optimizing the loss function with a 20:1 ratio.
6. The segmentation recognition and text description method based on a hintable segmentation model as set forth in claim 5, wherein the deriving concept prediction of S3 comprises predicting regional concepts using semantic tags to enhance semantic understanding capabilities of the model; the obtaining a concept prediction includes:
based on semantic marks, 1024-dimensional visual embedding is obtained through a 3-layer multilayer perceptron;
projecting the 1024-dimensional visual embedding further into 2560-dimensional distributed logits;
optimizing the KL divergence loss between the predicted and target distributions, which mitigates performance degradation caused by similar concepts and prevents the model from learning amplified annotation bias from CLIP.
7. The segmentation recognition and text description method based on a hintable segmentation model of claim 6, wherein the obtaining of the hintable text description of S3 comprises using the output semantic token [S] of the image decoder as a hint to generate a region text description; comprising the following steps:
establishing a generative visual language model through causal language modeling, comprising:
(1) Converting the visual signal into a causal sequence of fixed length using a causal Transformer architecture, thereby tokenizing the visual signal for causal language modeling;
(2) Placing the semantic token at the leading position of the causal sequence, followed by a [BOS] token;
(3) Supervising the next-token prediction using cross entropy loss;
(4) Encoding positions of the integrated multimodal sequence with rotation-based (rotary) position embeddings;
establishing a visual encoder, determining semantic tags modeled together with the hintable segmentation, matching the semantic tags with text-embedded dimensions based on linear projection and the visual encoder;
establishing a text decoder, wherein the text decoder has 25 million parameters and a lightweight structure and is used for executing translation from mask to text; the process of mask-to-text translation includes: predicting the tokenized region description following the Llama approach, based on byte-pair encoding and a vocabulary comprising 32,000 tokens; performing text decoding with an 8-layer standard Transformer with an embedding dimension of 512, to suit short descriptions;
performing text reasoning and generating regional text description; the generating the region text description includes:
(1) Iteratively generating up to 40 tokens for each mask;
(2) Caching a plurality of key value pairs based on autoregressive standard practice and generating a plurality of outputs corresponding to each hintable segmentation;
(3) Based on a routing policy between visual embedding and text embedding, a final output is selected from a plurality of outputs corresponding to each hint segmentation as a region text description.
8. A segmentation recognition and text description system based on a hintable segmentation model for implementing the method of any one of claims 1-7, comprising:
an image target acquisition module (101) for acquiring an image target;
a hintable segmentation model building module (102) for building a hintable segmentation model ProTo, the hintable segmentation model ProTo fusing the language capability of CLIP into the hintable segmentation task under the SAM architecture so as to perform segmentation recognition and text description on a target simultaneously, the hintable segmentation model comprising an image encoder, a hint encoder and an image decoder; the image encoder and the image decoder provide regional semantic information based on visual cues by simulating CLIP; the image decoder is used for providing regional visual representations based on visual prompts;
and the segmentation recognition and text description module (103) is used for carrying out segmentation recognition and text description on the image target based on the hintable segmentation model so as to obtain hintable segmentation, concept prediction and hintable text description.
9. An electronic device comprising a processor and a memory, the memory storing a plurality of instructions, the processor configured to read the instructions and perform the method of any of claims 1-7.
10. A computer readable storage medium storing a plurality of instructions readable by a processor and for performing the method of any one of claims 1-7.
CN202311676811.9A 2023-12-07 2023-12-07 Segmentation recognition and text description method and system based on hintable segmentation model Pending CN117671688A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311676811.9A CN117671688A (en) 2023-12-07 2023-12-07 Segmentation recognition and text description method and system based on hintable segmentation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311676811.9A CN117671688A (en) 2023-12-07 2023-12-07 Segmentation recognition and text description method and system based on hintable segmentation model

Publications (1)

Publication Number Publication Date
CN117671688A true CN117671688A (en) 2024-03-08

Family

ID=90074797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311676811.9A Pending CN117671688A (en) 2023-12-07 2023-12-07 Segmentation recognition and text description method and system based on hintable segmentation model

Country Status (1)

Country Link
CN (1) CN117671688A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117911430A (en) * 2024-03-19 2024-04-19 中国医学科学院北京协和医院 Method and device for segmenting interactive microorganism image based on transformer
CN117909535A (en) * 2024-03-15 2024-04-19 中国科学技术大学 Combined understanding method, system, equipment and medium based on visual language model
CN117953224A (en) * 2024-03-27 2024-04-30 暗物智能科技(广州)有限公司 Open vocabulary 3D panorama segmentation method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023154192A1 (en) * 2022-02-14 2023-08-17 Snap Inc. Video synthesis via multimodal conditioning
CN116612281A (en) * 2023-05-20 2023-08-18 复旦大学 Text supervision-based open vocabulary image semantic segmentation system
CN116935418A (en) * 2023-09-15 2023-10-24 成都索贝数码科技股份有限公司 Automatic three-dimensional graphic template reorganization method, device and system
CN116994140A (en) * 2023-08-14 2023-11-03 航天宏图信息技术股份有限公司 Cultivated land extraction method, device, equipment and medium based on remote sensing image
CN117036706A (en) * 2023-08-11 2023-11-10 北京无代码科技有限公司 Image segmentation method and system based on multi-modal dialogue language model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023154192A1 (en) * 2022-02-14 2023-08-17 Snap Inc. Video synthesis via multimodal conditioning
CN116612281A (en) * 2023-05-20 2023-08-18 复旦大学 Text supervision-based open vocabulary image semantic segmentation system
CN117036706A (en) * 2023-08-11 2023-11-10 北京无代码科技有限公司 Image segmentation method and system based on multi-modal dialogue language model
CN116994140A (en) * 2023-08-14 2023-11-03 航天宏图信息技术股份有限公司 Cultivated land extraction method, device, equipment and medium based on remote sensing image
CN116935418A (en) * 2023-09-15 2023-10-24 成都索贝数码科技股份有限公司 Automatic three-dimensional graphic template reorganization method, device and system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117909535A (en) * 2024-03-15 2024-04-19 中国科学技术大学 Combined understanding method, system, equipment and medium based on visual language model
CN117909535B (en) * 2024-03-15 2024-05-31 中国科学技术大学 Combined understanding method, system, equipment and medium based on visual language model
CN117911430A (en) * 2024-03-19 2024-04-19 中国医学科学院北京协和医院 Method and device for segmenting interactive microorganism image based on transformer
CN117953224A (en) * 2024-03-27 2024-04-30 暗物智能科技(广州)有限公司 Open vocabulary 3D panorama segmentation method and system

Similar Documents

Publication Publication Date Title
Cho et al. Unifying vision-and-language tasks via text generation
CN117671688A (en) Segmentation recognition and text description method and system based on hintable segmentation model
CN113792113A (en) Visual language model obtaining and task processing method, device, equipment and medium
CN114580424B (en) Labeling method and device for named entity identification of legal document
CN111612010A (en) Image processing method, device, equipment and computer readable storage medium
CN114548099A (en) Method for jointly extracting and detecting aspect words and aspect categories based on multitask framework
CN111597816A (en) Self-attention named entity recognition method, device, equipment and storage medium
CN113569068B (en) Descriptive content generation method, visual content encoding and decoding method and device
CN112084788B (en) Automatic labeling method and system for implicit emotion tendencies of image captions
CN114937277B (en) Image-based text acquisition method and device, electronic equipment and storage medium
CN115759262A (en) Visual common sense reasoning method and system based on knowledge perception attention network
CN114241279A (en) Image-text combined error correction method and device, storage medium and computer equipment
CN114692649A (en) Automatic answer text generation method using multi-view information
Ma et al. Diagram perception networks for textbook question answering via joint optimization
Liang et al. Savitar: an intelligent sign language translation approach for deafness and dysphonia in the COVID-19 era
CN117671426B (en) Concept distillation and CLIP-based hintable segmentation model pre-training method and system
Cai et al. ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
CN114138934B (en) Text smoothness detection method, device, equipment and storage medium
CN117557871B (en) Three-dimensional model labeling method, device, equipment and storage medium
US20240176959A1 (en) Method and apparatus for generating language model using crossmodal information
CN116310984B (en) Multi-mode video subtitle generating method based on Token sampling
CN113192030B (en) Remote sensing image description generation method and system
CN117506940B (en) Robot track language description generation method, device and readable storage medium
CN118133191A (en) Target detection method and device for multi-mode data
CN117889864A (en) Visual language navigation method based on dual semantic graph and modal alignment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination