CN118365887B - Image segmentation method and device for open vocabulary transmission line equipment

Info

Publication number: CN118365887B
Application number: CN202410782433.0A
Authority: CN
Inventors: 张峰; 李端姣; 李雄刚; 陈浩; 郑志豪; 汪皓; 李国强; 刘高; 蒙华伟; 林俊省; 廖建东; 饶成成; 喻凌立; 张英; 缪钟灵; 李声福; 谢卓均; 黄兴; 张洛嘉
Original assignee: Guangdong Power Grid Co Ltd
Current assignee: Guangdong Power Grid Co Ltd
Priority date: 2024-06-18
Filing date: 2024-06-18
Publication date: 2024-09-10
Anticipated expiration: 2044-06-18
Also published as: CN118365887A

Abstract

The application discloses an image segmentation method and device for open vocabulary transmission line equipment. The method comprises: inputting an image to be segmented into an image encoder in a preset transmission line equipment visual language large model to extract image features; inputting the image features into a pre-trained adapter to obtain text implicit features; inputting the text implicit features and the image to be segmented into a pre-trained text image diffusion model to obtain image internal features; inputting the image internal features into a pre-trained segmentation result predictor to obtain a segmentation result and a prediction result embedding representation; and obtaining a final segmentation result by combining the segmentation result, the prediction result embedding representation, the text category label and the image features, so as to segment the image to be segmented. The application can handle image segmentation tasks in dynamic environments while reducing the data requirements and the manual labeling cost.

Description

Image segmentation method and device for open vocabulary transmission line equipment

Technical Field

The invention relates to the field of transmission line equipment image segmentation, in particular to an open vocabulary transmission line equipment image segmentation method and device.

Background

Image segmentation of transmission line equipment helps extract high-level geographic information from large-scale transmission line equipment images, and provides rich geographic information support for various decision-making, management and monitoring tasks.

In the prior art, an image segmentation model is usually obtained through closed training, that is, the set of segmentation classes stays unchanged between the training and testing stages. Actual application scenes, however, are more complex, and the segmentation target classes change continuously; because targets of new classes cannot be segmented correctly, the segmentation precision of the model ends up low. To address this, the prior art retrains the model after adding a manually labeled data set covering the new classes. However, building such a data set requires a large amount of raw data and labor, which greatly increases the difficulty and cost of the segmentation task.

To solve the technical problems of low segmentation precision caused by closed training and high cost caused by manually adding new training data, a transmission line equipment image segmentation method that is more robust and does not require a large amount of human resources is urgently needed.

Disclosure of Invention

The invention provides an image segmentation method and device for open vocabulary transmission line equipment, which are used for solving the technical problems of low segmentation precision caused by closed training and high cost caused by manually adding new training data.

In order to solve the above technical problems, in a first aspect, an embodiment of the present invention provides an image segmentation method for an open vocabulary transmission line device, including:

acquiring an image to be segmented and a text category label, inputting the image to be segmented into an image encoder in a preset transmission line equipment visual language large model, and extracting image features;

inputting the image features into a pre-trained adapter to obtain text implicit features;

inputting the text implicit features and the image to be segmented into a pre-trained text image diffusion model to obtain image internal features;

inputting the image internal features into a pre-trained segmentation result predictor to obtain a segmentation result and a prediction result embedding representation;

and obtaining a final segmentation result by combining the segmentation result, the prediction result embedding representation, the text category label and the image features, and segmenting the image to be segmented according to the final segmentation result.

Compared with the prior art, the embodiment of the invention has the following beneficial effects: because the text image diffusion model and the equipment visual language large model are semantically rich, the model transfers well and can cope with a large number of dynamically changing scenes, so that segmentation tasks involving many new classification targets can be handled and the robustness of the image segmentation model is improved; meanwhile, the rich semantic information of the pre-trained visual language large model is combined with the dot product operation, so that the image features extracted by the equipment visual language large model are used efficiently and the segmentation precision and accuracy for new classification targets are improved; and the drawback that traditional segmentation methods need a large amount of manual labeling or incremental model training when facing new classification targets is overcome, reducing the data acquisition burden and cost of the segmentation task.
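The data flow of the five steps can be sketched as follows. This is only a wiring sketch: the class name `OpenVocabSegmenter` and all six component callables are hypothetical placeholders chosen for illustration, not names taken from the patent, and each would be backed by a real model in practice.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple


@dataclass
class OpenVocabSegmenter:
    """Wires the five steps together; every component is a hypothetical callable."""
    image_encoder: Callable[[Any], Any]             # visual language model, image branch
    text_encoder: Callable[[Sequence[str]], Any]    # visual language model, text branch
    adapter: Callable[[Any], Any]                   # image features -> text implicit features
    diffusion_features: Callable[[Any, Any], Any]   # (image, implicit) -> image internal features
    predictor: Callable[[Any], Tuple[Any, Any]]     # internal features -> (masks, embeddings)
    fuse: Callable[[Any, Any, Any, Any], Any]       # (masks, embeddings, text, image features)

    def segment(self, image: Any, labels: Sequence[str]) -> Any:
        image_features = self.image_encoder(image)            # step 1
        implicit = self.adapter(image_features)               # step 2
        internal = self.diffusion_features(image, implicit)   # step 3
        masks, embeddings = self.predictor(internal)          # step 4
        text_features = self.text_encoder(labels)             # step 5: text explicit features
        return self.fuse(masks, embeddings, text_features, image_features)
```

Note that the image features are used twice: once as input to the adapter and once again in the final fusion.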

In an embodiment of the first aspect, the obtaining a final segmentation result by combining the segmentation result, the prediction result embedding representation, the text category label and the image feature comprises:

extracting text explicit features from the text category labels through a text encoder in the preset transmission line equipment visual language large model;

and fusing the segmentation result, the prediction result embedding representation, the text explicit features and the image features through a dot product operation to obtain the final segmentation result.

Compared with the prior art, the embodiment of the invention has the following beneficial effects: according to the rich semantic characteristics of the visual language big model, the text explicit characteristics in the text category labels are obtained, the image characteristics, the predicted segmentation result characteristics and the text explicit characteristics are further comprehensively considered, the information integration of the image characteristics and the text labels is realized, and the segmentation accuracy is improved.

In an embodiment of the first aspect, the fusing the segmentation result, the prediction result embedding representation, the text explicit features and the image features through a dot product operation to obtain the final segmentation result includes:

fusing the segmentation result, the prediction result embedding representation, the text explicit features and the image features by the following formula:

wherein the terms of the formula denote, respectively: the final segmentation result; the fused segmentation-result feature of a given segmentation class; the prediction result embedding representation of that segmentation class; the segmentation result of that segmentation class; the image features; the fusion of the image internal features and the predicted segmentation result; the text category labels; the text explicit features extracted by the text encoder in the preset transmission line equipment visual language large model; and a normalization function.

Compared with the prior art, the embodiment of the invention has the following beneficial effects: through the dot product operation, the text explicit features, the image features and the segmentation result features are fused, realizing feature fusion and joint representation learning and improving the information integration effect.
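The fusion formula itself does not survive in this text, so the following is a sketch of one plausible dot-product fusion rather than the patented formula. It assumes that image features are averaged over each predicted mask (mask pooling), that the resulting logits are added to those of the prediction embedding, and that the normalization function is a softmax. The function names and the additive combination are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(logits: Vector) -> Vector:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mask_pool(pixel_features: List[Vector], mask: Vector) -> Vector:
    """Mask-weighted average of per-pixel image features (flattened H*W)."""
    weight = sum(mask) or 1.0  # avoid dividing by zero for an empty mask
    dim = len(pixel_features[0])
    return [sum(m * f[d] for m, f in zip(mask, pixel_features)) / weight
            for d in range(dim)]


def fuse_scores(embedding: Vector, mask: Vector,
                pixel_features: List[Vector],
                text_features: List[Vector]) -> Vector:
    """Class probabilities for one predicted mask.

    Assumption: the embedding logits and the mask-pooled image-feature
    logits are simply added before normalization.
    """
    pooled = mask_pool(pixel_features, mask)
    logits = [dot(embedding, t) + dot(pooled, t) for t in text_features]
    return softmax(logits)
```

Each entry of `text_features` is the explicit feature of one category label, so adding a new category at inference time only means appending one more text vector; no retraining of the mask predictor is implied by this step.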

In an embodiment of the first aspect, when training the adapter and the segmentation result predictor, both the degree of correspondence between the segmentation result and the text category label and the segmentation result itself are iteratively trained through a cross-entropy loss function.

Compared with the prior art, the embodiment of the invention has the following beneficial effects: through the cross entropy loss function, not only is iterative training carried out on the prediction segmentation result, but also accurate alignment of the prediction result and the category label text is ensured, and the reliability and accuracy of the segmentation result are improved.

In an embodiment of the first aspect, the iteratively training, through a cross-entropy loss function, the degree of correspondence between the segmentation result and the text category label and the segmentation result itself includes:

the cross entropy loss function is specifically:

wherein the terms of the formula denote, respectively: the text category label loss function, used for iteratively training the degree of correspondence between the segmentation result and the text category label; the image category label loss function, used for iteratively training the segmentation result; the cross-entropy loss function; a normalization function; the prediction result embedding representation of a given segmentation class; the segmentation result of that segmentation class; the true text category label of that segmentation class; the true segmented image of that segmentation class; the text explicit features; and the number of segmentation classes.

Compared with the prior art, the embodiment of the invention has the following beneficial effects: when aligning the prediction result with the category label text, the prediction result embedding representation is first fused with the text explicit features to obtain a prediction for the text category label; the degree of alignment between that prediction and the category label text is then iteratively trained through the cross-entropy loss function, while the predicted segmentation result is iteratively trained at the same time, thereby improving the reliability and accuracy of the segmentation result.
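Since the loss formula is not reproduced in this text, the sketch below shows only one way the two described terms could be computed. It assumes the label term is a cross entropy over a softmax of embedding and text-feature dot products, the mask term is a per-pixel binary cross entropy, and the two are summed with equal weight. All function names and these three choices are assumptions.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_loss(embeddings: List[Vector], text_features: List[Vector],
               true_labels: List[int]) -> float:
    """Mean cross entropy between softmax(embedding . text) and the true class."""
    total = 0.0
    for z, y in zip(embeddings, true_labels):
        logits = [sum(a * b for a, b in zip(z, t)) for t in text_features]
        total += -math.log(max(softmax(logits)[y], 1e-12))
    return total / len(embeddings)


def mask_loss(pred_masks: List[Vector], true_masks: List[Vector]) -> float:
    """Mean per-pixel binary cross entropy between predicted and true masks."""
    eps = 1e-12
    total, count = 0.0, 0
    for pred, true in zip(pred_masks, true_masks):
        for p, g in zip(pred, true):
            total -= g * math.log(max(p, eps)) + (1 - g) * math.log(max(1 - p, eps))
            count += 1
    return total / count


def total_loss(embeddings, text_features, true_labels, pred_masks, true_masks) -> float:
    # Assumption: equal weighting of the two terms.
    return (label_loss(embeddings, text_features, true_labels)
            + mask_loss(pred_masks, true_masks))
```

The label term is what ties each predicted mask to the category text, while the mask term only supervises the mask shape.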

In an embodiment of the first aspect, the inputting the implicit text feature and the image to be segmented into a pre-trained text image diffusion model, obtaining the internal image feature includes:

extracting a noise image of the image to be segmented through a forward process of the diffusion model;

and inputting the noise image and the implicit text features to a denoising module in the backward process of the diffusion model to obtain the internal image features.

Compared with the prior art, the embodiment of the invention has the following beneficial effects: the text image diffusion model has excellent semantic control capability, and can generate high-quality images according to various open vocabulary languages by calculating cross attention between text embedded characterization and image visual characterization, so that the correlation and the distinguishing degree of the features are improved, the segmentation performance of the model is enhanced, and the generalization capability of image segmentation can be improved when a new segmentation task is faced.

In an embodiment of the first aspect, the extracting the noise image of the image to be segmented by the forward process of the diffusion model includes:

the noise image acquisition mode specifically comprises the following steps:

wherein the terms of the formula denote, respectively: the noise image; the step of the forward process; the image to be segmented; noise sampled from the standard normal distribution; the hyperparameters controlling the noise magnitude in the model; and the standard normal distribution.

Compared with the prior art, the embodiment of the invention has the following beneficial effects: a text image diffusion model built around a denoising module processes the noise image together with the text prompt input and generates a high-quality, denoised, text-conditioned representation of the image for segmentation; this helps establish a closer association between text and image features and improves the segmentation accuracy.
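The noise-image formula is not reproduced in this text. A minimal sketch, assuming it follows the standard DDPM forward step x_t = sqrt(a_t) * x + sqrt(1 - a_t) * eps with eps drawn from a standard normal distribution, is given below; the name `alpha_bar_t` for the noise-controlling hyperparameter and the flattened-list image representation are assumptions.

```python
import math
import random
from typing import List, Optional


def add_noise(image: List[float], alpha_bar_t: float,
              rng: Optional[random.Random] = None) -> List[float]:
    """Return sqrt(alpha_bar_t) * x + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    if not 0.0 <= alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must lie in [0, 1]")
    rng = rng or random.Random()
    signal = math.sqrt(alpha_bar_t)       # how much of the image survives
    sigma = math.sqrt(1.0 - alpha_bar_t)  # how much noise is mixed in
    return [signal * pixel + sigma * rng.gauss(0.0, 1.0) for pixel in image]
```

With `alpha_bar_t` close to 1 the output stays near the original image, and with `alpha_bar_t` close to 0 it is almost pure noise.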

In an embodiment of the first aspect, the adapter consists of a multi-layer perceptron; and the adapter is in residual connection with a text encoder in the visual language large model of the transmission line equipment.

Compared with the prior art, the embodiment of the invention has the following beneficial effects: through the adapter composed of the multi-layer perceptron and residual connection with the text encoder, complex nonlinear image information can be effectively converted, the processing capacity of the model on complex features is improved, and meanwhile, the suitability of the model data which is input to the text image diffusion model subsequently is improved.
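A minimal sketch of such an adapter follows. The text does not specify the layer count, the activation, or exactly how the residual connection to the text encoder is made, so the sketch assumes a two-layer perceptron with ReLU whose output is added to an embedding produced by the text encoder. The parameter names, including `base_text_embedding`, are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def adapter(image_features: Vector, base_text_embedding: Vector,
            w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Two-layer perceptron whose output is added to a text-encoder embedding."""
    hidden = relu(linear(w1, b1, image_features))
    projected = linear(w2, b2, hidden)
    # Residual connection (assumed form): the MLP output refines, rather
    # than replaces, the embedding coming from the text encoder.
    return [base + delta for base, delta in zip(base_text_embedding, projected)]
```

With all-zero weights and biases the adapter returns the text-encoder embedding unchanged, which is the usual motivation for a residual form: training starts from a representation the diffusion model already accepts.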

In an embodiment of the first aspect, the diffusion model is obtained through training of a preset text image data set of the transmission line equipment.

Compared with the prior art, the embodiment of the invention has the following beneficial effects: the diffusion model is subjected to iterative optimization through preset text image data of the power transmission line equipment, so that the adaptation degree of the diffusion model to specific scenes and ground objects in the power transmission line equipment image can be improved, and the segmentation accuracy of a subsequent model to a segmentation task containing a new classification target is improved.

In a second aspect, an embodiment of the present invention further provides an image segmentation apparatus for an open vocabulary transmission line device, including: the device comprises an image feature extraction module, a text implicit feature extraction module, an image internal feature extraction module, an image segmentation module and a feature fusion module;

The image feature extraction module is used for acquiring an image to be segmented and a text type label, inputting the image to be segmented into an image encoder in a visual language large model of preset transmission line equipment, and extracting image features;

The text implicit characteristic extraction module is used for inputting the image characteristics to a pre-trained adapter to acquire text implicit characteristics;

The image internal feature extraction module is used for inputting the text implicit feature and the image to be segmented into a pre-trained text image diffusion model to obtain the image internal feature;

The image segmentation module is used for inputting the image internal features into a pre-trained segmentation result predictor to obtain a segmentation result and a prediction result embedding representation;

The feature fusion module is used for acquiring a segmented image by combining the segmentation result, the prediction result embedded representation, the text category label and the image feature.

Drawings

FIG. 1 is a schematic flow chart of an embodiment of an image segmentation method for an open vocabulary transmission line device according to the present invention;

FIG. 2 is a schematic structural diagram of an embodiment of an image segmentation apparatus for an open vocabulary transmission line device according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example I

Referring to fig. 1, an image segmentation method for an open vocabulary transmission line device according to an embodiment of the present invention includes S101 to S105, specifically:

S101: acquiring an image to be segmented and a text category label, inputting the image to be segmented into an image encoder in a preset transmission line equipment visual language large model, and extracting image features.

Further, the preset transmission line equipment visual language large model also comprises a text encoder, and the text category labels are input to the text encoder to obtain text explicit features.

S102: inputting the image features into a pre-trained adapter to obtain text implicit features.

Further, the adapter consists of a multi-layer perceptron; and the adapter is in residual connection with a text encoder in the visual language large model of the transmission line equipment.

Through the adapter composed of the multi-layer perceptron and residual connection with the text encoder, complex nonlinear image information can be effectively converted, the processing capacity of the model on complex features is improved, and meanwhile, the suitability of the model data which is input to the text image diffusion model subsequently is improved.

S103: inputting the text implicit features and the image to be segmented into a pre-trained text image diffusion model to obtain image internal features.

Further, the inputting the text implicit features and the image to be segmented into a pre-trained text image diffusion model to obtain the image internal features includes:

Preferably, the image internal features may be extracted by the following preferred embodiment:

the noise image is first obtained by the following formula:

After the noise image is obtained, the noise image and the text implicit features are input into the UNet module of the diffusion model, and the noise image is denoised:

wherein the terms of the formula denote, respectively: the image internal features; the text implicit features; and the denoising module of the backward process of the diffusion model.

Because the diffusion model calculates the cross attention between the text embedded representation and the image visual representation, potential features of the diffusion model have good distinction and correlation with the language description semantic concept, and therefore when an image segmentation task of open vocabulary is executed, the internal features of the image are extracted through the text image diffusion model, and the segmentation accuracy and generalization capability of a subsequent model can be improved.
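The denoising formula referred to in this step is not reproduced in this text. One plausible reading, written with symbols chosen here purely for illustration (f for the image internal features, x for the image to be segmented, x_t for the noise image, c for the text implicit features, V for the image encoder and MLP for the adapter are all assumptions, not the patent's own notation), is:

```latex
c = \mathrm{MLP}\bigl(V(x)\bigr), \qquad
f = \mathrm{UNet}\bigl(x_t,\ c\bigr)
```

Under this reading the UNet is run on the noise image while conditioned on the text implicit features, and its internal feature maps, rather than a generated image, are what the segmentation result predictor consumes.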

Further, the diffusion model is obtained through training of a text image data set of preset power transmission line equipment.

Preferably, before the image features are extracted through the visual language large model, the text image data of the transmission line equipment are first input into the text image diffusion model to fine-tune it, with the batch size set to 4, fine-tuning run for 10 epochs in total, and all other parameters kept unchanged, so that the text image diffusion model adapts better to the specific scenes and ground objects in transmission line equipment images.

S104: inputting the image internal features into a pre-trained segmentation result predictor to obtain a segmentation result and a prediction result embedding representation.

Further, when training the adapter and the segmentation result predictor, both the degree of correspondence between the segmentation result and the text category label and the segmentation result itself are iteratively trained through a cross-entropy loss function.

Through the cross entropy loss function, not only is iterative training carried out on the prediction segmentation result, but also accurate alignment of the prediction result and the category label text is ensured, and the reliability and accuracy of the segmentation result are improved.

Preferably, when the adapter and the segmentation result predictor are trained, the preset text image data of the transmission line equipment are used as the training set; training runs for 90 epochs, and the model is optimized with an Adam optimizer, with the hyperparameters set to a batch size of 8, a learning rate of 0.0001 and a weight decay of 0.05.
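The two preferred training settings stated in this embodiment can be collected in a small configuration sketch. The numeric values come from the text; the class name, field names and the `steps_per_epoch` helper are hypothetical, and fields left as `None` stand for parameters the text says are kept unchanged or does not state.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainingConfig:
    epochs: int
    batch_size: int
    learning_rate: Optional[float] = None  # None: not stated / left at default
    weight_decay: Optional[float] = None
    optimizer: Optional[str] = None


# Fine-tuning of the text image diffusion model (other parameters unchanged).
DIFFUSION_FINE_TUNING = TrainingConfig(epochs=10, batch_size=4)

# Training of the adapter and the segmentation result predictor.
ADAPTER_AND_PREDICTOR = TrainingConfig(
    epochs=90, batch_size=8, learning_rate=1e-4, weight_decay=0.05, optimizer="Adam"
)


def steps_per_epoch(dataset_size: int, config: TrainingConfig) -> int:
    """Number of optimizer steps per epoch, counting a final partial batch."""
    return math.ceil(dataset_size / config.batch_size)
```

For a hypothetical training set of 1000 images, the second configuration would take 125 steps per epoch.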

Further, the iteratively training the corresponding degree of the segmentation result and the text category label and the segmentation result through the cross entropy loss function includes:

the cross entropy loss function is specifically:

Because the open vocabulary segmentation task involves a large number of new classification targets, improving the segmentation effect of the model requires, besides iterative training of the segmentation result, considering whether the category label text containing the new classification targets is aligned with the prediction result. Therefore, the prediction result embedding representation is fused with the text explicit features to obtain a prediction for the text category label; the degree of alignment between that prediction and the category label text is then iteratively trained through the cross-entropy loss function, while the predicted segmentation result is iteratively trained at the same time, thereby improving the reliability and accuracy of the segmentation result.

Preferably, the segmentation result predictor may be, but is not limited to, a trainable Mask R-CNN.

S105: obtaining a final segmentation result by combining the segmentation result, the prediction result embedding representation, the text category label and the image features, and segmenting the image to be segmented according to the final segmentation result.

Further, the obtaining a final segmentation result by combining the segmentation result, the prediction result embedding representation, the text category label and the image features includes:

Preferably, the final segmentation result can be obtained by the following preferred implementation:

The obtained image internal features are input into the trained segmentation result predictor to obtain the segmentation result and the prediction result embedding representation; the segmentation result and the prediction result embedding representation are then combined with the text of the test object category labels and the image features to segment the image, specifically by the following formula:

According to the preferred embodiment, the semantically rich visual language large model is used to obtain the text explicit features from the text category labels, and a unique feature fusion strategy is provided that jointly considers the image features, the predicted segmentation result features and the text explicit features; these features are fused with the image features in the inference stage, so that deeper information integration is achieved and the segmentation accuracy is improved.

In summary, compared with the prior art, the image segmentation method for open vocabulary transmission line equipment provided by the embodiment of the invention has the following beneficial effects. By combining the text image diffusion model with the visual language large model and optimizing them for transmission line equipment images, the adaptability and robustness of the model in this specific scene are improved, so that the model generalizes better to new target classes. In addition, a unique feature fusion strategy is provided that jointly considers the image features, the predicted segmentation result features and the text explicit features; the output of the image encoder is adjusted through a multi-layer perceptron (MLP) adapter to obtain text implicit features, and these features are fused with the image features in the inference stage, achieving deeper information integration. A text image diffusion model built around a UNet module processes the noise image together with the text prompt input and generates a high-quality, denoised, text-conditioned representation of the image for segmentation, which strengthens the association between text and image features and further improves the segmentation accuracy. Finally, through the cross-entropy loss function, the predicted segmentation result is iteratively trained and accurate alignment between the prediction result and the category label text is ensured, improving the reliability and accuracy of the segmentation result.

Example II

Referring to fig. 2, the embodiment of the present invention further provides an image segmentation apparatus for an open vocabulary transmission line apparatus, including: an image feature extraction module 11, a text implicit feature extraction module 12, an image internal feature extraction module 13, an image segmentation module 14 and a feature fusion module 15.

Further, the image feature extraction module 11 is configured to acquire an image to be segmented and a text category label, input the image to be segmented into an image encoder in a preset transmission line equipment visual language large model, and extract image features; the text implicit feature extraction module 12 is configured to input the image features into a pre-trained adapter to obtain text implicit features; the image internal feature extraction module 13 is configured to input the text implicit features and the image to be segmented into a pre-trained text image diffusion model to obtain image internal features; the image segmentation module 14 is configured to input the image internal features into a pre-trained segmentation result predictor to obtain a segmentation result and a prediction result embedding representation; the feature fusion module 15 is configured to obtain a final segmentation result by combining the segmentation result, the prediction result embedding representation, the text category label and the image features, and to segment the image to be segmented according to the final segmentation result.

Further, the feature fusion module 15 being configured to obtain a final segmentation result by combining the segmentation result, the prediction result embedding representation, the text category label and the image features includes: extracting text explicit features from the text category labels through a text encoder in the preset transmission line equipment visual language large model; and fusing the segmentation result, the prediction result embedding representation, the text explicit features and the image features through a dot product operation to obtain the final segmentation result.

Further, the fusing the segmentation result, the prediction result embedding representation, the text explicit features and the image features through a dot product operation to obtain the final segmentation result includes:

fusing the segmentation result, the prediction result embedding representation, the text explicit features and the image features by the following formula:

Further, the image internal feature extraction module 13 being configured to input the text implicit features and the image to be segmented into a pre-trained text image diffusion model and obtain image internal features includes: extracting a noise image of the image to be segmented through the forward process of the diffusion model; and inputting the noise image and the text implicit features into a denoising module in the backward process of the diffusion model to obtain the image internal features.

Further, the extracting the noise image of the image to be segmented through the forward process of the diffusion model includes:

the noise image acquisition mode specifically comprises the following steps:

Example III

On the basis of the embodiment of the open vocabulary transmission line equipment image segmentation configuration method, another embodiment of the present invention provides an open vocabulary transmission line equipment image segmentation configuration terminal device, where the open vocabulary transmission line equipment image segmentation configuration terminal device includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, where the processor executes the computer program to implement the open vocabulary transmission line equipment image segmentation configuration method according to any embodiment of the present invention.

Illustratively, in this embodiment the computer program may be partitioned into one or more modules, which are stored in the memory and executed by the processor to perform the present invention. The one or more modules may be a series of computer program instruction segments capable of performing particular functions for describing the execution of the computer program in the open vocabulary transmission line device image segmentation configuration device.

The image segmentation configuration equipment of the open vocabulary transmission line equipment can be computing equipment such as a desktop computer, a notebook computer, a palm computer and a cloud server. The open vocabulary transmission line equipment image segmentation configuration terminal equipment can comprise, but is not limited to, a processor and a memory.

The processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor, etc.; the processor is the control center of the open vocabulary transmission line equipment image segmentation configuration device, and connects the various parts of the whole device by using various interfaces and lines. The memory may be used to store the computer program and/or module, and the processor may implement various functions of the open vocabulary transmission line equipment image segmentation configuration device by running or executing the computer program and/or module stored in the memory and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the cellular phone, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, flash card, at least one disk storage device, flash memory device, or other volatile solid-state storage device.

Example IV

On the basis of the embodiment of the open vocabulary transmission line equipment image segmentation configuration method, another embodiment of the present invention provides a storage medium, where the storage medium includes a stored computer program, and when the computer program runs, the equipment where the storage medium is located is controlled to execute the open vocabulary transmission line equipment image segmentation configuration method according to any one of the embodiments of the present invention.

In this embodiment, the storage medium is a computer-readable storage medium, and the computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, and so on. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in certain jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.

The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present invention, and are not to be construed as limiting the scope of the invention. It should be noted that any modifications, equivalent substitutions, improvements, etc. made by those skilled in the art without departing from the spirit and principles of the present invention are intended to be included in the scope of the present invention.

Claims

1. An open vocabulary transmission line equipment image segmentation method is characterized by comprising the following steps:

obtaining a final segmentation result by combining the segmentation result, the prediction result embedding representation, the text category label and the image feature, and segmenting the image to be segmented according to the final segmentation result;

wherein obtaining the final segmentation result by combining the segmentation result, the prediction result embedding representation, the text category label and the image feature comprises:

fusing the segmentation result, the prediction result embedding representation, the explicit text feature and the image feature through a dot product operation to obtain the final segmentation result; wherein fusing the segmentation result, the prediction result embedding representation, the explicit text feature and the image feature through the dot product operation to obtain the final segmentation result comprises the following steps:
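As an illustration only, the dot-product step named in claim 1 can be pictured as scoring each predicted mask embedding against a text feature for every category label and keeping the best-scoring label. The Python sketch below is a minimal example under assumptions: the function names, the toy two-dimensional embeddings, the label names, and the softmax normalization are hypothetical and are not taken from the claims, and the contribution of the image features is omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_masks(mask_embeddings, text_features):
    """Assign each mask the label whose text feature has the largest dot product.

    mask_embeddings: list of vectors, one per predicted mask (assumed layout).
    text_features: dict mapping a category label to its text feature vector.
    """
    labels = list(text_features)
    results = []
    for emb in mask_embeddings:
        # Dot product between the mask embedding and every label's text feature.
        scores = [sum(e * t for e, t in zip(emb, text_features[lab])) for lab in labels]
        probs = softmax(scores)
        best = max(range(len(labels)), key=lambda k: probs[k])
        results.append((labels[best], probs[best]))
    return results

# Toy data: two masks and two hypothetical category labels.
text_features = {"insulator": [1.0, 0.0], "tower": [0.0, 1.0]}
mask_embeddings = [[2.0, 0.5], [0.1, 3.0]]
predictions = classify_masks(mask_embeddings, text_features)
```

Because the label set is just a dictionary of text features, new category names can be scored without retraining, which is the sense in which such a scheme is open vocabulary.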

2. The method for segmenting an image of an open vocabulary transmission line device according to claim 1, characterized in that, when training the adapter and the segmentation result predictor, the degree of correspondence between the segmentation result and the text category label, and the segmentation result, are iteratively trained through a cross entropy loss function.
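For orientation, the cross entropy loss mentioned in claim 2 can be sketched for a single prediction as the negative log-probability assigned to the true category. This is the generic textbook form, shown as an assumption; the function and variable names are hypothetical, and it is not presented as the specific loss of claim 3.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-softmax probability of the target class (generic form)."""
    m = max(logits)
    # log-sum-exp with the maximum subtracted for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target_index]

# An undecided prediction is penalized more than a confident correct one.
loss_uniform = cross_entropy([0.0, 0.0], target_index=0)
loss_confident = cross_entropy([5.0, 0.0], target_index=0)
```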

3. The method for segmenting an image of an open vocabulary transmission line device according to claim 2, wherein iteratively training the degree of correspondence between the segmentation result and the text category label, and the segmentation result, through the cross entropy loss function comprises:

the cross entropy loss function is specifically:

4. The method for segmenting the image of the open vocabulary transmission line equipment according to claim 1, wherein the step of inputting the implicit text feature and the image to be segmented into a pre-trained text image diffusion model to obtain the internal image feature comprises the following steps:

5. The method for segmenting an image of an open vocabulary transmission line device according to claim 4, wherein extracting the noisy image of the image to be segmented through the forward process of the diffusion model comprises:

the noisy image is obtained specifically as follows:

wherein the quantities in the formula are, in order: the noisy image; the step length; the image to be segmented; the hyperparameters controlling the noise magnitude in the model; the noise sampled from the standard normal distribution; and the standard normal distribution.
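The quantities listed in claim 5 match the shape of the standard forward (noising) step of a denoising diffusion model, x_t = sqrt(a_bar_t) * x + sqrt(1 - a_bar_t) * eps, with eps drawn from a standard normal distribution. Assuming that is the intended form, a small Python sketch follows; the linear schedule, its endpoint values, the flattened toy image, and the function names are hypothetical choices made only for illustration.

```python
import math
import random

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar

def add_noise(pixels, t, alpha_bar, rng):
    """Return (x_t, eps) with x_t = sqrt(a_bar_t) * x + sqrt(1 - a_bar_t) * eps."""
    a = alpha_bar[t]
    eps = [rng.gauss(0.0, 1.0) for _ in pixels]  # noise from N(0, 1) per pixel
    x_t = [math.sqrt(a) * p + math.sqrt(1.0 - a) * e for p, e in zip(pixels, eps)]
    return x_t, eps

alpha_bar = make_alpha_bar(1000)
rng = random.Random(0)            # fixed seed so the sketch is repeatable
image = [0.1, 0.5, 0.9, 0.3]      # flattened toy "image to be segmented"
noisy, eps = add_noise(image, t=100, alpha_bar=alpha_bar, rng=rng)
```

A larger step t gives a smaller a_bar_t, so the noisy image retains less of the original and more of the sampled noise.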

6. The method for segmenting an image of an open vocabulary transmission line device according to claim 1, characterized in that the adapter consists of a multi-layer perceptron, and the adapter is residually connected to a text encoder in the visual language large model of the transmission line equipment.
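A minimal sketch of the kind of structure claim 6 describes, assuming a two-layer perceptron whose output is added to a text-encoder feature of the same dimension. How the residual connection is actually wired is not detailed in the claim, so the element-wise addition below, the layer sizes, the ReLU activation, and all names are hypothetical.

```python
def linear(x, weights, bias):
    """Dense layer: one output per weight row, y_j = sum_i w_ji * x_i + b_j."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]

def adapter(image_feature, text_feature, w1, b1, w2, b2):
    """Two-layer perceptron on the image feature, plus a residual text feature."""
    hidden = relu(linear(image_feature, w1, b1))
    projected = linear(hidden, w2, b2)
    # Assumed residual connection: add the text-encoder feature element-wise.
    return [p + t for p, t in zip(projected, text_feature)]

# Toy weights chosen so the result is easy to check by hand.
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[2.0, 0.0], [0.0, 2.0]], [0.5, 0.5]
implicit_text_feature = adapter([1.0, -2.0], [1.0, 1.0], w1, b1, w2, b2)
```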

7. The method for segmenting an image of an open vocabulary transmission line device according to claim 1, characterized in that the diffusion model is obtained through training on a preset text image data set of transmission line equipment.

8. An open vocabulary transmission line equipment image segmentation device, characterized by comprising: an image feature extraction module, a text implicit feature extraction module, an image internal feature extraction module, an image segmentation module and a feature fusion module;

the feature fusion module is used for obtaining a final segmentation result by combining the segmentation result, the prediction result embedding representation, the text category label and the image feature, and segmenting the image to be segmented according to the final segmentation result;

wherein the feature fusion module being used for obtaining the final segmentation result by combining the segmentation result, the prediction result embedding representation, the text category label and the image feature comprises:

fusing the segmentation result, the prediction result embedding representation, the explicit text feature and the image feature through a dot product operation to obtain the final segmentation result;

wherein fusing the segmentation result, the prediction result embedding representation, the explicit text feature and the image feature through the dot product operation to obtain the final segmentation result comprises the following steps: