CN118279724A - Industrial vision multi-downstream task processing method based on large language model - Google Patents
- Publication number: CN118279724A (application CN202410710722.XA)
- Authority: CN (China)
- Legal status: Granted
Abstract
The invention discloses an industrial vision multi-downstream task processing method based on a large language model, which comprises the following steps: acquiring an industrial query image and a question text, dividing the industrial query image into a plurality of image blocks, sending the image blocks into an image feature encoder to extract image features, and converting the image features into a query image token; identifying the question text to obtain a visual task type, and generating an input text for a large language model according to the system setting, the question text, the visual task type and the query image token; and inputting the input text into the large language model, which processes the different visual tasks according to the designated task type, the input system setting and the task requirements. The invention can efficiently and accurately process various downstream tasks in industrial vision scenes, and improves the adaptability and performance of large models on industrial vision multi-downstream tasks.
Description
Technical Field
The invention relates to the technical field of visual task applications of large language models, and in particular to an industrial vision multi-downstream task processing method based on a large language model.
Background
Currently, traditional large models are typically optimized for a single task, such as object detection or image classification, which makes it difficult for them to accommodate different types of tasks without retraining and fine-tuning. This specialization limits a model's performance on new tasks or on tasks that change even slightly.
Insufficient generalization ability: traditional models are mainly trained on specific data sets that tend to cover limited scenes and variations, so they often perform poorly when new conditions are encountered in an actual industrial environment, particularly when the data distribution changes. Efficiency and cost issues: whenever a new downstream task is introduced, a large amount of data must be re-collected and the model fine-tuned, which is time-consuming and costly; in a rapidly changing industrial environment this approach is especially inefficient.
In traditional large model applications, each type of visual task often requires a separate model architecture or extensive fine-tuning for each new task. For example, one model may be optimized for image classification, while another is dedicated to object detection. Not only does this approach increase the complexity and cost of deploying multiple models, but it requires a new round of data labeling and model training each time a new downstream task is added, which is time consuming and costly.
Industrial vision systems generally require extremely high flexibility and efficiency, because they need to quickly accommodate various visual inspection tasks on the production line, such as detecting different types of defects, segmenting multiple objects, and counting objects simultaneously. These tasks place stringent requirements on data format, processing accuracy and response time, and conventional large models often have difficulty meeting the flexible processing requirements of such multitasking.
Disclosure of Invention
The invention aims to overcome the above technical defects by providing an industrial vision multi-downstream task processing method based on a large language model, which solves the technical problem that traditional large models in the prior art have difficulty meeting the flexible processing requirements of multiple tasks.
In order to achieve the above technical object, in a first aspect, the present invention provides a method for processing industrial vision multiple downstream tasks based on a large language model, comprising the following steps:
Acquiring an industrial query image and a question text, dividing the industrial query image into a plurality of image blocks, sending the image blocks into an image feature encoder to extract image features, and converting the image features into a query image token;
identifying the question text to obtain a visual task type, and generating an input text for a large language model according to the system setting, the question text, the visual task type and the query image token;
and inputting the input text into the large language model, and processing different visual tasks by the large language model according to the designated task type, the input system setting and the task requirements.
Compared with the prior art, the industrial vision multi-downstream task processing method based on the large language model has the beneficial effects that:
The invention provides an industrial vision multi-downstream task processing method based on a large language model, which is used for efficiently and accurately processing various downstream tasks in an industrial vision scene and improving the adaptability and performance of the large model on the industrial vision multi-downstream task:
Model structure optimization: by integrating the image encoder and the image-to-text feature converter, the invention optimizes the conversion process from industrial images to refined image tokens, ensuring that the model is able to capture detailed information critical to each downstream task. Meanwhile, various prompt image processing strategies are introduced to support small sample learning and image prompt tasks.
Structured output of large language model: according to different task requirements, the invention designs a structured output format, which comprises a customized output template aiming at tasks such as image description, target positioning, anomaly detection, segmentation, classification, counting and the like, so that the model can generate an intuitive and easy-to-understand result.
Enhanced generalization ability: the invention provides a unified model architecture, which enhances the generalization capability of the model to new unseen scenes by sharing learning across various visual tasks. This approach significantly reduces the need for retraining for a particular task.
Efficient multitasking capability: by integrating multiple downstream visual tasks into a single model, the present invention significantly improves efficiency. The unified method allows the model to switch seamlessly between tasks such as image description, target detection and anomaly detection, without requiring multiple independent models.
Adaptability and flexibility: the proposed model incorporates a dynamic adaptation mechanism allowing it to be adjusted on the fly to accommodate different task demands. This adaptability is critical to maintaining high performance in a wide variety of industrial vision applications, ensuring that the model remains effective even when the task conditions change.
According to some embodiments of the invention, generating the input text of the large language model further comprises the steps of:
Acquiring an industrial prompt image, dividing the industrial prompt image into a plurality of image blocks, sending the image blocks into an image feature encoder to extract image features, and converting the image features into a prompt image token;
and generating the input text for the large language model according to the system setting, the question text, the visual task type, the query image token and the prompt image token.
According to some embodiments of the invention, the visual task types include:
image description task, object detection task, anomaly detection task, segmentation task, classification task, and counting task.
According to some embodiments of the invention, before inputting the text into the large language model, the method comprises the steps of:
creating annotation information for the training images to obtain a training data set, wherein the training data set comprises descriptive text annotations, segmentation annotations of salient objects, rotated rectangular box annotations, and abnormal region annotations for abnormal images;
Descriptive text labeling: describing each training image in detail, including the name, color, position, relative size and activity state of the objects appearing in the image;
Salient object segmentation labeling: each salient object in the training image is segmented and annotated, and the outline of each salient object is defined using pixel-level annotation, so that the model recognizes and understands the shape and spatial position of each independent object in the image;
Rotated rectangular box labeling: drawing a rotated rectangular box for each salient object in the image;
labeling an abnormal region: for an image containing abnormal conditions, marking an abnormal region, accurately defining the position of occurrence of the abnormality by using segmentation marks, and simultaneously providing description to indicate the nature of the abnormality and the reason of occurrence of the abnormality;
the training data set is input into the large language model for alternate task training.
According to some embodiments of the invention, the data processing of the object detection task comprises the steps of:
Normalizing the industrial query image coordinate values to be in the range of 0-99, wherein 0 represents the leftmost (x) or uppermost (y) of the image and 99 represents the rightmost (x) or lowermost (y) of the image;
Labeling the position and shape information of each target object, wherein the labeling information of each target object comprises center point coordinates (x, y), width (w), height (h) and rotation angle (theta);
center point coordinates (x, y): representing the position of the center point of the target object in the image;
width (w) and height (h): respectively representing the width and the height of the target object, and normalizing the size to be in the range of 0-99 to represent the proportion of the target object relative to the whole image size;
Rotation angle (θ): indicating the rotation angle of the target object relative to the horizontal line of the image, ranging from -180 degrees to 180 degrees; an angle of 0 indicates no rotation, positive values indicate clockwise rotation, and negative values indicate counterclockwise rotation.
According to some embodiments of the invention, the data processing of the anomaly detection task comprises the steps of:
Dividing the industrial query image into a 24x24 grid to obtain 576 image blocks, wherein each image block represents a small part of the image and its state is represented as 0 or 1, wherein 0 indicates that the image block does not contain a target object or abnormal region, and 1 indicates that the image block contains the target object or abnormal region;
for three consecutive image blocks in each row, all possible state combinations based on the three image blocks are encoded as one number from 0 to 7, each row being divided into 8 groups of three consecutive image blocks, so each row can be represented by 8 serial numbers between 0 and 7;
Serial number output: the abnormality detection result of the industrial query image is output as 24 serial numbers, and each serial number corresponds to one line in the image and represents the abnormality detection state of the industrial query image.
According to some embodiments of the invention, inputting the input text into the large language model comprises the steps of:
inputting the input text into a word embedding model for word vector conversion to obtain a word embedding vector that the model can process:
T = f_embed(X),
where X is the text content, f_embed is the word embedding model, and T is the word embedding vector input to the large language model.
According to some embodiments of the invention, after processing the different visual tasks, the method further comprises the steps of:
A low-rank adaptation (LoRA) fine-tuning strategy is used to fine-tune the large language model, while the feature converter and image encoder are kept frozen during training.
In a second aspect, the present invention provides a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the industrial vision multi-downstream task processing method based on the large language model according to any one of the first aspects.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and better understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a large language model based industrial vision multi-downstream task processing method according to an embodiment of the present invention;
FIG. 2 is an input image of an image description provided by one embodiment of the present invention;
FIG. 3a is an input image of object detection provided by one embodiment of the present invention;
FIG. 3b is an output image of object detection provided by one embodiment of the present invention;
FIG. 4a is an input image of anomaly detection provided by one embodiment of the present invention;
FIG. 4b is an output image of anomaly detection provided by one embodiment of the present invention;
FIG. 5a is an input image of an image segmentation provided in one embodiment of the present invention;
FIG. 5b is an output image of image segmentation provided by one embodiment of the present invention;
FIG. 6 is a classified input image provided by one embodiment of the present invention;
FIG. 7a is a counted input image provided by one embodiment of the present invention;
Fig. 7b is an output image of a count provided by an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It should be noted that although functional modules are depicted in block diagrams and logical sequences are shown in the flowchart, in some cases the steps shown or described may be performed in a different order than depicted. The terms "first", "second" and the like in the description, the claims and the above-described figures are used to distinguish between similar elements and not necessarily to describe a particular sequential or chronological order.
In the field of industrial vision, many scenes have a high degree of similarity in visual characteristics despite task diversification (e.g., image description, object detection, anomaly detection, segmentation and counting). However, conventional large models often suffer from inadequate generalization capability.
Conventional models tend to be trained and optimized for specific tasks, such as standalone object detection or anomaly detection models. This over-specialization results in models that perform poorly in the face of new tasks that vary only slightly, because they fail to learn the generic visual features that can be applied across tasks. Moreover, most models are trained on fixed data sets that often cover only limited scene changes. Such data limitations leave models susceptible to performance degradation when dealing with new situations not seen in the actual industrial environment, because they lack the ability to handle different data distributions.
Embodiments of the present invention will be further described below with reference to the accompanying drawings.
In one embodiment, the industrial vision multi-downstream task processing method based on the large language model comprises the following steps: acquiring an industrial query image and a question text, dividing the industrial query image into a plurality of image blocks, sending the image blocks into an image feature encoder to extract image features, and converting the image features into a query image token; identifying the question text to obtain a visual task type, and generating an input text for a large language model according to the system setting, the question text, the visual task type and the query image token; and inputting the input text into the large language model, which processes different visual tasks according to the designated task type, the input system setting and the task requirements.
1. Extracting image features: key features are first extracted from the industrial image using an image encoder, and these features are converted into image tokens. In addition, feature extraction and tokenization are performed similarly for any prompt image that carries auxiliary information.
2. Image-text feature integration: the image token is combined with the associated text information (e.g., task instructions, question text, etc.) via an image-to-text feature converter to provide comprehensive input for the large language model.
3. Large language model processing: the combined image features and text input are processed in depth by the large language model according to the designated task type. The model can handle different visual tasks according to the input system settings and task requirements.
4. And (3) output generation: the model generates structured outputs through specific text templates according to the processing results, and the outputs directly correspond to various task requirements, such as descriptive texts, target positions, anomaly detection results, segmentation information, classification labels, counting results and the like. For target detection, anomaly detection and segmentation tasks, further visualization processing of the output information is required.
In order to process various downstream tasks in industrial vision scenes efficiently and accurately, and to improve the adaptability and performance of large models on these tasks, the invention provides an industrial vision multi-downstream task processing method based on a large language model. For tasks such as image description, target detection, anomaly detection, segmentation, classification and counting, the invention realizes the following critical modifications and optimizations:
1) Model structure optimization: by integrating the image encoder and the image-to-text feature converter, the invention optimizes the conversion process from industrial images to refined image tokens, ensuring that the model is able to capture detailed information critical to each downstream task. Meanwhile, various prompt image processing strategies are introduced to support small sample learning and image prompt tasks;
2) Structured output of large language model: according to different task demands, the invention designs a structured output format, which comprises a customized output template aiming at tasks such as image description, target positioning, anomaly detection, segmentation, classification, counting and the like, so that the model can generate an intuitive and easy-to-understand result;
3) And (3) manufacturing a high-quality training set: in the aspect of data preparation, the invention provides a set of detailed language description labeling schemes for industrial visual images.
Model structure
(1) Image feature extraction
- Each input image (size 336x336) is divided into 576 image blocks (patches);
- these image blocks are fed into an image feature encoder (based on the ViT structure of CLIP), which outputs 576 image tokens, each of dimension 1024;
- the output image tokens are processed by an MLP adapter to obtain text tokens of size 576x4096.
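To make the dimensions above concrete, the following is a minimal PyTorch sketch of the patch-to-token pipeline. It is an illustration under assumptions: the encoder here is a randomly initialized stand-in for the CLIP ViT-L/14 model named in the fine-tuning section, and the module names are not from the patent.

import torch
import torch.nn as nn

class ImageToTokens(nn.Module):
    # 336x336 image -> 576 patches -> 576x1024 ViT features -> MLP adapter -> 576x4096 tokens
    def __init__(self, vit_dim=1024, llm_dim=4096, patch=14):
        super().__init__()
        # Stand-in for the CLIP ViT-L/14 encoder: patch embedding plus
        # a couple of transformer layers (untrained weights).
        self.patch_embed = nn.Conv2d(3, vit_dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=vit_dim, nhead=16, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Image-to-text feature converter: a two-layer MLP adapter.
        self.adapter = nn.Sequential(
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, image):                 # image: (B, 3, 336, 336)
        x = self.patch_embed(image)           # (B, 1024, 24, 24)
        x = x.flatten(2).transpose(1, 2)      # (B, 576, 1024): 576 patch tokens
        x = self.encoder(x)                   # ViT-style feature extraction
        return self.adapter(x)                # (B, 576, 4096) tokens for the LLM

tokens = ImageToTokens()(torch.randn(1, 3, 336, 336))
print(tokens.shape)                           # torch.Size([1, 576, 4096])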
(2) Large language model input
The input format is designed to support flexible dialog modes, including system settings, task types, question text, and query and prompt image tokens. The specific format is as follows:
[System setting][Task type][Question text][Query image starter][Query image token][Query image terminator][Prompt image starter][Prompt image token][Prompt image terminator][...]
- The prompt image portion is optional; one or more prompt images may be provided as references. If there is no prompt image, this portion is omitted; if there are multiple prompt images, multiple such segments are appended;
- The text content of the system setting part is set to:
[System setting] = "A dialogue focused on industrial vision where an AI assistant provides precise, detail-oriented, and polite responses."
(Translation: a dialogue around industrial vision in which an AI assistant provides accurate, detailed and polite answers.)
- The text content of the task type part is selected from the following six:
[Task type] = "<describe>" / "<object_detection>" / "<classify_anomaly>" / "<classify_object>" / "<classify_image>" / "<count>"
- The question text is the user input portion;
- start and stop symbols mark the start and end of the query and prompt images respectively, and are set to:
[Query image starter] = "<query_img_s>"
[Query image terminator] = "<query_img_e>"
[Prompt image starter] = "<prompt_img1_s>"
[Prompt image terminator] = "<prompt_img1_e>"
If there are multiple prompt images, the indices in their start and stop symbols increase sequentially;
- Text-to-word-vector conversion: a word embedding model converts text into embedding vectors that the model can process:
T = f_embed(X),
where X is the text content, f_embed is the word embedding model, and T is the word embedding vector input to the large language model; the embedding model is applied to transform all of the above text contents;
- the query image token and the prompt image token are both obtained by the image feature extraction method of 3.1.1.
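Assembled as plain text (with image tokens spliced in at the embedding level in practice), the format above can be sketched as follows. The helper function and the example question are hypothetical; only the special-token strings come from the text.

SYSTEM = ("A dialogue focused on industrial vision where an AI assistant "
          "provides precise, detail-oriented, and polite responses.")

def build_input(task_type, question, n_prompt_images=0):
    # Concatenate the segments in the order defined above. The *_tokens
    # placeholders stand for the 576 image tokens, which are injected as
    # embeddings rather than literal text.
    parts = [SYSTEM, task_type, question,
             "<query_img_s>", "<query_img_tokens>", "<query_img_e>"]
    for i in range(1, n_prompt_images + 1):
        parts += [f"<prompt_img{i}_s>", f"<prompt_img{i}_tokens>", f"<prompt_img{i}_e>"]
    return "".join(parts)

print(build_input("<object_detection>", "Describe the position of the screw.",
                  n_prompt_images=1))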
(3) Large language model output
Depending on the task type, the model produces a relatively fixed output format; the specific output formats follow the training set making section below.
For the target detection task, a rotated rectangular box is drawn on the input image according to the coordinates, size and rotation angle given in the answer (see the sketch below);
for the anomaly detection and segmentation tasks, the abnormal region or segmented region is marked at the corresponding position on the input image based on the image block indices in the answer.
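For the detection task, the answer parsing and drawing steps might look like the following sketch (OpenCV; it assumes the 0-99 coordinate convention and the answer template defined in the sections below, with θ in degrees and positive values clockwise; the regex and helper names are illustrative assumptions):

import cv2
import numpy as np
import re

# Parse the template "A/The {object} is located at {x, y, w, h, θ}".
DETECTION = re.compile(
    r"(?:A|The)\s+(?P<object>.+?)\s+is located at\s*\{?\s*"
    r"(?P<x>\d+),\s*(?P<y>\d+),\s*(?P<w>\d+),\s*(?P<h>\d+),\s*(?P<theta>-?\d+)")

def draw_rotated_box(image, x, y, w, h, theta, color=(0, 255, 0)):
    # Map normalized 0-99 coordinates back to pixels and draw the rotated box.
    ih, iw = image.shape[:2]
    cx, cy = x / 99.0 * iw, y / 99.0 * ih             # center in pixels
    bw, bh = w / 99.0 * iw, h / 99.0 * ih             # size in pixels
    pts = cv2.boxPoints(((cx, cy), (bw, bh), theta))  # 4 corners of the rotated rect
    cv2.polylines(image, [pts.astype(np.int32)], isClosed=True,
                  color=color, thickness=2)
    return image

canvas = np.zeros((336, 336, 3), dtype=np.uint8)
m = DETECTION.search("The screw is located at {50, 50, 30, 10, 45}")
if m:
    x, y, w, h, theta = (int(m[g]) for g in ("x", "y", "w", "h", "theta"))
    draw_rotated_box(canvas, x, y, w, h, theta)       # centered box, rotated 45 degrees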
In one embodiment, the industrial vision multi-downstream task processing method based on the large language model comprises the following steps: acquiring an industrial query image and a question text, dividing the industrial query image into a plurality of image blocks, sending the image blocks into an image feature encoder to extract image features, and converting the image features into a query image token; identifying the question text to obtain a visual task type, and generating an input text for a large language model according to the system setting, the question text, the visual task type and the query image token; and inputting the input text into the large language model, which processes different visual tasks according to the designated task type, the input system setting and the task requirements.
Generating the input text of the large language model further comprises the steps of: acquiring an industrial prompt image, dividing the industrial prompt image into a plurality of image blocks, sending the image blocks into the image feature encoder to extract image features, and converting the image features into a prompt image token; and generating the input text for the large language model according to the system setting, the question text, the visual task type, the query image token and the prompt image token.
Further, the visual task types include: image description task, object detection task, anomaly detection task, segmentation task, classification task, and counting task.
Further, before inputting the text large language model, the method comprises the steps of:
Creating annotation information for the training images to obtain a training data set, wherein the training data set comprises descriptive text annotations, segmentation annotations of salient objects, rotated rectangular box annotations, and abnormal region annotations for abnormal images. Descriptive text labeling: describing each training image in detail, including the name, color, position, relative size and activity state of the objects appearing in the image. Salient object segmentation labeling: each salient object in the training image is segmented and annotated, and the outline of each salient object is defined using pixel-level annotation, so that the model recognizes and understands the shape and spatial position of each independent object in the image. Rotated rectangular box labeling: drawing a rotated rectangular box for each salient object in the image. Abnormal region labeling: for an image containing abnormal conditions, marking the abnormal region, accurately defining the position where the abnormality occurs using segmentation marks, and simultaneously providing a description indicating the nature of the abnormality and the reason for its occurrence. The training data set is input into the large language model for alternating task training.
Training set making
(1) Preliminary data annotation
The object is: rich annotation information is created for all images, including descriptive text, segmentation annotations for salient objects, rotation rectangular box annotations, and anomaly region annotations for anomaly images.
The steps are as follows:
- Descriptive text labeling: each image is described in detail, including the objects present in the image and their colors, locations, relative sizes, and any significant activity or state. The description is as comprehensive as possible, covering all the important elements and details in the image;
- Salient object segmentation labeling: each salient object in the image is accurately segmented and annotated. Pixel-level annotation defines the outline of each object, ensuring that the model can identify and understand the shape and spatial location of each individual object in the image;
-rotating rectangular box labeling: drawing a rotating rectangular box for each salient object in the image;
Abnormal region labeling (abnormal image only): for an image containing an abnormal situation, an abnormal region needs to be marked specifically. Segmentation markers are used to accurately define the location where the anomaly occurred, while providing a short description indicating the nature and possible cause of the anomaly.
(2) Target detection task data processing mode
In order to meet the requirements of a large language model in processing an image target detection task, a specific data processing mode is defined by the method. The method aims at realizing efficient and accurate target detection by marking the position and shape information of each target object and converting the information into a format which can be understood by a model.
Labeling information: the labeling information of each target object includes center point coordinates (x, y), width (w), height (h), and rotation angle (θ).
Center point coordinates (x, y): representing the position of the center point of the target object in the image. The coordinate values are normalized to a range of 0-99, where 0 represents the leftmost (x) or uppermost (y) edge of the image and 99 represents the rightmost (x) or lowermost (y) edge of the image.
Width (w) and height (h): representing the width and height, respectively, of the target object, the dimensions are also normalized to the range of 0-99, reflecting the ratio of the target relative to the overall image size.
Rotation angle (θ): representing the rotation angle of the target object relative to the horizontal line of the image, ranging from -180 degrees to 180 degrees. An angle of 0 indicates no rotation, a positive value indicates clockwise rotation, and a negative value indicates counterclockwise rotation.
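Producing the normalized annotation from pixel coordinates follows directly from the convention above; the helper below is a hypothetical illustration, not part of the patent.

def normalize_box(cx_px, cy_px, w_px, h_px, theta_deg, img_w, img_h):
    # Convert a pixel-space rotated box to the 0-99 normalized annotation.
    # theta stays in degrees, in [-180, 180], positive = clockwise.
    to99 = lambda v, full: max(0, min(99, round(v / full * 99)))
    return (to99(cx_px, img_w), to99(cy_px, img_h),
            to99(w_px, img_w), to99(h_px, img_h), theta_deg)

print(normalize_box(640, 360, 200, 100, -30, 1280, 720))  # (50, 50, 15, 14, -30)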
(3) Abnormality detection and division task data processing method
In order to effectively adapt the image segmentation and anomaly detection tasks to the processing capabilities of large language models, an innovative data representation method is employed. By subdividing the image into a 24x24 grid of image blocks and using a coding mechanism, it simplifies the model's output while achieving efficient and accurate image segmentation and anomaly detection.
Image division: first, each input image is divided into a 24x24 grid. Each image block in the grid represents a small portion of the image, and its state is represented as 0 or 1, where 0 indicates that the image block does not contain the target object or abnormal region, and 1 indicates that it does.
Encoding states of three consecutive patches: for three consecutive image blocks in each row, their states (0 or 1) are encoded as a number from 0 to 7. This encoding is based on all possible state combinations of three image blocks (i.e., 000, 001, 010, 011, 100, 101, 110, 111), each corresponding to a unique number.
Row representation: since each row is divided into 8 groups of three consecutive image blocks, it can be represented by 8 serial numbers between 0 and 7; all the information of a row is thus compressed into an 8-digit serial number. In addition, it is specifically prescribed that if an entire row is 0, a single 0 is used to represent that row.
Serial number output: the segmentation or anomaly detection result for each image will be output as 24 serial numbers, one for each line in the image. These sequence numbers are regarded as words, and taken together represent the segmentation or abnormality detection state of the entire image.
Examples: if the sequence of image block states in a row is 000, 111, 010, 001, 111, 000, 010, 111, the corresponding coding sequence number is 0,7, 2, 1,7,0, 2, 7, this row is abbreviated as "07217027".
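The row encoding above is base-8 packing of binary triples. A sketch of the encoder and decoder, including the special single-0 rule for all-zero rows (helper names are hypothetical):

def encode_row(states):
    # states: 24 ints in {0, 1}. Returns the 8-digit serial number string,
    # or "0" if the whole row is empty (the special rule above).
    assert len(states) == 24
    if not any(states):
        return "0"
    return "".join(str(4 * states[i] + 2 * states[i + 1] + states[i + 2])
                   for i in range(0, 24, 3))

def decode_row(code):
    if code == "0":
        return [0] * 24
    bits = []
    for d in code:                         # each digit expands to 3 bits
        n = int(d)
        bits += [(n >> 2) & 1, (n >> 1) & 1, n & 1]
    return bits

row = [0,0,0, 1,1,1, 0,1,0, 0,0,1, 1,1,1, 0,0,0, 0,1,0, 1,1,1]
assert encode_row(row) == "07217027"       # matches the worked example above
assert decode_row("07217027") == row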
(4) Answer template
The templates described below are for reference only; the actual answers may take many different forms and expressions. Training takes this diversity of answers into account so that the model accommodates various possible queries and scenarios.
Image description task: the answer templates of the image description task are text descriptions in the data annotation.
Target detection task:
The model is intended to predict the coordinates, size and rotation angle of a rotated rectangular box, so the answer should be simple and fixed. The following design is therefore used:
"A/The {object} is located at {x, y, w, h, θ}" (Translation: the {object} is located at {coordinates}.)
Abnormality detection task:
The model judges whether there is an abnormality; if so, it gives a description of the abnormality and a binary image code to help locate it:
When there is no abnormality:
"This image does not contain any anomalies." (Translation: this image does not contain any anomaly.)
"There is no {specific anomaly} detected in the image." (Translation: no {specific anomaly} is detected in the image.)
When there is an abnormality:
"This image contains {type of anomaly} anomalies. The binary map of anomalies is represented by the following sequence: {}, {}, {}, ..., indicating the distribution of anomalies across the image." (Translation: this image contains {type of anomaly} anomalies; the binary anomaly map is represented by the following sequence, indicating the distribution of the anomalies over the image.)
Segmentation tasks: similar to anomaly detection, it is necessary to determine whether the object is present in the image, and if so, binary image coding is given:
There is no corresponding object to be placed in the space,
"THIS IMAGE does not contain any { object }" (translation: this image does not contain any { object })
The object is:
"The segmentation of the image is represented by the following sequence: {}, {}, {}, ..., indicating the distribution of different regions across the image." ( Translation: the segmentation of the image is represented by the following sequence: { }, { },... ")
Classification tasks: provides a classification result, and requires a very simple answer,
"THE IMAGE IS CLASSIFIED AS A { object }" (translation: the image is classified as { object })
Counting tasks: the requirement explicitly indicates the number of specific objects in the image,
"There are { } { objects } visible in the image" ("translation: there are { } objects } visible in the image)
Fine tuning method
(1) Pre-training model
-An image encoder: CLIP-ViT (L-14)
- An image-text feature converter: an MLP adapter pre-trained using BLIP captions on the 558K subset of the LAION-CC-SBU dataset
-Large language model: vicuna-v1.5 (7B)
(2) Fine tuning method
- Fine-tune the large language model using a low-rank adaptation (LoRA) strategy, while keeping the feature converter and the image encoder frozen during training;
- train the different tasks alternately;
- in the segmentation task, if there are multiple objects in a training image, a random subset of them is selected for questioning each time;
- fine-tuning for small sample learning covers the target detection, anomaly detection, segmentation, classification and counting tasks.
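A minimal sketch of the LoRA setup described above, using the Hugging Face peft library; the checkpoint id, target modules and hyperparameters are illustrative assumptions rather than values given in the text.

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Vicuna-v1.5 (7B) language model backbone (checkpoint id assumed).
llm = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.5", torch_dtype=torch.float16)

# Low-rank adapters on the attention projections; r and alpha are assumptions.
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(llm, lora_cfg)   # base weights frozen, adapters trainable

# The image encoder and MLP feature converter are separate modules and are
# likewise kept frozen, e.g.:
#   for p in image_encoder.parameters(): p.requires_grad_(False)

model.print_trainable_parameters()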
(3) Question templates:
The templates described below are for reference only; the actual questions may be phrased in a variety of different sentence patterns and expressions. Training takes this diversity of questions into account so that the model accommodates various possible queries and scenarios.
Image description task:
Question template: "Describe the image in detail." (Translation: describe the image in detail.)
Target detection task:
question templates: "Describe the position of { object }" (translation: description of the location of { object })
Small sample question template: "Describe the position of the object in prompt images." (Translation: describe the position of the object shown in the prompt images.)
Abnormality detection task:
Question template: "Does this image contain any anomalies? If yes, describe the type of anomaly and its distribution." (Translation: does this image contain any anomalies? If so, describe the anomaly type and its distribution.)
Small sample question template: "Does this image contain any anomalies compared with the prompt images? If yes, describe the type of anomaly and its distribution." (Translation: does this image contain any anomalies compared with the prompt images? If so, describe the anomaly type and its distribution.)
Segmentation tasks:
Question template: "How is the {object} segmented in this image? Describe its boundary and relation to other objects." (Translation: how is the {object} in this image segmented? Describe its boundaries and its relationships to other objects.)
Small sample question template: "How is the {object} in prompt images segmented in this image? Describe its boundary and relation to other objects." (Translation: how is the {object} from the prompt images segmented in this image? Describe its boundaries and its relationships to other objects.)
Classification tasks:
Question template: "Classify the image within one of the given classes: {object1}, {object2}, ..." (Translation: classify the image into one of the given classes: {object1}, {object2}, ...)
Small sample question template: "Is this image of the object in prompt images?" (Translation: is this image of the object shown in the prompt images?)
Counting tasks:
Question template: "How many {object} are in the image?" (Translation: how many {object} are in the image?)
Small sample question template: "How many of the object in prompt images are in the query image?" (Translation: how many of the object in the prompt images are in the query image?)
Actual operation results: for each task, the left side is the image input and text input, and the right side is the result of the model of the present invention; examples for image description, target detection, anomaly detection, segmentation, classification and counting are shown in FIGS. 2 to 7b.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Furthermore, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions that are executed by a processor or a controller, for example, by one of the processors in the above-described terminal embodiment, so that the above-described processor performs the industrial vision multi-downstream task processing method based on the large language model in the above-described embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the above embodiment, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present invention, and these equivalent modifications and substitutions are intended to be included in the scope of the present invention as defined in the appended claims.
The above-described embodiments of the present invention do not limit the scope of the present invention. Any other corresponding changes and modifications made in accordance with the technical idea of the present invention shall be included in the scope of the claims of the present invention.
Claims (9)
1. The industrial vision multi-downstream task processing method based on the large language model is characterized by comprising the following steps of:
Acquiring an industrial query image and a question text, dividing the industrial query image into a plurality of image blocks, sending the image blocks into an image feature encoder to extract image features, and converting the image features into a query image token;
identifying the question text to obtain a visual task type, and generating an input text for a large language model according to the system setting, the question text, the visual task type and the query image token;
and inputting the input text into the large language model, and processing different visual tasks by the large language model according to the designated task type, the input system setting and the task requirements.
2. The large language model based industrial vision multiple downstream task processing method of claim 1, wherein generating the input text of the large language model further comprises the steps of:
acquiring an industrial prompt image, dividing the industrial prompt image into a plurality of image blocks, sending the image blocks into an image feature encoder to extract image features, and converting the image features into a prompt image token;
and generating the input text for the large language model according to the system setting, the question text, the visual task type, the query image token and the prompt image token.
3. The large language model based industrial vision multiple downstream task processing method of claim 1, wherein the vision task types include:
image description task, object detection task, anomaly detection task, segmentation task, classification task, and counting task.
4. The large language model based industrial vision multi-downstream task processing method of claim 3, wherein before inputting the input text into the large language model, the method comprises the steps of:
creating annotation information for the training images to obtain a training data set, wherein the training data set comprises descriptive text annotations, segmentation annotations of salient objects, rotated rectangular box annotations, and abnormal region annotations for abnormal images;
Descriptive text labeling: describing each training image in detail, including the name, color, position, relative size and activity state of the objects appearing in the image;
Salient object segmentation labeling: each salient object in the training image is segmented and annotated, and the outline of each salient object is defined using pixel-level annotation, so that the model recognizes and understands the shape and spatial position of each independent object in the image;
Rotated rectangular box labeling: drawing a rotated rectangular box for each salient object in the image;
labeling an abnormal region: for an image containing abnormal conditions, marking an abnormal region, accurately defining the position of occurrence of the abnormality by using segmentation marks, and simultaneously providing description to indicate the nature of the abnormality and the reason of occurrence of the abnormality;
the training data set is input into the large language model for alternate task training.
5. The large language model based industrial vision multi-downstream task processing method of claim 4, wherein the data processing of the object detection task comprises the steps of:
normalizing the industrial query image coordinate values to be in the range of 0-99, wherein 0 represents the leftmost x or the uppermost y of the image, and 99 represents the rightmost x or the lowermost y of the image;
Labeling the position and shape information of each target object, wherein the labeling information of each target object comprises center point coordinates x, y, width w, height h and rotation angle theta;
center point coordinates x, y: representing the position of the center point of the target object in the image;
width w and height h: respectively representing the width and the height of the target object, and normalizing the size to be in the range of 0-99 to represent the proportion of the target object relative to the whole image size;
rotation angle θ: representing the rotation angle of the target object with respect to the horizontal line of the image, ranging from -180 degrees to 180 degrees, wherein a rotation angle of 0 represents no rotation, a positive value represents clockwise rotation, and a negative value represents counterclockwise rotation.
6. The large language model based industrial vision multi-downstream task processing method according to claim 4, wherein the data processing of the abnormality detection task includes the steps of:
Dividing the industrial query image into a 24x24 grid to obtain 576 image blocks, wherein each image block represents a small part of the image and its state is represented as 0 or 1, wherein 0 indicates that the image block does not contain a target object or abnormal region, and 1 indicates that the image block contains the target object or abnormal region;
for three consecutive image blocks in each row, all possible state combinations based on the three image blocks are encoded as one number from 0 to 7, each row being divided into 8 groups of three consecutive image blocks, so each row can be represented by 8 serial numbers between 0 and 7;
Serial number output: the abnormality detection result of the industrial query image is output as 24 serial numbers, and each serial number corresponds to one line in the image and represents the abnormality detection state of the industrial query image.
7. The large language model based industrial vision multi-downstream task processing method of claim 1, wherein inputting the input text into the large language model comprises the steps of:
inputting the input text into a word embedding model for word vector conversion to obtain a word embedding vector that the model can process:
T = f_embed(X),
where X is the text content, f_embed is the word embedding model, and T is the word embedding vector input to the large language model.
8. The large language model based industrial vision multiple downstream task processing method of claim 1, further comprising the step of, after processing the different vision tasks:
the large language model is fine-tuned using a low-rank adaptation (LoRA) fine-tuning strategy while the feature converter and image encoder are kept frozen during training.
9. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the large language model-based industrial vision multi-downstream task processing method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410710722.XA CN118279724B (en) | 2024-06-04 | 2024-06-04 | Industrial vision multi-downstream task processing method based on large language model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118279724A true CN118279724A (en) | 2024-07-02 |
CN118279724B CN118279724B (en) | 2024-10-08 |
Family
ID=91640519
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410710722.XA Active CN118279724B (en) | 2024-06-04 | 2024-06-04 | Industrial vision multi-downstream task processing method based on large language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118279724B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023179038A1 (en) * | 2022-03-24 | 2023-09-28 | 华为云计算技术有限公司 | Data labeling method, ai development platform, computing device cluster, and storage medium |
WO2024043934A1 (en) * | 2022-08-25 | 2024-02-29 | Siemens Corporation | Method for generating instructions using 3d model component relationship extraction |
CN116432026A (en) * | 2023-03-07 | 2023-07-14 | 阿里巴巴(中国)有限公司 | Visual language understanding task processing method and system |
CN116468725A (en) * | 2023-06-13 | 2023-07-21 | 北京航空航天大学杭州创新研究院 | Industrial defect detection method, device and storage medium based on pre-training model |
CN116935129A (en) * | 2023-07-25 | 2023-10-24 | 浙江大学计算机创新技术研究院 | Zero sample abnormal image detection method based on dynamic learning prompt |
CN117745680A (en) * | 2023-12-21 | 2024-03-22 | 元始智能科技(南通)有限公司 | Abnormality detection method and device based on large visual language model |
Non-Patent Citations (2)
Title |
---|
Yu Jun; Wang Liang; Yu Zhou: "Research on Visual Question Answering Technology" (视觉问答技术研究), Journal of Computer Research and Development (计算机研究与发展), no. 09, 15 September 2018
Ma Longlong; Han Xianpei; Sun Le: "A Survey of Image Captioning Methods" (图像的文本描述方法研究综述), Journal of Chinese Information Processing (中文信息学报), no. 04, 15 April 2018
Also Published As
Publication number | Publication date |
---|---|
CN118279724B (en) | 2024-10-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |