CN117745680A

CN117745680A - Abnormality detection method and device based on large visual language model

Info

Publication number: CN117745680A
Application number: CN202311782815.5A
Authority: CN
Inventors: 袁烨; 张永; 兰儒恺; 周晨; 金骏阳; 王茂霖
Original assignee: Yuanshi Intelligent Technology Nantong Co ltd
Current assignee: Yuanshi Intelligent Technology Nantong Co ltd
Priority date: 2023-12-21
Filing date: 2023-12-21
Publication date: 2024-03-22

Abstract

The invention provides an anomaly detection method and device based on a large visual language model, and relates to the technical field of artificial intelligence. The method comprises the following steps: inputting an industrial image to be detected into a target image coding module in a target visual language model to obtain multi-scale image features; inputting the multi-scale image features to a target feature fusion module in the target visual language model to obtain fusion features; and inputting the fusion features and the abnormality detection question text corresponding to the industrial image to be detected into a target language module in the target visual language model to obtain an abnormality detection answer text. The target visual language model is obtained by training based on simulated sample images generated by a Bezier curve generation algorithm, together with the abnormality detection question text, image description text, abnormality detection positioning label and abnormality detection answer label corresponding to each simulated sample image. The invention can effectively improve the efficiency and accuracy of anomaly detection even when abnormal samples are scarce.

Description

Abnormality detection method and device based on large visual language model

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to an anomaly detection method and device based on a large visual language model.

Background

Industrial anomaly detection is a technique for identifying and detecting anomalies in an industrial process. It can improve production efficiency and product quality, reduce production cost and help ensure safe production. How to detect industrial anomalies efficiently and accurately is therefore an urgent technical problem for industrial production.

In industrial anomaly detection, sufficient samples are required in order to detect and locate an anomalous region in an image of an industrial product. However, because abnormal samples in a factory are often scarce and defect types are difficult to predict, it is difficult to detect anomalies accurately and efficiently.

What is needed is a method and apparatus for anomaly detection based on a large visual language model to solve the above-mentioned problems.

Disclosure of Invention

The invention provides an anomaly detection method and device based on a large visual language model, which address the defect in the prior art that anomalies are difficult to detect accurately and efficiently because abnormal samples are scarce, and which improve the efficiency and accuracy of anomaly detection when abnormal samples are scarce.

The invention provides an anomaly detection method based on a large visual language model, which comprises the following steps:

inputting an industrial image to be detected into a target image coding module in a target visual language model to obtain multi-scale image characteristics of the industrial image to be detected;

inputting the multi-scale image features to a target feature fusion module in the target visual language model to obtain fusion features;

inputting the fusion characteristics and the abnormality detection question text corresponding to the industrial image to be detected to a target language module in the target visual language model to obtain an abnormality detection answer text corresponding to the abnormality detection question text;

the target visual language model is obtained by training based on simulated sample images with various defect types, an abnormality detection question text, an image description text, an abnormality detection positioning label and an abnormality detection answer label corresponding to each simulated sample image; each of the simulated sample images is a sample image generated based on various simulated defect morphologies and each of the normal samples; each simulated defect form is generated by simulating various different types of industrial defect forms based on a Bezier curve generation algorithm; the normal sample is a sample industrial image without a defective area.

According to the anomaly detection method based on the large visual language model, the target visual language model is obtained based on training of the following steps:

dividing each sample industrial image into a normal sample or an abnormal sample according to the defect mark of each sample industrial image;

counting the industrial defect forms corresponding to the abnormal samples to obtain industrial defect forms of various different defect types;

based on the Bezier curve generation algorithm, respectively simulating each industrial defect form to generate at least one simulated defect form corresponding to each industrial defect form;

fusing each simulated defect form with each normal sample to generate a simulated sample image corresponding to each simulated defect form;

generating an abnormality detection question text, an image description text, an abnormality detection positioning label and an abnormality detection answer label corresponding to each simulated sample image;

constructing a sample data set based on each of the simulated sample images, the abnormality detection question text, the image description text, the abnormality detection locating tag and the abnormality detection answer tag;

and performing iterative training on the initial visual language model based on the sample data set to obtain the target visual language model.

According to the abnormality detection method based on the large visual language model, the initial visual language model comprises an initial image coding module, an initial image decoding module, an initial text coding module, an initial feature extraction module, an initial feature fusion module and an initial language module;

The iterative training is performed on the initial visual language model based on the sample data set to obtain the target visual language model, and the iterative training comprises the following steps:

inputting each simulated sample image in the sample data set to the initial image coding module to obtain a first multi-scale image characteristic of each simulated sample image, inputting an image description text corresponding to each simulated sample image in the sample data set to the initial text coding module to obtain a first text characteristic corresponding to each simulated sample image, and inputting the first multi-scale image characteristic and the first text characteristic to the initial image decoding module to obtain an abnormality detection positioning prediction result corresponding to each simulated sample image;

performing iterative training on the initial image coding module, the initial image decoding module and the initial text coding module based on the abnormality detection positioning prediction result and the abnormality detection positioning label;

inputting each simulated sample image to a trained initial image coding module to obtain second multi-scale image features of each simulated sample image, inputting image description text corresponding to each simulated sample image to the trained initial text coding module to obtain second text features corresponding to each simulated sample image, and inputting the second multi-scale image features and the second text features to a feature alignment layer of a trained initial image decoding module to obtain first fusion features corresponding to each simulated sample image;

Inputting the first fusion feature to an initial feature extraction module to obtain a second fusion feature, and inputting the second fusion feature and the second multi-scale image feature to the initial feature fusion module to obtain a target fusion feature corresponding to each simulated sample image;

inputting the target fusion characteristics and the abnormality detection question text corresponding to each simulation sample image to the initial language module to obtain an abnormality detection answer prediction result corresponding to each simulation sample image;

performing iterative training on the initial feature extraction module and the initial feature fusion module based on the abnormal detection answer prediction result and the abnormal detection answer label;

the target image coding module is built based on the trained initial image coding module, the target feature fusion module is built based on the trained initial feature fusion module, and the target language module is built based on the initial language module.

According to the anomaly detection method based on the large visual language model, the initial image decoding module comprises a feature alignment layer and an anomaly positioning layer, wherein the feature alignment layer comprises a feature pyramid layer and a first feature fusion layer;

Inputting the first multi-scale image feature and the first text feature to the initial image decoding module to obtain an anomaly detection positioning prediction result corresponding to each simulated sample image, including:

inputting the first multi-scale image features to the feature pyramid layer to obtain multi-scale fusion image features;

inputting the multi-scale fusion image features and the first text features into the first feature fusion layer to obtain third fusion features;

and inputting the third fusion characteristic to the abnormal positioning layer to obtain the abnormal detection positioning prediction result.

According to the abnormality detection method based on the large visual language model, the initial feature fusion module comprises a second feature fusion layer and a self-attention layer;

inputting the second fusion feature and the second multi-scale image feature to the initial feature fusion module to obtain target fusion features corresponding to the simulated sample images, wherein the target fusion features comprise:

inputting the second fusion feature and the second multi-scale image feature into the second feature fusion layer to obtain a fourth fusion feature;

and inputting the fourth fusion feature into the self-attention layer to obtain the target fusion feature.

According to the anomaly detection method based on the large visual language model provided by the invention, the iterative training is carried out on the initial visual language model based on the sample data set to obtain the target visual language model, and the anomaly detection method comprises the following steps:

performing data preprocessing on each simulated sample image in the sample data set;

performing iterative training on the initial visual language model based on the preprocessed sample data set to obtain the target visual language model;

wherein the data preprocessing includes at least one of contrast enhancement processing, edge sharpening processing, smoothing filtering processing, and normalization processing.
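As a rough illustration of two of the preprocessing operations listed above, the following pure-Python sketch applies a 3x3 mean filter (smoothing) followed by min-max normalization, which also acts as a simple linear contrast stretch. The function names, the nested-list grayscale image format and the order of operations are assumptions made for this example; the source does not specify them.

```python
from typing import List

Image = List[List[float]]  # assumed format: grayscale image as nested lists


def normalize(image: Image) -> Image:
    """Min-max normalization to [0, 1]; a constant image maps to all zeros."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in image]
    return [[(p - lo) / span for p in row] for row in image]


def smooth(image: Image) -> Image:
    """3x3 mean filter; border pixels average over the neighbors that exist."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [image[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


def preprocess(image: Image) -> Image:
    """Hypothetical pipeline: smooth first, then normalize."""
    return normalize(smooth(image))
```

Edge sharpening and other contrast enhancement schemes would slot into `preprocess` in the same way.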

According to the anomaly detection method based on the large visual language model provided by the invention, the initial visual language model is iteratively trained based on the preprocessed sample data set to obtain the target visual language model, and the anomaly detection method comprises the following steps:

performing data enhancement on each simulated sample image in the preprocessed sample data set;

performing iterative training on the initial visual language model based on the sample data set after data enhancement to obtain the target visual language model;

wherein the data enhancement includes at least one of random rotation, random flipping, random scaling, and random cropping.
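The data enhancement operations above can be sketched as follows. This is a simplified illustration, not the disclosed implementation: rotation is restricted to multiples of 90 degrees, random scaling is omitted, and the helper names are hypothetical. The one design point it is meant to show is that the same random transform should be applied to the image and to its positioning label, so that the label stays aligned with the defect.

```python
import random
from typing import List, Tuple

Grid = List[List[float]]  # assumed format for both image and label mask


def hflip(grid: Grid) -> Grid:
    """Mirror left to right."""
    return [row[::-1] for row in grid]


def rot90(grid: Grid) -> Grid:
    """Rotate 90 degrees clockwise."""
    return [list(col) for col in zip(*grid[::-1])]


def crop(grid: Grid, top: int, left: int, size: int) -> Grid:
    """Cut a size x size window starting at (top, left)."""
    return [row[left:left + size] for row in grid[top:top + size]]


def augment(image: Grid, mask: Grid, rng: random.Random,
            crop_size: int) -> Tuple[Grid, Grid]:
    """Apply one shared random flip, rotation and crop to image and mask."""
    if rng.random() < 0.5:
        image, mask = hflip(image), hflip(mask)
    for _ in range(rng.randrange(4)):
        image, mask = rot90(image), rot90(mask)
    top = rng.randrange(len(image) - crop_size + 1)
    left = rng.randrange(len(image[0]) - crop_size + 1)
    return crop(image, top, left, crop_size), crop(mask, top, left, crop_size)
```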

The invention also provides an abnormality detection device based on the large visual language model, which comprises:

the feature extraction unit is used for inputting the industrial image to be detected into a target image coding module in a target visual language model to obtain multi-scale image features of the industrial image to be detected;

the feature fusion unit is used for inputting the multi-scale image features to a target feature fusion module in the target visual language model to obtain fusion features;

the detection unit is used for inputting the fusion features and the abnormality detection question text corresponding to the industrial image to be detected into a target language module in the target visual language model to obtain an abnormality detection answer text corresponding to the abnormality detection question text.

The invention also provides electronic equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the abnormality detection method based on the large visual language model when executing the program.

The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the large visual language model-based anomaly detection method as described in any one of the above.

The present invention also provides a computer program product comprising a computer program which when executed by a processor implements the large visual language model-based anomaly detection method as described in any one of the above.

According to the anomaly detection method and device based on the large visual language model provided by the invention, on the one hand, industrial defect forms of various types are simulated through the Bezier curve generation algorithm to generate various simulated defect forms, the simulated defect forms are combined with the normal samples to obtain rich simulated sample images of various defect types, and the initial visual language model is then trained on these simulated sample images, so that the efficiency and accuracy of anomaly detection can be effectively improved even when abnormal samples are scarce. On the other hand, the target visual language model performs multi-scale image feature extraction and feature fusion on the industrial image to be detected, and anomaly detection is performed based on the fusion features and the abnormality detection question text, so that the image features of industrial defects are integrated into the large visual language model and the method can be applied quickly and accurately to anomaly detection in various industrial scenes, further improving the efficiency and accuracy of anomaly detection.

Drawings

In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of an anomaly detection method based on a large visual language model provided by the invention;

FIG. 2 is a second schematic flow chart of the anomaly detection method based on the large visual language model provided by the invention;

FIG. 3 is a schematic diagram of the structure of an initial visual language model provided by the present invention;

FIG. 4 is a schematic diagram of a simulation of an anomaly detection method based on a large visual language model provided by the invention;

FIG. 5 is a schematic diagram of the structure of the abnormality detection device based on the large visual language model provided by the invention;

FIG. 6 is a schematic structural diagram of the electronic device provided by the invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The abnormality detection method based on the large visual language model of the present invention is described below with reference to fig. 1 to 4.

Fig. 1 is a schematic diagram of a large visual language model-based anomaly detection method according to one embodiment; as shown in fig. 1, the method includes:

step 101, inputting an industrial image to be detected into a target image coding module in a target visual language model to obtain multi-scale image characteristics of the industrial image to be detected; the target visual language model is obtained by training based on simulated sample images with various defect types, an abnormality detection question text, an image description text, an abnormality detection positioning label and an abnormality detection answer label corresponding to each simulated sample image; each of the simulated sample images is a sample image generated based on various simulated defect morphologies and each of the normal samples; each simulated defect form is generated by simulating various different types of industrial defect forms based on a Bezier curve generation algorithm; the normal sample is a sample industrial image without a defect area;

The industrial image to be detected is an image of an industrial part, such as a screw, a gear, a cable or a steel material, on which abnormality detection is to be performed; the image may be downloaded from the internet and input by a user, or shared and input through an image capturing device, which is not particularly limited in this embodiment.

The target visual language model may include a target image encoding module, a target feature fusion module, and a target language module; the target image coding module is used for extracting features of different scales from the image; the target feature fusion module is used for fusing the multi-scale features; the target language module is used for outputting corresponding abnormality detection answers based on the fusion characteristics and the abnormality detection question text.

Optionally, prior to performing step 101, a target visual language model needs to be trained in advance; the model can be obtained by training based on simulated sample images with various defect types, anomaly detection question texts corresponding to the simulated sample images, image description texts, anomaly detection positioning labels and anomaly detection answer labels, and the specific training steps comprise:

First, a small number of sample industrial images with different defect types, namely abnormal samples, and a large number of sample industrial images without defect areas, namely normal samples, are collected. Here, a small number means that the number of abnormal samples is smaller than a target value configured according to the number of defect types, for example a target value equal to the number of defect types or within a preset multiple of the number of defect types;

Secondly, counting industrial defect forms of different defect types contained in the abnormal sample; simulating various different types of industrial defect forms based on a Bezier curve generation algorithm to generate simulated defect forms; and combining each simulated defect form with each normal sample to obtain simulated sample images of various defect types.

Then, taking simulated sample images of various defect types, corresponding abnormality detection question texts and corresponding image description texts as samples, and constructing a sample data set by taking abnormality detection positioning labels and abnormality detection answer labels as labels;

then, the initial visual language model may be optimally trained based on the sample data set, so as to construct a target visual language model capable of precisely outputting an abnormality detection answer text containing an abnormality detection result based on the industrial image and the corresponding abnormality detection question text according to the trained initial visual language model.

Here, the optimization training may train all initial modules in the initial visual language model as a whole based on the sample data set, for example by jointly and iteratively training the initial image encoding module, the initial image decoding module, the initial text encoding module, the initial feature extraction module, the initial feature fusion module and the initial language module based on a total loss function of the initial visual language model. Alternatively, the modules may be trained in stages by freezing and fine-tuning, for example by first fixing the parameters of one part of the modules while optimizing the parameters of the other part, and then fixing the trained part while optimizing the parameters of the previously fixed part.
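The staged freeze-and-fine-tune option can be pictured as a small schedule that marks which modules are trainable in each stage. The sketch below follows the two training stages described in the disclosure (localization training of the image encoding, image decoding and text encoding modules, then answer training of the feature extraction and feature fusion modules, with the language module left untrained); the identifiers and stage names are hypothetical.

```python
from typing import Dict

MODULES = ("image_encoder", "image_decoder", "text_encoder",
           "feature_extractor", "feature_fusion", "language_module")

# Modules whose parameters are optimized in each stage; all others are frozen.
STAGES = {
    "stage_1_localization": {"image_encoder", "image_decoder", "text_encoder"},
    "stage_2_answering": {"feature_extractor", "feature_fusion"},
}


def trainable_flags(stage: str) -> Dict[str, bool]:
    """Return a module -> trainable flag mapping for the given stage."""
    trainable = STAGES[stage]
    return {name: name in trainable for name in MODULES}


stage_1 = trainable_flags("stage_1_localization")
stage_2 = trainable_flags("stage_2_answering")
```

In a deep learning framework, these flags would typically be used to switch gradient computation on or off for each module's parameters before each stage begins.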

Then, after the target visual language model is obtained through iterative training, the industrial image to be detected may be input directly into the target image coding module of the target visual language model for multi-scale image feature extraction, so as to obtain the multi-scale image features of the industrial image to be detected. Alternatively, the industrial image to be detected may first be preprocessed, including but not limited to scale normalization, image alignment, filtering and noise reduction, and the processed image is then input to the target image coding module for multi-scale image feature extraction; this embodiment does not limit this. The target image coding module may be a feature coding layer that extracts features at several different scales, and its specific structure can be configured according to actual requirements; for example, it may be built on the image encoder of the large visual language model PandaGPT.

Step 102, inputting the multi-scale image features to a target feature fusion module in the target visual language model to obtain fusion features;

optionally, after the multi-scale image features are obtained in step 101, the multi-scale image features may be input to a target feature fusion module, and at least one feature fusion is performed on the multi-scale image features to obtain corresponding fusion features;

The target feature fusion module can comprise one or more network layers for feature fusion, so that one or more times of feature fusion of the multi-scale image features is realized; for example, the target feature fusion module may include a feature fusion layer and a self-attention layer, so that the feature fusion layer performs primary feature fusion on the multi-scale image features and then inputs the primary feature fusion to the self-attention layer, and the self-attention layer performs secondary feature fusion to obtain corresponding fusion features.
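To make the two-stage fusion concrete, the sketch below treats each scale's pooled feature vector as one token (a stand-in for the feature fusion layer) and then mixes the tokens with single-head scaled dot-product self-attention. It is a toy under strong assumptions: the projections are identity matrices rather than learned weights, all scales are assumed to have been pooled to the same dimension, and the function names are illustrative.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product attention with identity projections."""
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        out.append([sum(w * value[i] for w, value in zip(weights, tokens))
                    for i in range(dim)])
    return out


def fuse_scales(multi_scale: List[Vector]) -> List[Vector]:
    """Stack per-scale vectors as tokens, then let self-attention mix them."""
    if len({len(v) for v in multi_scale}) != 1:
        raise ValueError("all scales must share one feature dimension")
    return self_attention(multi_scale)


fused = fuse_scales([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Each output token is a weighted average of all input tokens, so information from every scale reaches every position of the fused feature.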

Step 103, inputting the fusion characteristics and the abnormality detection question text corresponding to the industrial image to be detected to a target language module in the target visual language model to obtain an abnormality detection answer text corresponding to the abnormality detection question text;

the abnormality detection question text here is a question text configured to request acquisition of an abnormality detection result of an industrial image to be detected, such as a question text of "whether or not there is an abnormal region in this picture" or "what is the abnormality type of this picture" or the like.

The target language module here includes an LLM (Large Language Model) that is pre-trained on a large amount of question-answer data from various domains on the internet.

Optionally, after the fusion features are obtained in step 102, the fusion features and the abnormality detection question text corresponding to the industrial image to be detected may be input to the target language module, and the target language module outputs the corresponding abnormality detection answer text based on the fusion features and the abnormality detection question text. For example, if the question text is "whether there is an abnormal region in this picture", the corresponding answer text may be "yes, there is an abnormal region in the upper left of the picture"; as another example, if the question text is "what is the abnormality type of this picture", the corresponding answer text may be "there is an abnormal knock on the gear surface in the picture".
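Steps 101 to 103 form a simple three-stage pipeline, which the following sketch makes explicit. The `detect` function only wires the stages together; `toy_encode`, `toy_fuse` and `toy_answer` are deliberately trivial stand-ins invented for this example and bear no relation to the trained modules of the target visual language model.

```python
from typing import Callable, List

Image = List[List[float]]
Features = List[float]


def detect(image: Image, question: str,
           encode: Callable[[Image], List[Features]],
           fuse: Callable[[List[Features]], Features],
           answer: Callable[[Features, str], str]) -> str:
    """Run steps 101-103: encode, fuse, then answer the question."""
    multi_scale = encode(image)      # step 101: multi-scale image features
    fused = fuse(multi_scale)        # step 102: fusion features
    return answer(fused, question)   # step 103: answer text


# Toy stand-ins for the trained modules, for illustration only.
def toy_encode(image: Image) -> List[Features]:
    flat = [p for row in image for p in row]
    return [[sum(flat) / len(flat)], [min(flat), max(flat)]]


def toy_fuse(multi_scale: List[Features]) -> Features:
    return [v for scale in multi_scale for v in scale]


def toy_answer(fused: Features, question: str) -> str:
    spread = fused[2] - fused[1]     # max minus min intensity
    return "yes" if spread > 0.5 else "no"


reply = detect([[0.9, 0.9], [0.9, 0.1]],
               "whether there is an abnormal region in this picture",
               toy_encode, toy_fuse, toy_answer)
```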

According to the method provided by this embodiment, on the one hand, industrial defect forms of various types are simulated through the Bezier curve generation algorithm to generate various simulated defect forms, the simulated defect forms are combined with the normal samples to obtain rich simulated sample images of various defect types, and the initial visual language model is then trained on these simulated sample images, so that the efficiency and accuracy of anomaly detection can be effectively improved even when abnormal samples are scarce. On the other hand, the target visual language model performs multi-scale image feature extraction and feature fusion on the industrial image to be detected, and anomaly detection is performed based on the fusion features and the abnormality detection question text, so that the image features of industrial defects are integrated into the large visual language model and the method can be applied quickly and accurately to anomaly detection in various industrial scenes, further improving the efficiency and accuracy of anomaly detection.

In some embodiments, the target visual language model is trained based on the following steps:

FIG. 2 is a second flow chart of the abnormality detection method based on the large visual language model according to the present embodiment; as shown in fig. 2, the target visual language model is trained based on the following steps:

A data acquisition step: acquiring various sample industrial images in various industrial scenes. The sample industrial images include, but are not limited to, images of screws, gears, cables, steel, etc.

It should be noted that, the industrial images of the various samples include a large number of normal samples and a small number of abnormal samples, that is, the number of the abnormal samples is far smaller than that of the normal samples; for this purpose, in order to enrich the number of abnormal samples, it may be determined whether a defective area exists in each sample industrial image according to a defect flag of each sample industrial image, thereby performing sample type division on each sample industrial image to divide each sample industrial image into a normal sample or an abnormal sample; the normal sample is an industrial image without a defective area, and the abnormal sample is an industrial image with at least one defective area.

Sample data set construction step: the industrial defect forms corresponding to all the abnormal samples are counted to obtain industrial defect forms of various defect types, which provides expert experience for the subsequent training of the target visual language model, helps build a more comprehensive target visual language model, and improves the accuracy of detecting potential defects and faults. The industrial defect forms include, but are not limited to, characteristics such as the shape and length of industrial defects, for example elongated crack defect forms, circular spot defect forms and irregular breakage defect forms.

In addition, in order to generate simulated sample images, a Bezier curve can be used to simulate each industrial defect form so as to generate at least one simulated defect form corresponding to each industrial defect form, such as simulated cracks, pits and spots. After each simulated defect form is obtained, it can be randomly filled with preset prior color information and noise data and then fused with each normal sample to obtain a simulated sample image corresponding to each simulated defect form. Because the simulated sample images are highly random, the generalization of the model is greatly improved. The fusion may be performed by pasting, overlaying or masking the simulated defect form onto a normal sample.
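The fill-and-fuse step can be sketched as a masked paste: pixels inside the simulated defect form receive a prior color plus random noise, and all other pixels are copied from the normal sample. The grayscale nested-list format, the function name `fuse_defect` and the numeric values are assumptions chosen for this illustration.

```python
import random
from typing import List

Image = List[List[float]]  # assumed: grayscale image, values in [0, 1]
Mask = List[List[int]]     # 1 inside the simulated defect form, 0 elsewhere


def fuse_defect(normal: Image, mask: Mask, color: float,
                noise: float, rng: random.Random) -> Image:
    """Fill the masked region with a prior color plus noise; copy the rest."""
    fused = []
    for img_row, mask_row in zip(normal, mask):
        row = []
        for pixel, inside in zip(img_row, mask_row):
            if inside:
                value = color + rng.uniform(-noise, noise)
                pixel = min(1.0, max(0.0, value))  # clamp to the valid range
            row.append(pixel)
        fused.append(row)
    return fused


rng = random.Random(0)
normal = [[0.8] * 4 for _ in range(4)]
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
simulated = fuse_defect(normal, mask, color=0.2, noise=0.05, rng=rng)
```

A convenient side effect of this construction is that the mask used for pasting can double as a pixel-level positioning label for the simulated sample image.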

The formula of the Bezier curve generation algorithm is as follows:

$$B(t) = (1-t)^2 P_0 + 2t(1-t) P_1 + t^2 P_2, \quad t \in [0, 1];$$

wherein $B(t)$ is the simulated defect morphology data; $P_0$, $P_1$ and $P_2$ are the starting point, the control point and the ending point of the curve, respectively, and can be determined according to the shapes of the various industrial defects; $t$ is a random parametric variable.
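As a minimal sketch of the quadratic Bezier formula, the following Python evaluates the curve for given start, control and end points. The function names and example coordinates are illustrative assumptions, and the curve is sampled at evenly spaced values of t for reproducibility, whereas the text describes t as a random variable.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def quadratic_bezier(p0: Point, p1: Point, p2: Point, t: float) -> Point:
    """Evaluate B(t) = (1-t)^2 P0 + 2t(1-t) P1 + t^2 P2 for t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    a, b, c = (1 - t) ** 2, 2 * t * (1 - t), t ** 2
    return (a * p0[0] + b * p1[0] + c * p2[0],
            a * p0[1] + b * p1[1] + c * p2[1])


def sample_curve(p0: Point, p1: Point, p2: Point, n: int = 20) -> List[Point]:
    """Sample n + 1 evenly spaced points along the curve, endpoints included."""
    return [quadratic_bezier(p0, p1, p2, i / n) for i in range(n + 1)]


# Hypothetical crack-like outline: start, control and end points in pixels.
curve = sample_curve((10.0, 10.0), (40.0, 60.0), (90.0, 20.0))
```

Moving the control point changes how sharply the curve bends, which is one way different defect shapes such as elongated cracks or rounded spots could be approximated.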

An abnormality detection question text, an image description text, an abnormality detection positioning label and an abnormality detection answer label are then generated for each simulated sample image, so that the initial visual language model can be fine-tuned and given expert knowledge.

The image description text describes the target in the image and the defect attributes of the target, such as "this is an image of a steel sheet, which is silvery and has no broken, scratched or knocked portion"; the anomaly detection answer label provides the anomaly detection result for the target in the image, including but not limited to whether an abnormal region exists and the category and number of abnormal regions, for example "yes, there is an abnormal knock in the middle region of the input image";

in order to accurately determine the position of the defect, the anomaly detection answer label includes the anomaly detection results of all regions in the image; for example, the image may be divided into a 3×3 grid whose cells are named upper left, upper, upper right, left, middle, right, lower left, lower and lower right, so that the descriptive content can be generated from the positions of the abnormal regions in the anomaly detection answer label.
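The 3×3 grid naming can be sketched as a lookup from defect pixel coordinates to cell names. The nine cell names, the binary mask format and the answer wording below are assumptions for illustration; the source only indicates that region names of this kind are used to describe defect positions.

```python
from typing import List

# Assumed names for the nine grid cells, row by row.
NAMES = (("upper left", "upper", "upper right"),
         ("left", "middle", "right"),
         ("lower left", "lower", "lower right"))


def defect_regions(mask: List[List[int]]) -> List[str]:
    """Return names of grid cells that contain at least one defect pixel."""
    height, width = len(mask), len(mask[0])
    hit = set()
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                hit.add((y * 3 // height, x * 3 // width))
    return [NAMES[r][c] for r, c in sorted(hit)]


def answer_label(mask: List[List[int]]) -> str:
    """Build a hypothetical answer label text from the defect positions."""
    regions = defect_regions(mask)
    if not regions:
        return "No, there is no abnormal region in the picture."
    return ("Yes, there is an abnormal region in the "
            + ", ".join(regions) + " of the picture.")


mask = [[0] * 6 for _ in range(6)]
mask[0][1] = 1   # falls in the upper-left cell
mask[3][3] = 1   # falls in the middle cell
label = answer_label(mask)
```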

Taking simulated sample images of various defect types, corresponding abnormality detection question texts and corresponding image description texts as samples, and constructing a sample data set by taking abnormality detection positioning labels and abnormality detection answer labels as labels;
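One way to picture an entry of the sample data set is as a record holding the three inputs and the two labels. The field names and the use of a pixel-level mask as the positioning label are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Sample:
    image: List[List[float]]             # simulated sample image
    question_text: str                   # abnormality detection question text
    description_text: str                # image description text
    positioning_label: List[List[int]]   # assumed pixel-level defect mask
    answer_label: str                    # abnormality detection answer label


def build_dataset(images: Sequence, questions: Sequence, descriptions: Sequence,
                  masks: Sequence, answers: Sequence) -> List[Sample]:
    """Zip parallel sequences into samples; all must have the same length."""
    lengths = {len(images), len(questions), len(descriptions),
               len(masks), len(answers)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same length")
    return [Sample(*fields)
            for fields in zip(images, questions, descriptions, masks, answers)]


dataset = build_dataset(
    images=[[[0.8, 0.2], [0.8, 0.8]]],
    questions=["whether there is an abnormal region in this picture"],
    descriptions=["this is an image of a steel sheet"],
    masks=[[[0, 1], [0, 0]]],
    answers=["yes, there is an abnormal region in the upper right"],
)
```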

Sample data set processing step: the sample data set obtained in the sample data set construction step can be used directly as the final training data set for optimization training of the initial visual language model; alternatively, one or more kinds of processing may be applied to the sample data set, and the initial visual language model is then trained on the processed sample data set. The processing includes, but is not limited to, data preprocessing and data enhancement.

Model construction: constructing an initial visual language model based on an initial image encoding module, an initial image decoding module, an initial text encoding module, an initial feature extraction module, an initial feature fusion module and an initial language module;

model training: the initial visual language model constructed in the model construction step is iteratively trained based on the samples and labels in the sample data set processed in the sample data set processing step, so as to obtain a target visual language model capable of efficiently and accurately outputting the corresponding abnormality detection answer text based on an industrial image and the corresponding abnormality detection question text.

According to the method provided by this embodiment, various types of industrial defect forms are simulated by the Bezier curve generation algorithm to generate various simulated defect forms, and these simulated defect forms are combined with various normal samples to obtain rich simulated sample images covering many different defect types. The initial visual language model is then trained on these simulated sample images, so that even when abnormal samples are scarce, a target visual language model that performs anomaly detection efficiently and accurately on industrial images of different scenes and their corresponding abnormality detection question texts can be trained from positive sample data and expert knowledge alone. Operators can therefore be helped to quickly locate abnormal regions and find the causes of abnormalities through multi-round dialogue.

In some implementations, the initial visual language model includes an initial image encoding module, an initial image decoding module, an initial text encoding module, an initial feature extraction module, an initial feature fusion module, and an initial language module;

FIG. 3 is a schematic diagram of the initial visual language model according to the present embodiment; as shown in fig. 3, the initial visual language model includes an initial image encoding module, an initial image decoding module, an initial text encoding module, an initial feature extraction module, an initial feature fusion module, and an initial language module; the output end of the initial image encoding module and the output end of the initial text encoding module are each connected to the input end of the initial image decoding module; the output end of the initial image decoding module is connected to the input end of the initial feature extraction module; the output end of the initial feature extraction module and the output end of the initial image encoding module are each connected to the input end of the initial feature fusion module; the output end of the initial feature fusion module is connected to the input end of the initial language module. The so-called initial feature extraction module may be built on ResNet34 (Residual Network 34, a 34-layer residual network).

Optionally, the training of the initial visual language model may be divided into three freeze-and-fine-tune training stages, so that the target visual language model obtained by training acquires prior knowledge; the training includes:

the first stage, fixing parameters of an initial image encoding module, parameters of an initial feature extraction module, parameters of an initial feature fusion module and parameters of an initial language module, and performing overall training on an initial image decoding module and an initial text encoding module, wherein the method specifically comprises the following steps:

inputting each simulated sample image into an initial image coding module, extracting multi-scale characteristics of each simulated sample image to obtain first multi-scale image characteristics of each simulated sample image, inputting image description texts corresponding to each simulated sample image into an initial text coding module, carrying out text characteristic coding to obtain first text characteristics corresponding to each simulated sample image, inputting the first multi-scale image characteristics and the first text characteristics into an initial image decoding module, carrying out abnormality detection positioning after the image characteristics and the text characteristics are aligned, and obtaining abnormality detection positioning prediction results corresponding to each simulated sample image.

In some implementations, the step of the initial image decoding module performing anomaly detection localization further includes:

As shown in fig. 3, the initial image decoding module includes a feature alignment layer and an anomaly localization layer; the feature alignment layer comprises a feature pyramid layer and a first feature fusion layer and is used for performing alignment operation on text features and image features to obtain corresponding fusion features; the abnormal positioning layer is used for carrying out abnormal positioning according to the fusion characteristics so as to obtain corresponding abnormal detection positioning prediction results.

Optionally, the positioning prediction step of the initial image decoding module includes:

firstly, based on a linear layer in the feature pyramid layer, linear processing is performed on the image features of each layer output by the initial image encoding module to obtain the fused image feature corresponding to each layer, that is, the multi-scale fused image features: for the first-layer image feature output by the initial image encoding module, linear processing is performed on it by the linear layer to obtain the fused image feature corresponding to the first layer; for each image feature of a layer other than the first, that image feature is first fused with the image feature of its preceding layer, and the fused result is then linearly processed by the linear layer to obtain the fused image feature corresponding to that layer.

Then, based on the first feature fusion layer, multi-level feature fusion is performed on the multi-scale fused image features and the first text features to obtain a third fusion feature: the fused image feature corresponding to the first-layer image feature is matrix-multiplied with the first text features to obtain the alignment feature corresponding to the first layer; for the fused image feature of each layer other than the first, the alignment feature corresponding to the preceding layer is fused with the fused image feature of that layer to obtain the alignment feature corresponding to that layer; and the alignment feature corresponding to the last layer is taken as the third fusion feature.

And then, inputting the third fusion characteristic into an abnormality positioning layer for abnormality positioning to obtain a corresponding abnormality detection positioning prediction result.

And then, determining an abnormality detection positioning loss function based on the abnormality detection positioning prediction result and the abnormality detection positioning label, and performing initial iterative training on the initial image decoding module and the initial text encoding module based on the abnormality detection positioning loss function.

In the second stage, the parameters of the trained initial text encoding module, the initial feature extraction module, the initial feature fusion module and the initial language module are fixed, and the initially trained initial image decoding module and the initial image encoding module are trained as a whole, which specifically includes the following steps:

referring to the training steps of the first stage, each simulated sample image is input into the initial image encoding module, the image description text corresponding to each simulated sample image is input into the trained initial text encoding module, and the multi-scale image features output by the initial image encoding module and the text features output by the trained initial text encoding module are input into the trained initial image decoding module to obtain the corresponding abnormality detection positioning prediction result; an abnormality detection positioning loss function is then determined based on that prediction result and the abnormality detection positioning label, and the trained initial image decoding module and the initial image encoding module are iteratively trained based on the abnormality detection positioning loss function.

Thus, the initial text encoding module after the first-stage training is used as the final trained initial text encoding module, and the initial image encoding module and the initial image decoding module after the second-stage training are used as the final trained initial image encoding module and the final trained initial image decoding module, respectively.

In some embodiments, for the abnormality detection positioning prediction results and abnormality detection positioning labels in the first and second stages, the abnormality detection positioning loss function may be determined as a weighted sum of a classification loss function and a regression loss function, so that both the classification loss and the regression loss are calculated accurately.

Wherein the classification loss function is constructed based on the varifocal loss function VFL(p, q):

$$\mathrm{VFL}(p, q) = \begin{cases} -q\left(q\log p + (1-q)\log(1-p)\right), & q > 0 \\ -\alpha p^{\gamma}\log(1-p), & q = 0 \end{cases}$$

where α is the loss weight for the foreground and background; p^γ is the weight corresponding to different simulated sample images; p is the predicted IACS (IoU-Aware Classification Score) of the foreground and background; and q is the true IoU (Intersection over Union) score of the foreground and background.
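As an illustration of how such a classification loss behaves, the sketch below evaluates the varifocal loss for a single prediction, using the commonly published form of the loss (IoU-weighted binary cross-entropy when q > 0, and a term down-weighted by α·p^γ when q = 0). The default values alpha=0.75 and gamma=2.0 and the clamping constant eps are assumptions for the example; the embodiment does not state them.

```python
import math


def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-12):
    """Varifocal loss for one prediction.

    p: predicted IoU-aware classification score in [0, 1].
    q: target IoU score (q > 0 for foreground, q == 0 for background).
    alpha, gamma: assumed hyperparameter values, not taken from the source.
    """
    # Clamp p away from 0 and 1 so the logarithms stay finite.
    p = min(max(p, eps), 1.0 - eps)
    if q > 0:
        # Foreground: binary cross-entropy against the soft target q, weighted by q.
        return -q * (q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    # Background: contribution scaled down by alpha * p**gamma.
    return -alpha * (p ** gamma) * math.log(1.0 - p)
```

In practice the per-pixel or per-location values would be summed or averaged over the image; that reduction is left out here.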

The regression loss function L_dice is a multi-class loss function built on the Dice coefficient, a measure of the similarity of two samples whose value ranges from 0 to 1, with a larger value indicating greater similarity. The calculation formula is as follows:

$$L_{dice} = 1 - \frac{2\sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \hat{y}_i}$$

where y_i and ŷ_i respectively represent the labeled abnormality detection positioning label of pixel i and the abnormality detection positioning prediction result of pixel i output by the initial image decoding module; N = H×W is the total number of pixels in the simulated sample image; and H and W are the height and width of the simulated sample image, respectively.
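A minimal sketch of a Dice-based loss over the N = H×W pixels is given below. It assumes the widely used form 1 − 2·Σ(y_i·ŷ_i) / (Σy_i + Σŷ_i) with a small smoothing constant eps; both the exact form and eps are assumptions for illustration, since other variants (for example with squared terms in the denominator) are also in common use.

```python
def dice_loss(labels, preds, eps=1e-6):
    """Dice loss between flattened pixel labels and predictions.

    labels: list of ground-truth values y_i (0 or 1), length N = H * W.
    preds: list of predicted values in [0, 1], same length.
    eps: assumed smoothing constant that avoids division by zero.
    """
    intersection = sum(y * p for y, p in zip(labels, preds))
    total = sum(labels) + sum(preds)
    # Dice coefficient is 1 for identical masks, so the loss is 0 in that case.
    return 1.0 - (2.0 * intersection + eps) / (total + eps)
```

With this form, a perfect prediction gives a loss of 0 and a completely disjoint prediction gives a loss close to 1.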

The anomaly detection locating loss function L is determined by weighting, adding and fusing based on the classification loss function and the regression loss function:

$$L = \alpha\,\mathrm{VFL} + \beta L_{dice}$$

where α and β are the weight coefficients of the classification loss function and the regression loss function, respectively, and may be set according to actual requirements; for example, α and β are both set to 1.

It should be noted that, for the training processes of the first stage and the second stage, the learning rate decay strategy may combine warmup (a learning rate warm-up strategy) with CosineAnnealing (a learning rate annealing strategy). The learning rate lr(T) based on the warmup function is calculated as follows:

$$lr(T) = lr_{max} \cdot \frac{T}{T_{warmup}}$$

where T_warmup is the maximum number of warmup iterations, T is the current iteration number, and lr_max is the maximum learning rate. This strategy helps to mitigate premature overfitting of the model to the mini-batches in the initial stage and keeps the distribution stable.

The learning rate lr(T) based on CosineAnnealing is calculated as follows:

$$lr(T) = lr_{min} + \frac{1}{2}\left(lr_{max} - lr_{min}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)$$

where lr_min is the minimum learning rate, lr_max is the maximum learning rate, T_cur represents the current iteration number, and T_max represents the maximum number of iterations. This strategy helps the model jump out of local minima and find a path toward the global minimum.

The so-called combination of warmup and CosineAnnealing may be as follows: in the early training period, that is, when the iteration number is smaller than a preset number of rounds, the learning rate is gradually increased by the warmup function in a linearly or exponentially increasing manner; in the later training period, that is, when the iteration number is larger than the preset number of rounds, cosine annealing is applied to the learning rate by the CosineAnnealing function, so as to improve the performance and stability of the model.
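The combined schedule can be sketched as a single function of the iteration number. The sketch assumes a linear warmup and assumes that the cosine phase counts its iterations from the end of the warmup phase; the function and parameter names are illustrative, not taken from the embodiment.

```python
import math


def learning_rate(t, warmup_iters, max_iters, lr_max, lr_min=0.0):
    """Learning rate at iteration t: linear warmup, then cosine annealing.

    warmup_iters: preset number of warmup iterations (assumed linear increase).
    max_iters: total number of training iterations.
    """
    if t < warmup_iters:
        # Warmup phase: rise linearly from 0 to lr_max.
        return lr_max * t / warmup_iters
    # Annealing phase: T_cur and T_max are measured from the end of warmup.
    t_cur = t - warmup_iters
    t_max = max(max_iters - warmup_iters, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t_cur / t_max))
```

For example, with warmup_iters=10, max_iters=110 and lr_max=0.1, the rate climbs to 0.1 at iteration 10 and then decays along a half cosine toward lr_min at iteration 110.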

In the third stage, parameters of the trained initial image coding module, the trained initial image decoding module and the trained initial text coding module are fixed, and parameters of the initial feature extraction module and the initial feature fusion module are integrally trained, and the method specifically comprises the following steps:

inputting each simulated sample image into a trained initial image coding module for multi-scale feature extraction to obtain second multi-scale image features of each simulated sample image; inputting image description texts corresponding to each simulated sample image into a trained initial text encoding module to perform text feature encoding to obtain second text features corresponding to each simulated sample image, and inputting the second multi-scale image features and the second text features into a feature alignment layer of a trained initial image decoding module to perform text feature and image feature alignment to obtain first fusion features corresponding to each simulated sample image;

then, inputting the first fusion feature into an initial feature extraction module for depth feature extraction to obtain a second fusion feature, and inputting the second fusion feature and the second multi-scale image feature into the initial feature fusion module for feature fusion to obtain target fusion features corresponding to each simulated sample image;

In some embodiments, the initial feature fusion module includes a second feature fusion layer and a self-attention layer;

As shown in fig. 3, the initial feature fusion module includes a second feature fusion layer and a self-attention layer; the second feature fusion layer is used for carrying out fusion operation on the second fusion features and the second multi-scale image features so as to obtain corresponding target fusion features; the self-attention layer is used for carrying out feature extraction and self-attention operation on the second fusion features and the second multi-scale image features so as to assist the model in understanding the relevance among the features, and further carrying out weighting processing according to the relevance among the features, so that key information in the simulated sample images can be better captured, and target fusion features corresponding to the simulated sample images can be obtained.

Wherein, the calculation formula of the self-attention layer is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where d_k is the length of the key vector, and Q, K and V are the different vectors obtained by multiplying the input matrix by different weights, namely the query vector, the key vector and the value vector, respectively.
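To make the self-attention computation concrete, the following sketch implements standard scaled dot-product self-attention, softmax(QKᵀ/√d_k)·V, on small matrices represented as nested lists. The helper names and the use of plain Python lists are choices made for the example; a real model would use a tensor library and learned weight matrices.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Return (output, attention weights) for input matrix x.

    Q, K and V are obtained by multiplying x by three different weight matrices.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])  # length of each key vector
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights
```

Each row of the returned weights sums to 1 and expresses how strongly one feature position attends to every other position, which is the relevance-based weighting described above.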

And then, inputting the target fusion characteristics and the abnormality detection question text corresponding to each simulation sample image into an initial language module for question-answer reasoning to obtain an abnormality detection answer prediction result corresponding to each simulation sample image.

Based on the abnormality detection answer prediction result and the abnormality detection answer label, determining an abnormality detection question-answer loss function, and performing iterative training on the initial feature extraction module and the initial feature fusion module based on the abnormality detection question-answer loss function.

Thus, the initial feature extraction module after the third stage training is used as a final initial feature extraction module after the training, and the initial feature fusion module after the third stage training is used as a final initial feature fusion module after the training.

The trained initial image encoding module is used as the target image encoding module, the trained initial feature fusion module is used as the target feature fusion module, and the initial language module is used as the target language module, so that the target visual language model is constructed.

It should be noted that the learning rate decay strategy of the third-stage training may also be an optimization strategy combining warmup and CosineAnnealing.
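The three training stages differ only in which modules are frozen and which are updated, which can be summarised as a small lookup table. The module keys below are shorthand names invented for this sketch, and treating the language module as frozen in the third stage is an assumption drawn from the fact that it is carried over unchanged as the target language module.

```python
# Shorthand module names (hypothetical) for the six modules of the initial model.
MODULES = (
    "image_encoder", "image_decoder", "text_encoder",
    "feature_extractor", "feature_fusion", "language_module",
)

# Modules that are updated in each stage; every other module is frozen.
TRAINABLE_BY_STAGE = {
    1: {"image_decoder", "text_encoder"},         # localization loss
    2: {"image_decoder", "image_encoder"},        # localization loss
    3: {"feature_extractor", "feature_fusion"},   # question-answer loss
}


def trainable_flags(stage):
    """Return {module name: True if trained in this stage, False if frozen}."""
    trainable = TRAINABLE_BY_STAGE[stage]
    return {name: name in trainable for name in MODULES}
```

In a deep learning framework these flags would typically be applied by enabling or disabling gradient computation for the parameters of each module before the stage begins.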

According to the method provided by this embodiment, multi-stage model training provides the initial language module with the knowledge and fine-grained semantic understanding capability required for anomaly detection, so that the anomaly detection task can be completed better and the content of the image data can be described more accurately; moreover, the consistency constraint between the abnormality detection positioning prediction result of a simulated sample image and the answer output by the initial language module strengthens the interaction between the modules in the model, which improves the anomaly detection precision of the whole model. As a result, industrial quality inspection can be carried out efficiently and accurately with computer vision in place of human eyes, and defect positions can be located efficiently and accurately through dialogue with the model.

In some embodiments, the iteratively training the initial visual language model based on the sample data set to obtain the target visual language model includes:

Optionally, as shown in fig. 2, the sample data set processing step specifically further includes: in order to improve the training efficiency of the target visual language model, at least one kind of data preprocessing may be performed on each simulated sample image in the sample data set to improve the quality of the sample images and normalize them, thereby improving the quality and feature representation of the sample data and, in turn, the performance and effect of the model during training.

The data preprocessing includes at least one of contrast enhancement processing, edge sharpening processing, smoothing filtering processing and normalization processing. Contrast enhancement increases the difference between brightness levels in the image so that details become more obvious, which helps the model capture details and features in the image and improves its accuracy; edge sharpening is a technique that enhances object edges and details in the image so that edge features become more obvious, thereby improving the visual quality of the image and enhancing contrast; smoothing filtering is a technique that reduces the difference between adjacent pixel values by blurring the pixel values in the image, which reduces noise and fine detail so that the model can focus on the main features and information; image normalization may use min-max normalization to change the value range of the image pixel values from [0, 255] to [0, 1], which helps the model process and compare pixel values across different images, improves the stability and generalization capability of the model, and accelerates network training.
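Two of these preprocessing operations can be sketched on a grayscale image stored as a list of rows. The min-max normalization follows the description above; the 3×3 mean filter is only one possible smoothing filter and is an assumption for the example, as the embodiment does not name a specific kernel.

```python
def min_max_normalize(image):
    """Rescale pixel values to [0, 1] using the image's own minimum and maximum."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant image carries no contrast; map it to all zeros.
        return [[0.0 for _ in row] for row in image]
    return [[(v - lo) / (hi - lo) for v in row] for row in image]


def mean_filter_3x3(image):
    """Smooth the image by averaging each pixel with its neighbours.

    Border pixels are averaged over the neighbours that actually exist.
    """
    h, w = len(image), len(image[0])
    out = []
    for r in range(h):
        out_row = []
        for c in range(w):
            window = [
                image[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, h))
                for cc in range(max(c - 1, 0), min(c + 2, w))
            ]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out
```

For an 8-bit image that spans the full range, min_max_normalize maps 0 to 0.0 and 255 to 1.0, matching the [0, 255] to [0, 1] change of range described above.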

In some embodiments, the iteratively training the initial visual language model based on the preprocessed sample data set to obtain the target visual language model includes:

Optionally, as shown in fig. 2, the sample data set processing step specifically further includes: after the preprocessed sample data set is obtained, at least one data enhancement process can be performed on each simulated sample image in the preprocessed sample data set, so that sample data in the sample data set is further enriched, the diversity of the sample data is greatly improved, the performance of a model is improved, and therefore the accuracy of anomaly detection is further improved.

The data enhancement processing includes, but is not limited to, at least one of random rotation, random flipping, random scaling and random cropping. Random rotation rotates the image by a random angle within a set angle range, which enhances the diversity of the data and helps the model learn to detect defects at different angles; random flipping flips the image horizontally or vertically with a certain probability to enhance the diversity of the data; random scaling and random cropping scale and crop the image to change its size and viewing angle, which helps the model handle defects of different sizes and locations.
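A simplified sketch of such data enhancement is shown below. It applies the same random transform to an image and to its positioning label (mask) so that the two stay aligned. Several simplifications are assumptions of the example: rotation is restricted to multiples of 90 degrees to avoid interpolation, random scaling is omitted, and the function names and the flip probability are illustrative.

```python
import random


def hflip(img):
    """Flip horizontally (mirror each row)."""
    return [row[::-1] for row in img]


def vflip(img):
    """Flip vertically (reverse the row order)."""
    return img[::-1]


def rot90(img):
    """Rotate 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def crop(img, top, left, height, width):
    """Cut out a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


def augment(image, mask, rng, crop_size, flip_prob=0.5):
    """Apply identical random flips, rotation and crop to image and mask.

    rng is a random.Random instance; crop_size = (height, width) must fit
    inside the image after rotation.
    """
    if rng.random() < flip_prob:
        image, mask = hflip(image), hflip(mask)
    if rng.random() < flip_prob:
        image, mask = vflip(image), vflip(mask)
    for _ in range(rng.randrange(4)):  # 0, 90, 180 or 270 degrees
        image, mask = rot90(image), rot90(mask)
    ch, cw = crop_size
    top = rng.randrange(len(image) - ch + 1)
    left = rng.randrange(len(image[0]) - cw + 1)
    return crop(image, top, left, ch, cw), crop(mask, top, left, ch, cw)
```

Passing a seeded random.Random makes the augmentation reproducible, which is convenient when comparing training runs.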

The method for detecting the abnormality provided in this embodiment is compared with the existing method under the condition of the same parameter configuration by taking a specific industrial quality inspection problem as an example, and the effectiveness of the method for detecting the abnormality provided in this embodiment is verified.

Fig. 4 is a schematic simulation diagram of the anomaly detection method based on a large visual language model according to the present embodiment; as shown in fig. 4, the abscissa Epoch is the number of training iterations of the model, and the ordinate Image-AUC (the image-level area under the receiver operating characteristic curve) takes values from 0 to 1, where a value closer to 1 indicates better model performance. Comparing the visual language model with the fused self-attention layer provided in this embodiment (hereinafter also referred to as Attention) with a visual language model without the self-attention layer (hereinafter also referred to as No Attention) shows that the former achieves the better Image-AUC, which indicates that the visual language model with self-attention provided in this embodiment has the better detection effect and further demonstrates the effectiveness of the anomaly detection method provided in this embodiment.
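For reference, an image-level AUC can be computed directly from per-image anomaly scores. The sketch below uses the rank-based formulation, in which the area under the ROC curve equals the probability that a randomly chosen abnormal image receives a higher score than a randomly chosen normal image, with ties counted as one half. The function name and input format are assumptions; how the embodiment derives a per-image score is not specified here.

```python
def image_auc(labels, scores):
    """Area under the ROC curve for image-level anomaly scores.

    labels: 1 for abnormal images, 0 for normal images.
    scores: anomaly score per image (higher means more likely abnormal).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one normal and one abnormal image")
    # Count abnormal/normal pairs that are ranked correctly; ties count as 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking and a value of 1.0 to a perfect separation of normal and abnormal images.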

The abnormality detection device based on the large visual language model provided by the invention is described below, and the abnormality detection device based on the large visual language model described below and the abnormality detection method based on the large visual language model described above can be referred to correspondingly with each other.

Fig. 5 is a schematic structural diagram of an anomaly detection device based on a large visual language model according to the present embodiment, where the device includes a feature extraction unit 501 configured to input an industrial image to be detected into a target image encoding module in a target visual language model, so as to obtain multi-scale image features of the industrial image to be detected; the feature fusion unit 502 is configured to input the multi-scale image features to a target feature fusion module in the target visual language model, so as to obtain fusion features; the detection unit 503 is configured to input the fusion feature and an abnormality detection question text corresponding to the industrial image to be detected to a target language module in the target visual language model, so as to obtain an abnormality detection answer text corresponding to the abnormality detection question text; the target visual language model is obtained by training based on simulated sample images with various defect types, an abnormality detection question text, an image description text, an abnormality detection positioning label and an abnormality detection answer label corresponding to each simulated sample image; each of the simulated sample images is a sample image generated based on various simulated defect morphologies and each of the normal samples; each simulated defect form is generated by simulating various different types of industrial defect forms based on a Bezier curve generation algorithm; the normal sample is a sample industrial image without a defective area.
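The division of the device into three units can be pictured as a simple pipeline object. In this sketch the three modules are passed in as plain callables; the class name, attribute names and method name are hypothetical, and the stand-in callables in the usage note are placeholders rather than real models.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class AnomalyDetectionDevice:
    """Chains the three units of the device (hypothetical structure)."""

    image_encoder: Callable[[Any], Any]         # target image encoding module
    feature_fusion: Callable[[Any], Any]        # target feature fusion module
    language_module: Callable[[Any, str], str]  # target language module

    def detect(self, industrial_image: Any, question_text: str) -> str:
        # Feature extraction unit 501: multi-scale image features.
        multi_scale_features = self.image_encoder(industrial_image)
        # Feature fusion unit 502: fusion features.
        fused = self.feature_fusion(multi_scale_features)
        # Detection unit 503: answer text for the question text.
        return self.language_module(fused, question_text)
```

With stand-in callables, for example `AnomalyDetectionDevice(lambda img: [img], lambda feats: feats[0], lambda fused, q: f"{q} -> {fused}")`, calling `detect("img", "Is there any anomaly?")` simply threads the inputs through the three units in order.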

According to the device provided by this embodiment, on the one hand, various types of industrial defect forms are simulated by a Bezier curve generation algorithm to generate various simulated defect forms, these simulated defect forms are combined with various normal samples to obtain rich simulated sample images of various different defect types, and the initial visual language model is then trained on these simulated sample images, so that the efficiency and accuracy of anomaly detection can be effectively improved even when abnormal samples are scarce; on the other hand, multi-scale image feature extraction and feature fusion are performed on the industrial image to be detected by the target visual language model, and anomaly detection is performed based on the fusion features and the abnormality detection question text, so that the image features of industrial defects are integrated into the large visual language model, which can then be applied quickly and accurately to anomaly detection in various industrial scenes, further improving the efficiency and accuracy of anomaly detection.

Fig. 6 illustrates a physical schematic diagram of an electronic device, as shown in fig. 6, which may include: processor 601, communication interface (Communications Interface) 602, memory 603 and communication bus 604, wherein processor 601, communication interface 602, memory 603 complete the communication between each other through communication bus 604. The processor 601 may invoke logic instructions in the memory 603 to perform the anomaly detection method based on the large visual language model provided by the methods described above.

Further, the logic instructions in the memory 603 described above may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program, where the computer program can be stored on a non-transitory computer readable storage medium, and when the computer program is executed by a processor, the computer can execute the anomaly detection method based on the large visual language model provided by the above methods.

In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the large visual language model-based anomaly detection method provided by the above methods.

The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without undue burden.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An anomaly detection method based on a large visual language model, comprising the steps of:

2. The abnormality detection method based on a large visual language model according to claim 1, wherein the target visual language model is trained based on the steps of:

3. The abnormality detection method based on a large visual language model according to claim 2, wherein the initial visual language model includes an initial image encoding module, an initial image decoding module, an initial text encoding module, an initial feature extraction module, an initial feature fusion module, and an initial language module;

4. The large visual language model based anomaly detection method of claim 3, wherein the initial image decoding module comprises a feature alignment layer and an anomaly localization layer, the feature alignment layer comprising a feature pyramid layer and a first feature fusion layer;

5. The large visual language model based anomaly detection method of claim 3, wherein the initial feature fusion module comprises a second feature fusion layer and a self-attention layer;

6. The abnormality detection method based on a large visual language model according to any one of claims 2 to 5, wherein the iterative training of an initial visual language model based on the sample data set to obtain the target visual language model includes:

7. The method for anomaly detection based on a large visual language model according to claim 6, wherein the iterative training of the initial visual language model based on the preprocessed sample data set to obtain the target visual language model comprises:

8. An anomaly detection device based on a large visual language model, comprising:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the large visual language model-based anomaly detection method of any one of claims 1 to 7 when the program is executed.

10. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the large visual language model-based anomaly detection method of any one of claims 1 to 7.