CN115564983A - Target detection method and device, electronic equipment, storage medium and application thereof - Google Patents

Target detection method and device, electronic equipment, storage medium and application thereof

Info

Publication number
CN115564983A
CN115564983A
Authority
CN
China
Prior art keywords
sample
feature
dimension
network
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210555245.5A
Other languages
Chinese (zh)
Inventor
孙磊
苏浩
陈浩森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202210555245.5A priority Critical patent/CN115564983A/en
Publication of CN115564983A publication Critical patent/CN115564983A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0004 Industrial image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30108 Industrial image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A target detection method, a target detection device, an electronic device, a storage medium and an application thereof are provided. The method comprises the following steps: obtaining a support sample x_s and a query sample x_q as two input image samples, wherein the support sample x_s is a defect-free normal sample and the query sample x_q is the sample to be tested; extracting features from the support sample x_s and the query sample x_q with two backbone networks of identical structure and shared weights to obtain the corresponding feature maps G_w(x_s) and G_w(x_q); inputting the feature maps G_w(x_s) and G_w(x_q) into a feature enhancement network to obtain the respectively enhanced and/or suppressed feature maps v_s and v_q; performing a similarity measurement on the enhanced and/or suppressed feature maps v_s and v_q with a feature matching network and outputting the measurement result H(v_s, v_q); inputting the measurement result H(v_s, v_q) into a YOLO layer module for regression calculation, and predicting the defect location and/or confidence in the query sample x_q based on the result of the regression calculation. The method provided by the invention augments small-sample training data, improves generalization performance, increases real-time detection speed and localizes defects more accurately.

Description

Target detection method and device, electronic equipment, storage medium and application thereof
Technical Field
The invention belongs to the technical field of industrial visual inspection, and particularly relates to a target detection method, a target detection device, target detection electronic equipment, a computer-readable storage medium and application of the target detection method in texture surface defect detection.
Background
Machine vision is the artificial intelligence technique most closely integrated with industrial applications. Machine vision technology analyzes image data acquired by sensors to accomplish tasks such as image classification and target localization, and feeds the results back to the corresponding equipment for subsequent operations. Among these tasks, surface defect detection plays a very important role. With the development of deep learning, represented by the Convolutional Neural Network (CNN), more and more deep learning models are applied to the field of surface defect detection.
However, in such industrial scenarios, texture surface defect detection faces many challenges: the illumination environment varies greatly, defect types are diverse, defect sizes vary widely, and detection is strongly affected by image noise caused by camera shake, interference from the image background, and the like. Meanwhile, training data samples are few: the cost of producing defective samples is too high, so the number of defective samples is small, and a general deep learning model has difficulty learning defect image features from a small number of samples. Generalization performance is low: the generalization ability of deep-learning-based surface defect detection methods is limited, and once the style or category of the object to be inspected changes and differs greatly from the training samples, the model can hardly guarantee effective detection. The detection speed of deep learning models is low, which makes them difficult to apply in scenarios with strict real-time requirements; moreover, existing methods can only classify whether an image contains a defect but cannot identify the specific position of the defect.
Therefore, it remains necessary and urgent to develop a single-sample learning target detection method for texture surface defects that can augment the training samples, improve the generalization performance of deep-learning surface defect detection, classify in real time whether an image contains defects, and identify the specific positions of texture surface defects.
Disclosure of Invention
The invention aims to provide a target detection method, a target detection device, an electronic device, a storage medium and an application thereof, so as to solve the technical problems of a small number of training samples, low generalization performance of surface defect detection methods, poor real-time detection performance and inaccurate defect localization in texture surface defect detection.
The purpose of the invention and the technical problem to be solved are realized by adopting the following technical scheme.
The first aspect of the present invention provides a target detection method, comprising the following steps: obtaining a support sample x_s and a query sample x_q as two input image samples, wherein the support sample x_s is a defect-free normal sample and the query sample x_q is the sample to be tested; extracting features from the support sample x_s and the query sample x_q with two backbone networks of identical structure and shared weights to obtain the corresponding feature maps G_w(x_s) and G_w(x_q); inputting the feature maps G_w(x_s) and G_w(x_q) into a feature enhancement network to obtain the respectively enhanced and/or suppressed feature maps v_s and v_q; performing a similarity measurement on the enhanced and/or suppressed feature maps v_s and v_q with a feature matching network and outputting the measurement result H(v_s, v_q); inputting the measurement result H(v_s, v_q) into a YOLO layer module for regression calculation, and predicting the defect location and/or confidence in the query sample x_q based on the regression result.
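To make the data flow of these steps concrete, the following is a minimal PyTorch-style sketch of the claimed pipeline; the module interfaces and names are illustrative assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class SiameseDefectDetector(nn.Module):
    """Hypothetical sketch of the claimed pipeline: shared backbone ->
    feature enhancement -> feature matching -> YOLO regression head."""
    def __init__(self, backbone: nn.Module, enhance: nn.Module,
                 match: nn.Module, yolo_head: nn.Module):
        super().__init__()
        self.backbone = backbone   # G_w, the same weights serve both inputs
        self.enhance = enhance     # feature enhancement network
        self.match = match         # feature matching network
        self.yolo_head = yolo_head # YOLO layer module

    def forward(self, x_s: torch.Tensor, x_q: torch.Tensor):
        g_s = self.backbone(x_s)            # G_w(x_s): defect-free support sample
        g_q = self.backbone(x_q)            # G_w(x_q): query sample under test
        v_s, v_q = self.enhance(g_s, g_q)   # mutually enhanced/suppressed maps
        h = self.match(v_s, v_q)            # measurement result H(v_s, v_q)
        return self.yolo_head(h)            # defect location / confidence regression
```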
Preferably, extracting features from the support sample x_s and the query sample x_q with two backbone networks of identical structure and shared weights to obtain the corresponding feature maps G_w(x_s) and G_w(x_q) comprises: forming the backbone network from YOLO-fastest, denoted G_w; inputting the support sample x_s and the query sample x_q into the input ends of the backbone networks for feature extraction; and obtaining the corresponding feature maps G_w(x_s) and G_w(x_q).
Preferably, inputting the feature maps G_w(x_s) and G_w(x_q) into the feature enhancement network to obtain the respectively enhanced and/or suppressed feature maps v_s and v_q comprises: constructing the feature enhancement network based on an improved Non-Local attention mechanism; inputting the two feature maps G_w(x_s) and G_w(x_q) into the input end of the feature enhancement network; mutually enhancing the features with stronger correlation and mutually suppressing the features with weaker similarity; and outputting the respectively enhanced and/or suppressed feature maps v_s and v_q through the feature enhancement network. Here the feature map G_w(x_s) has dimension w_s*h_s*c and G_w(x_q) has dimension w_q*h_q*c, where h_s and h_q denote the heights of the feature maps G_w(x_s) and G_w(x_q), w_s and w_q denote their widths, and c is their number of channels.
Preferably, mutually enhancing the features with stronger correlation and mutually suppressing the features with weaker similarity comprises: convolving the feature map G_w(x_s) with two point-wise convolution networks so that the spatial size of G_w(x_s) is unchanged and its number of channels is halved, the results being denoted g(G_w(x_s)) and φ(G_w(x_s)); convolving the feature map G_w(x_q) with two point-wise convolution networks so that the spatial size of G_w(x_q) is unchanged and its number of channels is halved, the results being denoted g(G_w(x_q)) and θ(G_w(x_q)); reconstructing φ(G_w(x_s)) and θ(G_w(x_q)) into two-dimensional matrices; multiplying the two matrices to obtain a matrix of dimension (w_q·h_q)×(w_s·h_s); inputting the matrix of dimension (w_q·h_q)×(w_s·h_s) into a network layer formed by a softmax function to complete the similarity calculation; outputting a matrix of dimension (w_q·h_q)×(w_s·h_s); reconstructing g(G_w(x_s)) and g(G_w(x_q)) into matrices; matrix-multiplying each of them with the output of the softmax function (transposed as required by the dimensions) to obtain two matrices of dimensions (w_s·h_s)×(c/2) and (w_q·h_q)×(c/2); reconstructing these two matrices into two feature maps of dimensions w_s*h_s*(c/2) and w_q*h_q*(c/2); inputting the two feature maps into two point-wise convolution networks for convolution to raise the number of channels back to c; and adding the up-dimensioned results to G_w(x_s) and G_w(x_q) respectively to obtain v_s and v_q:

v_s,j = G_w(x_s)_j + W_z ∑_i f(W_θ G_w(x_q)_i, W_φ G_w(x_s)_j) · W_g G_w(x_q)_i    (1)

v_q,i = G_w(x_q)_i + W_z′ ∑_j f(W_θ G_w(x_q)_i, W_φ G_w(x_s)_j) · W_g G_w(x_s)_j    (2)

wherein W_φ, W_θ, W_g, W_z and W_z′ are all coefficients of linear transformations; i and j index the i-th element of W_θ G_w(x_q) and the j-th element of W_φ G_w(x_s); f(·,·) denotes the similarity calculation function of the feature maps; v_s has dimension w_s*h_s*c; and v_q has dimension w_q*h_q*c.
Preferably, the similarity calculation function f(·,·) calculates the similarity between two vectors using a radial basis function, as follows:

f(u, v) = exp(−‖u − v‖² / (2σ²))

wherein u = W_θ G_w(x_q)_i and v = W_φ G_w(x_s)_j denote two column vectors.
Preferably, the feature matching network performing the similarity measurement on the respectively enhanced and/or suppressed feature maps v_s and v_q and outputting the measurement result H(v_s, v_q) comprises: inputting the respectively enhanced and/or suppressed feature maps v_s, of dimension w_s*h_s*c, and v_q, of dimension w_q*h_q*c, into the feature matching network; combining each of the w_s*h_s vectors of dimension c×1 in v_s pairwise with each of the w_q*h_q vectors of dimension c×1 in v_q and calculating the similarity according to the similarity calculation formula, obtaining a similarity feature map of dimension w_q*h_q*(w_s·h_s·c); the similarity calculation formula is:

similarity(v_s,i, v_q,j) = (v_s,i − v_q,j)²

wherein the subscripts i and j denote the i-th vector of v_s and the j-th vector of v_q; applying a grouped convolution to the similarity feature map of dimension w_q*h_q*(w_s·h_s·c) to obtain a similarity feature map of dimension w_q*h_q*c, the grouped convolution having convolution kernel size 1×1, stride 1 and c groups; concatenating the similarity feature map of dimension w_q*h_q*c with the respectively enhanced and/or suppressed feature map v_q along the channel dimension; the final output of the feature matching network is the measurement result H(v_s, v_q) of dimension w_q*h_q*(2c).
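The matching computation described in this claim can be sketched as follows; the NCHW tensor layout and the channel ordering of the similarity map are assumptions of this sketch, not taken from the patent figures.

```python
import torch
import torch.nn as nn

class FeatureMatching(nn.Module):
    """Sketch of the described matching: pairwise squared differences between
    every support-position vector and every query-position vector, reduced by
    a grouped 1x1 convolution, then concatenated with v_q."""
    def __init__(self, c: int, hs: int, ws: int):
        super().__init__()
        # groups=c: each output channel mixes only the w_s*h_s entries that
        # belong to the same input channel, matching the claim's c groups.
        self.reduce = nn.Conv2d(hs * ws * c, c, kernel_size=1,
                                stride=1, groups=c)

    def forward(self, v_s: torch.Tensor, v_q: torch.Tensor) -> torch.Tensor:
        b, c, hq, wq = v_q.shape
        _, _, hs, ws = v_s.shape
        s = v_s.reshape(b, c, hs * ws, 1)           # support-position vectors
        q = v_q.reshape(b, c, 1, hq * wq)           # query-position vectors
        sim = (s - q) ** 2                          # (b, c, hs*ws, hq*wq)
        sim = sim.reshape(b, c * hs * ws, hq, wq)   # channel-major grouping
        sim = self.reduce(sim)                      # (b, c, hq, wq)
        return torch.cat([sim, v_q], dim=1)         # H(v_s, v_q): (b, 2c, hq, wq)
```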
Preferably, inputting the measurement result H(v_s, v_q) into a YOLO layer for regression calculation and predicting the defect location and/or confidence in the query sample x_q based on the regression result comprises: inputting the measurement result H(v_s, v_q) into the YOLO layer for regression calculation; obtaining a first error loss between the predicted defect location and its ground-truth label based on the CIOU loss function, and predicting the defect location in the query sample x_q based on the first error loss; and/or, when predicting the confidence, obtaining a second error loss between the predicted confidence and the ground-truth label of the defect location based on a ternary loss function of the twin network, and predicting the confidence of the defect location in the query sample x_q based on the second error loss; the ternary loss function is:

TripleLoss = ∑ max((1 − y)·y′ + y·(m − y′), 0)

wherein y is the label, 0 indicating no defect and 1 indicating a defect; y′ denotes the confidence output by the YOLO layer, with value range [0, 1]; and m denotes the margin, with m = 1.
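A direct transcription of the stated ternary loss, assuming y and y′ are element-wise tensors of labels and YOLO-layer confidences:

```python
import torch

def triple_loss(y: torch.Tensor, y_pred: torch.Tensor, m: float = 1.0) -> torch.Tensor:
    """y: 0/1 defect labels; y_pred: YOLO-layer confidences in [0, 1]; m: margin."""
    # For y = 0 the term penalizes any positive confidence; for y = 1 it
    # penalizes confidence falling short of the margin m.
    return torch.clamp((1 - y) * y_pred + y * (m - y_pred), min=0).sum()
```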
Preferably, when predicting the defect location, the detection method further comprises: filtering out overlapping defect-location detections with the non-maximum suppression (NMS) algorithm, and outputting the location and confidence of the defect in the query sample x_q.
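For the NMS filtering step, a sketch using torchvision's non-maximum suppression; the 0.5 IoU threshold is an illustrative choice, not a value specified here:

```python
import torch
from torchvision.ops import nms

def filter_detections(boxes: torch.Tensor, scores: torch.Tensor, iou_thr: float = 0.5):
    """boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,) YOLO-layer confidences."""
    keep = nms(boxes, scores, iou_thr)  # indices of boxes surviving suppression
    return boxes[keep], scores[keep]
```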
The second aspect of the present invention provides an object detection apparatus, comprising the following modules: a sample acquisition module for obtaining a support sample x_s and a query sample x_q as two input image samples, wherein the support sample x_s is a defect-free normal sample and the query sample x_q is the sample to be tested; a backbone network module composed of two backbone networks of identical structure and shared weights, the backbone networks extracting features from the support sample x_s and the query sample x_q respectively to obtain the corresponding feature maps G_w(x_s) and G_w(x_q); a feature enhancement module into which the feature maps G_w(x_s) and G_w(x_q) are input to obtain the respectively enhanced and/or suppressed feature maps v_s and v_q; a feature matching module for performing a similarity measurement on the enhanced and/or suppressed feature maps v_s and v_q and outputting the measurement result H(v_s, v_q); and a YOLO layer module into which the measurement result H(v_s, v_q) is input for regression calculation, the defect location and/or confidence in the query sample x_q being predicted based on the regression result.
A third aspect of the present invention provides an electronic device, comprising: a memory for storing non-transitory computer readable instructions; and a processor for executing the computer readable instructions such that the computer readable instructions, when executed by the processor, implement the object detection method of any one of claims 1 to 8.
A fourth aspect of the present invention provides a computer-readable storage medium comprising computer instructions which, when run on a device, cause the device to execute the object detection method described above.
The fifth aspect of the present invention provides an application of the above target detection method in texture surface defect detection.
Compared with the prior art, the invention has obvious advantages and beneficial effects. By means of the technical scheme, the invention at least has the following advantages and beneficial effects:
1. The invention adopts a twin network structure, so that the input of the target detection method is not a single image sample but a sample pair consisting of two image samples. This structure increases the number of training samples, which alleviates, to a certain extent, the problem that producing defective samples is too costly, so few defective samples exist and a deep learning model has difficulty learning target defect image features from a small number of samples.
2. The method adopts metric learning within the twin network and improves the generalization performance of the model by learning the similarity and difference between corresponding features of the input samples. This twin-network-based metric learning improves the generalization performance of the single-sample learning target detection method in texture surface defect detection.
3. The method predicts the position and confidence of the target defect by regressing on deep feature maps through a fusion and transformation of the YOLOv3 target detection method and model. The designed feature enhancement network and feature matching network allow the YOLOv3 detection method and the backbone network module to be fused well under the twin-network framework, so that the method can not only classify whether an image contains defects but also identify the specific positions of the target defects, yielding better results in texture surface defect detection.
4. On one hand, the invention adopts the open-source lightweight network YOLO-fastest as the backbone network. Compared with the classical YOLOv3 based on the DarkNet backbone, the backbone used here has fewer parameters, lower computational complexity and stronger real-time performance. On the other hand, a single-stage single-sample learning target detection method and model are realized through fusion and transformation of the YOLOv3 model; compared with existing two-stage single-sample learning target detection methods and models, the process is simpler and the computation faster in both training and inference, which enhances the real-time performance of target detection.
The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical means of the present invention more clearly understood, the present invention may be implemented in accordance with the content of the description, and in order to make the above and other objects, features, and advantages of the present invention more clearly understood, the following preferred embodiments are specifically described below with reference to the accompanying drawings.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating a single-sample learning target detection method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a data set used in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a method of constructing a data set in an embodiment of the invention;
FIG. 4 is a flow diagram of a technical framework for a single sample learning object detection model in an embodiment of the present invention;
FIG. 5 is a schematic diagram of the internal structure of a feature enhancement network in an embodiment of the invention;
FIG. 6 is a schematic diagram of the internal structure of a feature matching network in an embodiment of the present invention;
FIG. 7 is a comparison of the detection results of an embodiment of the present invention with those of the classical YOLOv3-yolofastest on test set samples;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
To further explain the technical means and effects of the present invention adopted to achieve the predetermined objects, the following detailed description of the embodiments, structures, features and effects thereof according to the present invention will be made with reference to the accompanying drawings and preferred embodiments.
In industrial scenarios, texture surface defect detection based on deep learning faces many difficulties. (1) Training data samples are few: the cost of producing defective samples is too high, so the number of defective samples is small, and a general deep learning model has difficulty learning defect image features from a small number of samples. (2) Generalization performance is low: the generalization ability of deep-learning-based surface defect detection methods is limited, and once the style or category of the object to be inspected changes and differs greatly from the training samples, the model can hardly guarantee effective detection. (3) Real-time performance is poor: the detection speed of deep learning models is low, making them difficult to apply in scenarios with strict real-time requirements. (4) Position detection is inaccurate: although related methods address the first three problems, existing methods can only classify whether an image contains defects but cannot identify their specific positions.
With the great development of deep learning, represented by Convolutional Neural Networks (CNNs), more and more deep learning models are applied to the field of surface defect detection. Because deep learning learns from large amounts of data, its performance far exceeds that of traditional schemes based on manually designed features when handling complex scenarios such as few training samples, low generalization performance, poor real-time performance and poor position detection.
Aiming at the above problems in the prior art, the invention provides a single-sample learning target detection method applied to the field of texture surface defect detection, comprising: obtaining a support sample x_s and a query sample x_q as two input image samples, wherein the support sample x_s is a defect-free normal sample and the query sample x_q is the sample to be tested; extracting features from the support sample x_s and the query sample x_q with two backbone networks of identical structure and shared weights to obtain the corresponding feature maps G_w(x_s) and G_w(x_q); inputting the feature maps G_w(x_s) and G_w(x_q) into a feature enhancement network to obtain the respectively enhanced and/or suppressed feature maps v_s and v_q; performing a similarity measurement on the enhanced and/or suppressed feature maps v_s and v_q with a feature matching network and outputting the measurement result H(v_s, v_q); and inputting the measurement result H(v_s, v_q) into a YOLO layer module for regression calculation, predicting the defect location and/or confidence in the query sample x_q based on the regression result. The target detection method provided by the invention realizes efficient and rapid defect localization and, to a certain extent, solves the problems of few training samples, low generalization performance, poor real-time performance and poor position detection in existing texture surface defect detection schemes.
In view of the problems in the above technologies, an object of a preferred embodiment of the present invention is to provide a single-sample learning target detection method for detecting defects on texture surfaces. The overall framework of the technical scheme of the preferred embodiment fuses and improves the twin network and the model structure of the YOLOv3 target detection method; the framework structure and detection flow are shown in FIG. 1 and FIG. 4. The technical scheme of the invention specifically comprises the following steps:
step S1: obtaining a support sample x s And query sample x q Two image samples as inputs; wherein the supporting sample x s Is a defect-free normal sample, the query sample x q Is the sample to be tested. Specifically, the target detection method model is used as a single-sample learning target detection method model, and the input end of the target detection method model is a sample pair consisting of two image samples. In texture surface defect detection, an input sample is a normal sample without defects, referred to herein as a support sample, denoted as x s The other sample is the sample to be tested, referred to herein as the query sample, denoted x q . And after the two samples are respectively subjected to the target detection method model learned by the single sample, outputting the position and the confidence coefficient of the defect in the sample to be detected. When the texture pattern of the sample to be detected changes, only the normal sample at the input end needs to be replaced by the normal sample with the same texture pattern.
Before performing single-sample learning for texture surface defect detection, a data set for the single-sample learning target detection method is first constructed; the construction comprises steps S11-S14, as follows:
s11: a data set suitable for the model in accordance with the preferred embodiment of the present invention is constructed on the basis of the data set of the german DAGM 2007. The data of this data set is an open source data set created in 2007 for industrial image processing competitions. The data set is artificially generated and contains a total of 10 classes of texture patterns, as shown in fig. 2. Each type of data set consists of 1000 non-defective images and 150 defective images. Each defective image will have a corresponding mask map used to mark the location of the defect.
S12: because the images of the original data set are large, each image and its mask map are divided into 4 equal parts; coco-format labels corresponding to the defect images are then regenerated from the mask maps, and the coco-format target boxes after quartering are expressed using the normalized center-point abscissa, normalized center-point ordinate, normalized width and normalized height, as shown in FIG. 3.
S13: the defect-free samples and the defective samples of each texture pattern are stored in two folders respectively, and samples are then randomly paired to construct sample pairs meeting the conditions.
S14: when producing the sample pairs, sample pairs composed of 7 randomly selected classes are used as the training set and the validation set, and the remaining 3 classes are used as the test set. In the matching process, each defective sample is randomly matched with a defect-free sample of the same class to form a sample pair; when all defective samples have been matched, one round of matching is complete. After preprocessing, 3032 defective samples are obtained from the 10 classes. For the training set, 70% of the defective samples are randomly selected and 5 rounds of matching are performed for each class, yielding 90120 sample pairs; for the validation set, the remaining 30% of the defective samples are used and 1 round of matching is performed for each class, yielding 2242 sample pairs; for the test set, 1 round of matching is performed for each class, yielding 1119 sample pairs.
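A sketch of the pairing procedure of S13-S14 for one texture class follows; the directory layout and file extension are illustrative assumptions:

```python
import random
from pathlib import Path

def build_pairs(normal_dir: str, defect_dir: str, rounds: int = 1, seed: int = 0):
    """One matching round pairs every defective sample with a randomly chosen
    defect-free sample of the same texture class, as described above."""
    rng = random.Random(seed)
    normals = sorted(Path(normal_dir).glob("*.png"))
    defects = sorted(Path(defect_dir).glob("*.png"))
    pairs = []
    for _ in range(rounds):
        for defect in defects:              # every defective sample is used
            support = rng.choice(normals)   # random same-class normal sample
            pairs.append((support, defect)) # (x_s, x_q)
    return pairs
```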
The above division of the training and validation sets prevents samples of the validation set from appearing in the training set, ensuring that training samples are not mixed into the validation set and thereby guaranteeing the reliability and validity of the evaluation results. The preferred embodiments of the present invention can be realized in three ways:
1. Hold-out validation: when evaluating the method model, the data are divided into a training set, a validation set and a test set (relatively well suited to large data sets).
2. k-fold validation: the training data are divided into k partitions of the same size (suitable for small data sets; not used by the preferred embodiment of the present invention).
3. Repeated k-fold validation with shuffling, i.e., k-fold cross-validation (suitable when little data is available).
In a preferred embodiment of the present invention, to prevent overfitting, the following is used:
(1) Reducing the model size of the method, i.e., reducing the number of learnable parameters (determined by the number of layers and the number of neurons per layer), and evaluating on the validation set to find the model size corresponding to the optimal target detection method.
(2) Adding weight regularization, i.e., forcing the model weights of the target detection method to take only small values, thereby limiting the complexity of the model; L1 regularization, L2 regularization and the like may be employed.
(3) Adding random deactivation (dropout) regularization: using dropout for a layer means randomly discarding (i.e., setting to 0) some of the layer's output features during training; the dropout ratio is the fraction of features set to 0, typically between 0.2 and 0.5.
The above data-set construction process yields a single-sample image data set for texture surface defect detection. On one hand, by constructing sample pairs, this step expands the training data set, addressing the "few training samples" problem in texture surface defect detection. On the other hand, by adopting the twin-network idea, a target detection model whose input is one image sample is improved into one whose input is a pair of two different image samples; using multiple image samples at the input of the improved twin network increases the number of training samples, which alleviates, to a certain extent, the difficulty deep learning target detection models have in learning target defect image features from a small number of samples.
S2: two backbone networks of identical structure and shared weights extract features from the support sample x_s and the query sample x_q respectively to obtain the corresponding feature maps G_w(x_s) and G_w(x_q). That is, the backbone network is formed from YOLO-fastest and denoted G_w; the support sample x_s and the query sample x_q are input into the input ends of the backbone networks for feature extraction, and the corresponding feature maps G_w(x_s) and G_w(x_q) are obtained.
Specifically, the backbone network adopts the classical YOLO, and more specifically YOLO-fastest, denoted G_w. Here the network model is abstracted as a function mapping, so the variable may be omitted and the network written G_w(·). The network model YOLO-fastest is open source and makes extensive use of depthwise separable convolutions; compared with the classical YOLOv3 backbone based on DarkNet, it greatly reduces the number of backbone parameters and the computational complexity. YOLO-fastest emphasizes single-core real-time inference performance with low CPU occupancy: it can meet real-time requirements on mobile phone terminals, and also on RK3399, Raspberry Pi 4 and various Cortex-A53 low-cost, low-power devices; such embedded devices are much weaker than mobile terminals, so YOLO-fastest has a wide application prospect, and because lower-spec hardware remains compatible, the computation speed is improved at lower cost. S2 comprises steps S21 to S22, as follows.
S21: setting the model hyper-parameters of the backbone network: the image size is set to 256×256, the batch size to 32, and the number of training epochs to 40; the backbone network model uses the adaptive moment estimation (Adam) optimizer. The Adam optimizer combines the advantages of the AdaGrad and RMSProp optimization algorithms, jointly considering the first moment estimate of the gradient (i.e., the mean of the gradient) and the second moment estimate (i.e., the uncentered variance of the gradient) to compute the update step. The update rule of the Adam optimizer is:

θ_t = θ_{t−1} − α · m̂_t / (√v̂_t + ε)

where θ denotes the parameter vector; m̂_t is the bias-corrected first-order moment estimate computed from the exponential moving average of the gradient; v̂_t is the bias-corrected second raw moment estimate computed from the exponential moving average of the squared gradient; θ_{t−1} and θ_t denote the parameters at times t−1 and t; α denotes the default learning rate, with α = 0.001; and ε = 10⁻⁸ prevents the divisor from becoming 0. As the expression shows, the update step is adaptively adjusted from the two angles of the gradient mean and the squared gradient, rather than being determined directly by the current gradient.
Parameter updates of the Adam optimizer are invariant to rescaling of the gradient; the update step is roughly bounded by the initial learning rate; the hyper-parameters are well interpretable and usually need no or only slight tuning; and the learning rate is adjusted automatically, realizing a step-annealing process. Adam suits unstable objective functions with sparse gradients or large gradient noise, and is particularly suitable for single-sample learning scenarios for texture surface defect detection with large-scale data and parameters. It is simple to implement, computationally efficient in computing per-parameter adaptive learning rates, and has low memory requirements. The error used in each parameter update is controlled by the learning rate (step size), which is set to 0.001.
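Under these stated hyper-parameters, the optimizer setup might look as follows; PyTorch's Adam applies the bias-corrected moment estimates of the update rule above, and the beta values are PyTorch defaults, not stated in this description:

```python
import torch

def make_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    # lr = 0.001 and eps = 1e-8 as stated above; betas are library defaults.
    return torch.optim.Adam(model.parameters(), lr=1e-3,
                            betas=(0.9, 0.999), eps=1e-8)
```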
S22: the sample pairs in the training set are input into the YOLO-fastest backbone network, and features are extracted from x_s and x_q to obtain the corresponding feature maps G_w(x_s) and G_w(x_q). The backbone network adopts the weight-sharing idea of the twin network: the two input samples x_s and x_q pass through two backbone networks of identical structure respectively, features are extracted from x_s and x_q, and the corresponding feature maps G_w(x_s) and G_w(x_q) are obtained. Weight sharing of the backbone networks means that the two input samples pass through two backbone networks of identical structure whose weights are exactly the same.
In the main framework of the preferred embodiment of the present invention, as shown in FIG. 4, the twin-network idea is combined with the YOLOv3 idea, where the backbone network is a replaceable part: it may be the backbone of any target detection model of the YOLO series, or an image classification backbone such as MobileNet, ResNet or ShuffleNet, which will not be described again here.
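A minimal sketch of the weight-sharing idea follows: one module, hence one set of weights, processes both inputs. The small convolution stack merely stands in for YOLO-fastest and is not its actual architecture.

```python
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    """Weight sharing in the twin-network sense: the same module (the same
    parameters) processes both samples, rather than two copies that are
    merely initialized alike."""
    def __init__(self):
        super().__init__()
        self.g_w = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x_s: torch.Tensor, x_q: torch.Tensor):
        return self.g_w(x_s), self.g_w(x_q)  # G_w(x_s), G_w(x_q)
```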
S3: inputting the feature maps G_w(x_s) and G_w(x_q) into the feature enhancement network to obtain the respectively enhanced and/or suppressed feature maps v_s and v_q comprises: constructing the feature enhancement network based on an improved Non-Local attention mechanism; inputting the two feature maps G_w(x_s) and G_w(x_q) into the input end of the feature enhancement network; mutually enhancing the features with stronger correlation and mutually suppressing the features with weaker similarity; and outputting the respectively enhanced and/or suppressed feature maps v_s and v_q through the feature enhancement network. Here the feature map G_w(x_s) has dimension w_s*h_s*c and G_w(x_q) has dimension w_q*h_q*c, where h_s and h_q denote the heights of the feature maps, w_s and w_q their widths, and c their number of channels.
In particular, the feature enhancement network employs a modified Non-Local attention mechanism. The Non-Local attention mechanism originates from the field of computer vision: especially for dynamic video sequences, the dependency relationships between frames are very important, and for behavior classification tasks in particular, understanding the global content and the relations between different frames strongly guides the classification result. The common practice is to enlarge the receptive field through recurrent convolution networks or deeper networks to improve the understanding of global content. Nevertheless, this approach remains relatively local in both the temporal and spatial directions, which raises the biggest problems: long-distance information cannot be passed back and forth, and deeper networks have a large computational cost but low efficiency and difficult gradient optimization. Therefore, aiming at the long-distance information transmission problem and at improving long-distance dependency, the preferred embodiment of the present invention adopts non-local operations in the convolutional network, based on the traditional non-local means filtering method, namely: the response at a pixel is the feature-weighted sum over all other positions, and associating each position with all others realizes the Non-Local idea.
Before the feature enhancement network applies the improved Non-Local attention mechanism, the target to be detected in the query sample is first defined as a target belonging to the same category as the support sample. If the target to be detected exists in the query sample, the correlation between its features in the query feature map and the support feature map is strong, whereas features in the query feature map belonging to the background or to other, non-detected targets correlate weakly with the support feature map. The feature enhancement network effectively treats weakly correlated features as noise and strongly correlated features as features of the target to be detected: it suppresses the weakly correlated background features classified as noise, while enhancing the strongly correlated features of the target to be detected. Applying the modified Non-Local attention mechanism comprises the following steps S31 to S35.
S31: as shown in FIG. 5, the feature maps G_w(x_s) and G_w(x_q) output by the backbone networks are taken as the two inputs of the feature enhancement network layer; their dimensions are w_s*h_s*c and w_q*h_q*c respectively, where w_s, h_s, w_q, h_q are the spatial sizes of the feature maps and c is the number of channels.
S32: mutually enhancing the features with strong correlation and mutually suppressing the features with weak similarity comprises: convolving the feature map G_w(x_s) with two point-wise convolution networks so that the spatial size of G_w(x_s) is unchanged and its number of channels is halved, the results being denoted g(G_w(x_s)) and φ(G_w(x_s)); convolving the feature map G_w(x_q) with two point-wise convolution networks so that the spatial size of G_w(x_q) is unchanged and its number of channels is halved, the results being denoted g(G_w(x_q)) and θ(G_w(x_q)).

Specifically, as shown in FIG. 5, the attention mechanism takes the feature maps G_w(x_s) and G_w(x_q) as input; PW denotes a point-wise convolution network, i.e., a convolution with kernel size 1×1 and stride 1. G_w(x_s) is passed through two PW convolutions, leaving the spatial size of the feature map unchanged and halving the number of channels, which yields two results of dimension w_s*h_s*(c/2), denoted g(G_w(x_s)) and φ(G_w(x_s)). Likewise, after G_w(x_q) is passed through two PW convolutions, the spatial size is unchanged and the number of channels is halved, yielding two results of dimension w_q*h_q*(c/2), denoted g(G_w(x_q)) and θ(G_w(x_q)). Since a point-wise convolution corresponds to a matrix operation on the feature map, the respectively enhanced and/or suppressed feature maps v_s and v_q are as follows:

v_s,j = G_w(x_s)_j + W_z ∑_i f(W_θ G_w(x_q)_i, W_φ G_w(x_s)_j) · W_g G_w(x_q)_i    (1)

v_q,i = G_w(x_q)_i + W_z′ ∑_j f(W_θ G_w(x_q)_i, W_φ G_w(x_s)_j) · W_g G_w(x_s)_j    (2)

In formulas (1) and (2), W_φ, W_θ, W_g, W_z and W_z′ are all coefficients of linear transformations; i and j index the i-th element of W_θ G_w(x_q) and the j-th element of W_φ G_w(x_s); and f(·,·) denotes the similarity calculation function of the feature maps. v_s has dimension w_s*h_s*c and v_q has dimension w_q*h_q*c. The network branches correspond to these linear maps: g(G_w(x_s)) corresponds to W_g G_w(x_s), φ(G_w(x_s)) to W_φ G_w(x_s), g(G_w(x_q)) to W_g G_w(x_q), and θ(G_w(x_q)) to W_θ G_w(x_q).
S33: in formulas (1) and (2) of step S32, the inner product of each pair of feature-map vectors must be computed and the similarity calculated; in the convolution network of FIG. 5, the vector inner products are completed by means of matrix multiplication. φ(G_w(x_s)) and θ(G_w(x_q)) are each reconstructed into two-dimensional matrices; the two matrices are multiplied to obtain a matrix of dimension (w_q·h_q)×(w_s·h_s); this matrix is input into a network layer formed by a softmax function to complete the similarity calculation; the output is again a matrix of dimension (w_q·h_q)×(w_s·h_s).

Specifically, as shown in FIG. 5, φ(G_w(x_s)) and θ(G_w(x_q)) are each passed through a reconstruction (reshape) function that converts them into two-dimensional matrices; the two converted matrices are multiplied to obtain a matrix of dimension (w_q·h_q)×(w_s·h_s); this matrix is input into the softmax network layer to complete the similarity calculation.
The calculation of the correlations and the enhancement or suppression of different features according to those correlations are all computed at once in the formula. Specifically, the softmax maps each real-valued vector (a1, a2, a3, a4, …, ai) of the (w_q·h_q)×(w_s·h_s) feature matrix to (b1, b2, b3, b4, …, bi), where each bi is a constant between 0 and 1. The entries can then be ranked by the size of bi: taking the mean bm of the bi as a threshold, entries with bi greater than bm, i.e., more important than average, carry out the enhancement task, while the entries with the smallest weights, with bi below bm, carry out the weakening task. Through these enhancement and weakening tasks based on the feature vector weights, the features of G_w(x_s) correlated with small targets can be strengthened in G_w(x_q), while the features of G_w(x_s) that act as noise for small targets in G_w(x_q) are weakened; likewise, the features of G_w(x_s) correlated with small targets are strengthened in G_w(x_s), and the sample noise of G_w(x_s) is weakened. The softmax function does not change the dimension of the input matrix, so its output matrix still has dimension (w_q·h_q)×(w_s·h_s).
S34: g(G_w(x_s)) and g(G_w(x_q)) are reconstructed into matrices; each is matrix-multiplied with the output of the softmax function (transposed as required by the dimensions) to obtain two matrices of dimensions (w_q·h_q)×(c/2) and (w_s·h_s)×(c/2); these two matrices are then reconstructed into two feature maps of dimensions w_q*h_q*(c/2) and w_s*h_s*(c/2).

Specifically, as shown in FIG. 5, g(G_w(x_s)) and g(G_w(x_q)) are each input into a reshape function, through which they are converted into matrices; each is then matrix-multiplied with the aforementioned (w_q·h_q)×(w_s·h_s) softmax output to obtain two matrices of dimensions (w_q·h_q)×(c/2) and (w_s·h_s)×(c/2). These two matrices are each input into a reshape function, which converts them into two feature maps of dimensions w_q*h_q*(c/2) and w_s*h_s*(c/2).
In the preferred embodiment of the present invention, the feature maps G_w(x_s) and G_w(x_q) generate attention for each other: the feature map G_w(x_s) obtained from the support sample x_s is used to enhance those features of the query feature map G_w(x_q) that are strongly correlated with the small target. This is achieved by applying a dimension-reducing transformation to the feature map, after which the strongly correlated query features of the small target are enhanced in the reduced-dimension feature map.
S35: note that the above formulas use a summation operation, whereas in the network of FIG. 5 the summation is completed by the matrix multiplication. Specifically, as shown in FIG. 5, the two feature maps of dimensions w_s*h_s*(c/2) and w_q*h_q*(c/2) obtained in the previous step are input into two PW convolutions for channel up-dimensioning, raising the number of channels from c/2 back to c in the channel order of the feature map matrices. Symmetrically to the enhancement above, a dimension-raising transformation is applied to the feature map, and in the up-dimensioned feature map the features that correlate only weakly with the support sample, i.e., the noise of the small target, are weakened. Finally, the up-dimensioned feature maps of dimensions w_s*h_s*c and w_q*h_q*c are added to G_w(x_s) and G_w(x_q) respectively to obtain v_s and v_q.
In particular, by adding the up-dimensioned w_s*h_s*c feature map to the G_w(x_s) feature map, the relevant corresponding features screened out in the up-dimensioned map (with the noise features of the support and query samples removed) are added into the G_w(x_s) feature map, so that the strongly correlated features of G_w(x_s) are further enhanced by the feature enhancement network while the weakly correlated noise features are weakened. This further improves the signal-to-noise ratio of the input feature maps between the support sample and the query sample; the enhanced feature maps are output and denoted v_s and v_q respectively.
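Putting S31-S35 together, a sketch of the feature enhancement network might look as follows. The dot-product similarity inside the softmax is a simplifying assumption of this sketch (the preferred radial basis kernel is given in the next paragraph), as is the direction of the softmax normalization; the channel count c is assumed even.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureEnhancement(nn.Module):
    """Sketch of the modified Non-Local cross-attention: PW (1x1) convolutions
    halve the channels, a matrix product plus softmax scores the pairwise
    similarity, and PW convolutions restore the channels before the residual
    additions that yield v_s and v_q."""
    def __init__(self, c: int):
        super().__init__()
        self.phi = nn.Conv2d(c, c // 2, 1)    # phi(G_w(x_s))
        self.g_s = nn.Conv2d(c, c // 2, 1)    # g(G_w(x_s))
        self.theta = nn.Conv2d(c, c // 2, 1)  # theta(G_w(x_q))
        self.g_q = nn.Conv2d(c, c // 2, 1)    # g(G_w(x_q))
        self.up_s = nn.Conv2d(c // 2, c, 1)   # channel up-dimensioning
        self.up_q = nn.Conv2d(c // 2, c, 1)

    def forward(self, gs: torch.Tensor, gq: torch.Tensor):
        b, c, hs, ws = gs.shape
        _, _, hq, wq = gq.shape
        phi = self.phi(gs).reshape(b, c // 2, hs * ws)      # (b, c/2, HsWs)
        theta = self.theta(gq).reshape(b, c // 2, hq * wq)  # (b, c/2, HqWq)
        att = torch.bmm(theta.transpose(1, 2), phi)         # (b, HqWq, HsWs)
        att_q = F.softmax(att, dim=-1)                      # scores over support
        att_s = F.softmax(att.transpose(1, 2), dim=-1)      # scores over query
        val_s = self.g_s(gs).reshape(b, c // 2, hs * ws)    # values from support
        val_q = self.g_q(gq).reshape(b, c // 2, hq * wq)    # values from query
        # aggregate cross features at every position, then reshape back to maps
        y_q = torch.bmm(val_s, att_q.transpose(1, 2)).reshape(b, c // 2, hq, wq)
        y_s = torch.bmm(val_q, att_s.transpose(1, 2)).reshape(b, c // 2, hs, ws)
        v_s = gs + self.up_s(y_s)                           # residual add, formula (1)
        v_q = gq + self.up_q(y_q)                           # residual add, formula (2)
        return v_s, v_q
```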
In a preferred embodiment of the invention, the similarity calculation function f(·,·) computes the similarity between two vectors using the radial basis function, as follows:

f(W_θ G_w(x_s)_i, W_φ G_w(x_q)_j) = exp(-||W_θ G_w(x_s)_i - W_φ G_w(x_q)_j||^2)    (4)

In formula (4), W_θ G_w(x_s)_i and W_φ G_w(x_q)_j each represent a column vector. Since the linear operators in the calculation are all matrix operations, they can be replaced by convolution networks, and the normalisation in the similarity calculation can be replaced by a softmax network. The computation of the improved Non-Local attention mechanism described above can therefore be replaced by the neural network shown in FIG. 5.
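Assuming the standard radial-basis form exp(-||a - b||^2) for formula (4), the similarity plus its softmax normalisation can be sketched as a small helper (a hypothetical function, not code from the patent):

```python
import torch

def rbf_similarity(support_vecs, query_vecs):
    # support_vecs: (n_s, d) column vectors from the support map,
    # query_vecs:   (n_q, d) column vectors from the query map.
    d2 = torch.cdist(query_vecs, support_vecs) ** 2  # squared Euclidean distances, (n_q, n_s)
    # exp(-d2) followed by row normalisation is exactly softmax(-d2):
    return torch.softmax(-d2, dim=-1)
```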
In a preferred embodiment of the present invention, the feature enhancement network may further comprise multiple groups of the improved Non-Local attention mechanism connected in cascade, performing multi-stage weighting of the feature vectors with respect to sample noise and small-target features so as to further enhance the sample features and attenuate the sample noise. The network finally outputs the suitably enhanced and/or weakened feature maps, from which the similar feature vectors are then calculated, which is not described in detail herein. Moreover, since the feature enhancement network of the preferred embodiment is an improvement on the Non-Local attention mechanism, the similarity measure function is a replaceable part: it may be replaced by prior-art measure functions including, but not limited to, the Euclidean norm and the cosine distance.
The effect of the preferred embodiment of the invention is that, by designing the feature enhancement network as an improved Non-Local attention mechanism, the feature-enhancement computation for the target to be detected can be expressed entirely in basic neural-network and matrix operations, which facilitates training and deployment of the whole model.
S4: the respectively enhanced and/or suppressed feature maps v_s and v_q are input to a feature matching network, which performs a similarity measurement and outputs the measurement result H(v_s, v_q). Specifically, the feature matching network is a network model designed in the preferred embodiment of the present invention for calculating similarity; it solves for the similarity between the support-sample feature map and the query-sample feature map. The inputs of the feature matching network are the feature maps v_s and v_q, of dimensions w_s*h_s*c and w_q*h_q*c respectively, and the output is the similarity between the two feature maps, denoted H(v_s, v_q). This step includes the following steps S41 to S43.
S41: the respectively enhanced and/or suppressed feature map v_s of dimension w_s*h_s*c and the respectively enhanced and/or suppressed feature map v_q of dimension w_q*h_q*c are input to the feature matching network. That is, the strongly correlated features in G_w(x_s) (and likewise in G_w(x_q)) have been further enhanced by the feature enhancement network, while the weakly correlated features have been further suppressed and weakened; the resulting feature maps v_s and v_q are what the feature matching network receives.
S42: the w_s*h_s vectors of dimension c×1 in v_s and the w_q*h_q vectors of dimension c×1 in v_q are combined in pairs, and the similarity of each pair is calculated according to the similarity calculation formula, yielding a similarity feature map of dimension w_q*h_q*(w_s h_s c). The similarity calculation formula is:

similarity(v_s,i, v_q,j) = (v_s,i - v_q,j)^2    (5)

where the indices i and j denote the i-th vector of v_s and the j-th vector of v_q.
Specifically, as shown in FIG. 5, v_s is split into (or treated as) w_s*h_s vectors of dimension c×1, and v_q is split into (or treated as) w_q*h_q vectors of dimension c×1. The vectors of v_s and v_q are combined in pairs, and the similarity of each pair is calculated in turn according to formula (5). Because the squared difference in formula (5) is taken element-wise, each similarity result is itself a vector of dimension c. As shown in FIG. 5, the similarity vectors so obtained are finally arranged in sequence into a similarity feature map of dimension w_q*h_q*(w_s h_s c).
S43: the similarity feature map of dimension w_q*h_q*(w_s h_s c) is subjected to grouped convolution to obtain a similarity feature map of dimension w_q*h_q*c; the convolution kernel size of the grouped convolution is 1×1, the stride is 1, and the number of groups is c. Specifically, as shown in FIG. 5, the grouped convolution further extracts and compresses the features in the similarity feature map, and the dimension of the feature map after convolution is w_q*h_q*c. At this point the main information contained in the convolution result is the similarity between the target to be detected in the query sample and the support sample, while the position information of the target to be detected may be lost.
S44: the similarity feature map of dimension w_q*h_q*c is spliced with the respectively enhanced and/or suppressed feature map v_q along the channel dimension; the final output of the feature matching network is the measurement result H(v_s, v_q), of dimension w_q*h_q*(2c). Specifically, as shown in FIG. 5, the feature map obtained by the convolution is spliced with v_q (of dimension w_q*h_q*c) by a concat function along the channel-number dimension, giving the final output of the feature matching network, namely H(v_s, v_q), a feature map of dimension w_q*h_q*(2c).
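The pairwise combination, grouped convolution and concatenation of steps S42 to S44 can be sketched as follows; the batch handling and the `group_conv` construction are illustrative assumptions rather than the patent's own code:

```python
import torch
import torch.nn as nn

def feature_match(v_s, v_q, group_conv):
    b, c, h_s, w_s = v_s.shape
    _, _, h_q, w_q = v_q.shape
    s = v_s.reshape(b, c, h_s * w_s)  # w_s*h_s vectors of dimension c
    q = v_q.reshape(b, c, h_q * w_q)  # w_q*h_q vectors of dimension c
    # Formula (5): element-wise squared difference for every (i, j) pair.
    diff2 = (q.unsqueeze(-1) - s.unsqueeze(-2)) ** 2          # (b, c, hq*wq, hs*ws)
    # Channel-major layout so that groups=c acts per original channel:
    sim_map = diff2.permute(0, 1, 3, 2).reshape(b, c * h_s * w_s, h_q, w_q)
    compressed = group_conv(sim_map)                          # (b, c, h_q, w_q)
    # Concatenate with v_q along the channel dimension -> w_q*h_q*(2c):
    return torch.cat([compressed, v_q], dim=1)

# group_conv could be built, for example, as:
# nn.Conv2d(c * h_s * w_s, c, kernel_size=1, stride=1, groups=c)
```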
The structure of the network in the preferred embodiment of the invention is shown in FIG. 6: the similarity feature map in FIG. 6 is the result of the similarity calculation, H(v_s, v_q) refers to the final output of the feature matching network, and w_q*h_q*(w_s h_s c) in FIG. 5 refers to the dimension of the similarity feature map. In the feature matching network of the preferred embodiment, the similarity measure function is a replaceable part: the technical solution includes but is not limited to distance-measure functions such as the Euclidean norm and the cosine distance, and the convolutional network used here may alternatively be a convolutional or fully-connected network with a different number of layers and different kernel sizes. In general, twin-network-based metric learning belongs to the category of single-sample learning methods, so the preferred embodiment improves the generalization performance of the single-sample-learning target detection model by learning the similarities and differences between corresponding features of the input samples through the improved twin-network metric-learning concept.
S5: the measurement result H(v_s, v_q) is input to a YOLO layer module for regression calculation, and the defect location and/or confidence in the query sample x_q is predicted based on the regression result. Specifically, the regression network adopts the YOLO-layer regression design of YOLOv3: H(v_s, v_q) is regressed by the convolution network in the YOLO layer to predict the location and confidence of defects in the query sample. When predicting the defect position in the query sample, CIOU is used as the loss function; with CIOU the mAP reaches 49.21%, an increase of 1.5 percentage points over GIOU. CIOU(D) means that the IOU is replaced by DIOU when evaluating the model's mAP, and the accuracy of the model still has some room for improvement. S5 includes steps S51 to S52.
S51: a defect-free normal sample x_s and a sample to be tested x_q are input to the detection model, and overlapping identification results among the positions and confidences output by the detection model are filtered. This step includes the following steps S511 to S512.

S511: when detecting surface defects of a texture, let x_s be a defect-free normal sample and let x_q be the sample to be tested. After processing by the model, the position and confidence of the defect in the sample to be tested are output.
The measurement result H(v_s, v_q) is input to the YOLO layer for regression calculation; a first error loss between the predicted defect position and the truth label is obtained based on the loss function CIOU, and the defect location in the query sample x_q is predicted based on the first error loss. When predicting the defect position, the CIOU of YOLOv5 is used as the loss function to solve the first error loss between the prediction result and the truth label. The CIOU target-frame loss function used in the preferred embodiment fully follows YOLOv5; as alternatives, the mean-square-error loss function, the IOU loss function, smooth-L1 or other target-frame loss functions may be used to predict the defect location, but the CIOU of YOLOv5 is more comprehensive in design. DIOU takes into account the centre distance between the two detection frames, while CIOU takes into account three geometric factors: (1) the overlap area; (2) the distance between the centre points; and (3) the aspect ratio. Comparative analysis shows that CIOU adds an aspect-ratio information parameter relative to DIOU; it therefore adds an aspect-ratio penalty term, a positive number that measures the consistency of the aspect ratio between the predicted box and the real box.
If the width and height of the real box and the prediction box are similar, the penalty term is 0 and has no effect. Intuitively, the penalty term therefore acts to drive the width and height of the prediction box as close as possible to those of the real box. The CIOU loss function can thus steer the width and height of the prediction box toward those of the real box quickly, reducing the number of box-selection iterations, determining the width and height of the prediction box quickly, simplifying the amount of computation and further improving the calculation speed.
In actual detection, CIOU finds a more appropriate box position than GIOU when framing the target, and when the detected target lies outside the box, the CIOU loss function marks the position of the detected target more accurately than the GIOU loss function. GIOU does mark the target completely when framing it, but it cannot frame the target's outline accurately; using CIOU as the loss function therefore gives a more appropriate detection-box position and accuracy, and can meet the requirements of high-accuracy industrial detection.
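For reference, the following is a sketch of the CIOU computation with its three geometric terms (overlap area, centre distance, aspect ratio), assuming (x1, y1, x2, y2) box coordinates; this is the standard CIOU formulation rather than code from the patent:

```python
import math
import torch

def ciou(pred, target, eps=1e-7):
    # Overlap-area term: intersection over union.
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Centre-distance term (as in DIOU): squared centre distance over the
    # squared diagonal of the smallest enclosing box.
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) ** 2 +
            (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) ** 2) / 4
    # Aspect-ratio penalty term v and its weight alpha; v is 0 when the
    # widths and heights match, so the penalty then has no effect.
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v  # the loss would be 1 - ciou
```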
When predicting the confidence, a second error loss between the predicted confidence of the defect position and the truth label is obtained based on a ternary loss function in the twin network, and the confidence of the defect location in the query sample x_q is predicted based on the second error loss. The ternary loss function is:

TripleLoss = Σ max((1 - y)·y′ + y·(m - y′), 0)

where y is the label, 0 indicating no defect and 1 indicating a defect; y′ is the confidence output by the YOLO layer, with value range [0, 1]; and m is the margin, with m = 1.
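Written out directly, the ternary confidence loss above is a one-line computation (a sketch, with y the 0/1 label tensor and y′ the YOLO-layer confidence):

```python
import torch

def triple_loss(y, y_prime, m=1.0):
    # y = 0 (no defect): loss = max(y', 0), pushing the confidence down;
    # y = 1 (defect):    loss = max(m - y', 0), pushing the confidence up to m.
    return torch.clamp((1 - y) * y_prime + y * (m - y_prime), min=0).sum()
```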
In the field of target detection, the cross-entropy loss function is used to judge how close the actual output is to the expected output: it measures the difference between two probability distributions of the same random variable, expressed in machine learning as the difference between the true and the predicted probability distribution. The smaller the cross entropy, the better the model's prediction, and the cross-entropy loss avoids the slow gradient optimization that arises when MSE is used in logistic regression. However, for positive samples, the larger the output probability, the smaller the loss, and for negative samples, the smaller the output probability, the smaller the loss; in the presence of a large number of simple samples, optimization is therefore slow and may not reach a satisfactory result.
The Focal-Loss function adds an adjusting factor and a focusing parameter on top of the balanced cross-entropy loss so as to focus the loss on hard samples. By down-weighting easily classified samples, it makes the model concentrate on hard-to-classify samples during training and corrects the imbalance between positive and negative samples. In general there are two approaches to this imbalance problem: designing a sampling strategy, i.e. resampling the minority samples, or designing the loss function, usually by assigning weights to the different sample classes, so as to address the severe imbalance of positive/negative and easy/hard samples in target detection. Focal-Loss reduces the weight of the large number of simple negative samples in training and has the advantage of hard-sample mining, but its training and prediction on simple samples is insufficient.
The ternary loss function Tripletloss minimizes the distance between an anchor point and a positive sample of the same identity while maximizing the distance between the anchor point and a negative sample of a different identity. The objective of Tripletloss is to keep features with the same label as close as possible in the feature space and features with different labels as far apart as possible, without letting the features of a sample collapse into a very small space: for two positive examples of the same class and one negative example, the negative example should be at least a margin farther away than the positive example. After Tripletloss learning, Positive samples of the same class move ever closer to the Anchor, while Negative samples of different classes move ever farther away.
After learning through the ternary loss function Tripletloss in the preferred embodiment of the invention, same-class Positive samples draw closer and closer to the Anchor and different-class Negative samples move farther and farther away. This both resolves the deviation in the optimization direction of the cross-entropy loss under class imbalance and avoids the bias of the Focal-Loss function toward hard-to-classify samples, which has the advantage of hard-sample mining but trains and predicts simple samples insufficiently. At the same time, adopting a ternary loss as the confidence loss function prevents features of the same label from dispersing in the feature space and features of different labels from aggregating, while also preventing the sample features from collapsing into a very small space.
The preferred embodiment of the present invention uses a ternary loss function as the confidence loss function; as alternatives, the cross-entropy loss function, the Focal-Loss function or other confidence loss functions commonly used in target detection or image classification may also be used to achieve confidence prediction, which will not be described in detail here.
S512: when predicting the defect location, the detection method further includes: filtering out overlapping identification results of defect positions using the non-maximum suppression algorithm (NMS), and outputting the location and confidence of the defect in the query sample x_q. Non-maximum suppression (NMS) is a widely used suppression method in the field of object detection.
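As an illustration of the suppression step, torchvision's built-in NMS can stand in for the overlap filtering; the boxes, scores and threshold below are made-up values:

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 50., 50.],    # box 0
                      [12., 12., 52., 52.],    # box 1, heavily overlaps box 0
                      [100., 100., 140., 140.]])  # box 2, disjoint
scores = torch.tensor([0.9, 0.6, 0.8])
keep = nms(boxes, scores, iou_threshold=0.5)  # indices of boxes kept after suppression
print(keep)  # tensor([0, 2]) - the lower-scoring overlapping box is filtered out
```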
In the YOLO-layer structure of the preferred embodiment of the present invention, the position and confidence of the target are predicted by regression of the deep feature map, using the design method and concept of the YOLO layer in YOLOv3. Compared with the currently available alternatives, which include but are not limited to the YOLO layers of the various object detection models of the YOLO series, the feature enhancement network and feature matching network designed in the preferred embodiment integrate the YOLO layer of YOLOv3 better under the twin-network framework. By regressing the processed deep feature map to predict the position and confidence of the target, the prediction method is simpler and more efficient, and the calculation method and concept of the YOLO layer in YOLOv3 are fused better under the twin-network framework.
The invention also provides a target detection device for single-sample learning, comprising the following modules. A sample acquisition module acquires a support sample x_s and a query sample x_q as the two input image samples, where the support sample x_s is a defect-free normal sample and the query sample x_q is the sample to be tested. A backbone network module comprises two backbone networks with the same structure and shared weights, through which features are extracted from the support sample x_s and the query sample x_q respectively to obtain the corresponding feature maps G_w(x_s) and G_w(x_q). A feature enhancement module receives the feature maps G_w(x_s) and G_w(x_q) and obtains the respectively enhanced and/or suppressed feature maps v_s and v_q. A feature matching module performs similarity measurement on the respectively enhanced and/or suppressed feature maps v_s and v_q and outputs the measurement result H(v_s, v_q). A YOLO layer module receives the measurement result H(v_s, v_q), performs regression calculation, and predicts the defect location and/or confidence in the query sample x_q based on the regression calculation result.
The backbone network module in the target detection device of the preferred embodiment combines a twin network with the backbone of YOLOv3; the YOLO-layer design of the preferred embodiment follows the design method of YOLOv3, but instead of the YOLOv3 backbone module it uses the more lightweight open-source backbone YOLO-Fastest. By combining the YOLO-Fastest backbone module with the YOLOv3 target detection model, the preferred embodiment realises defect-position detection with a more efficient target detection algorithm, achieving the technical effect of detecting the position of surface defects in real time.
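Weight sharing between the two backbones amounts to applying one module to both inputs; a minimal sketch follows, in which `build_yolo_fastest_backbone` is a hypothetical constructor standing in for the YOLO-Fastest backbone:

```python
import torch.nn as nn

class TwinBackbone(nn.Module):
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # one set of weights, applied twice

    def forward(self, x_s, x_q):
        # Returns G_w(x_s) and G_w(x_q) from the same shared-weight network.
        return self.backbone(x_s), self.backbone(x_q)

# twin = TwinBackbone(build_yolo_fastest_backbone())  # hypothetical constructor
```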
The single-sample-learning target detection device for defect detection of the invention adopts the designed feature enhancement module and feature matching module to perform feature enhancement and feature matching on the target features under the twin-network framework, combines YOLOv3 more effectively, and performs regression calculation in the fused YOLO layer. Owing to the twin-network structure, the input of the device is no longer a single image sample but a sample pair consisting of two image samples. In this way, the preferred embodiment can expand the number of training samples, thereby alleviating, to some extent, the problem of "small training data samples" in surface defect detection.
The preferred embodiment of the invention incorporates the concept of metric learning into an improved twin-network structure, improving the generalization performance of the model by learning the similarities and differences between corresponding features of the input samples; on this basis, the preferred embodiment provides a single-sample-learning target detection device for defect detection.
As shown in fig. 8, a preferred embodiment of the present invention also provides an electronic device, the object detection electronic device 800 comprising: a memory 801 for storing non-transitory computer readable instructions; and a processor 802 for executing the computer readable instructions, such that the computer readable instructions, when executed by the processor, implement the single sample learning object detection method described above.
The preferred embodiment of the present invention also provides a computer-readable storage medium, on which executable codes are stored, and when the executable codes are executed by a processor, the processor is enabled to implement the single sample learning object detection method.
Those skilled in the art should understand that this embodiment may also include well-known structures such as a communication bus and interfaces, and these well-known structures should also be included in the protection scope of the preferred embodiment of the present invention. For the detailed description and technical effects of this embodiment, reference may be made to the corresponding descriptions in the foregoing embodiments, which are not repeated here.
On one hand, the single-sample-learning target detection electronic equipment provided by the preferred embodiment replaces the backbone module of the standard YOLOv3 scheme with the lightweight YOLO-Fastest design; compared with classic YOLOv3 based on the DarkNet backbone, the backbone used here has fewer parameters, lower computational complexity and stronger real-time performance. On the other hand, by adopting YOLO-Fastest within the YOLOv3 target detection method, the preferred embodiment realises a single-stage single-sample-learning target detection model more effectively, simplifying its complexity and reducing its computation compared with existing two-stage single-sample-learning models. The electronic equipment for single-sample-learning target detection also improves the model's metric-learning capability by adding the feature matching module, thereby further enhancing the generalization performance of the method.
The preferred embodiment of the present invention also provides a computer-readable storage medium, which includes computer instructions, when the computer instructions are run on a device, the device executes the above-mentioned target detection method of single sample learning.
The apparatus, the device, the computer-readable storage medium, and the computer program product or the chip provided by the preferred embodiment of the present invention are all configured to execute the corresponding methods provided above, so that the beneficial effects achieved by the apparatus, the device, the computer-readable storage medium, and the computer program product or the chip can refer to the beneficial effects in the corresponding methods provided above, and are not described herein again.
The invention also provides an application of the single-sample-learning target detection method to texture surface defect detection. In this application, the target detection model is first trained by deep learning with the single-sample-learning target detection model for texture surface defect detection, the texture surface defects are then identified on the basis of this learning, and the identification results are comparatively analysed; the evaluation results of the target detection model of the preferred embodiment are as follows.
In this embodiment, classical YOLO is compared with the preferred embodiment of the present invention, with YOLO-Fastest also used as the backbone network; the comparison of detection performance is shown in FIGS. 2 and 3 and Table 1. "Seen Classes" denotes classes that appear in the training data set and belong to the verification set; "Unseen Classes" denotes classes that do not appear in the training data set and belong to the test set. Here, the Seen texture classes appearing in the training set are 2, 3, 4, 6, 7, 8 and 9, and the Unseen texture classes not appearing in the training set are 1, 5 and 10. Since the 10 classes were randomly split into Seen and Unseen at a 7:3 ratio, the order of the class numbers is scrambled.
TABLE 1 (reproduced in the original as an image; it reports the detection performance of YOLOv3-YOLO-Fastest and of the preferred embodiment on the Seen and Unseen classes)
In the preferred embodiment of the present invention, YOLOv3-YOLO-Fastest has 3032 training sample images; the training sample pairs of the preferred embodiment are obtained by repeatedly pairing these images two by two, giving 90120 pairs. From the detection results for Seen, the target detection performance of the preferred embodiment improves over YOLOv3-YOLO-Fastest in all 7 randomly selected categories. These results demonstrate that the detection method of the preferred embodiment alleviates, to a certain extent, the difficulty of "small-sample training data" in the field of texture surface defect detection compared with the YOLOv3-YOLO-Fastest detection method.
From the Unseen results, the generalization ability of the preferred embodiment is stronger than that of classical YOLOv3; from the Seen results, the preferred embodiment learns better on the small-sample training data set. That is, the method of the preferred embodiment outperforms classical YOLOv3 on both the verification set and the test set. FIG. 7 shows the detection results of classical YOLOv3 and the preferred embodiment on test-set samples: the preferred embodiment can perform effective defect detection even on texture classes that do not appear in the training set, whereas classical YOLOv3 exhibits missed detections and false detections, and its accuracy against the ground truth cannot meet the requirements of high-precision industrial target detection. These detection results prove that the preferred embodiment, while detecting the surface-defect positions, effectively addresses the two problems of "small-sample training data" and "low generalization performance" described in the technical background. As shown in Table 1, the preferred embodiment improves the generalization performance of target detection, which in turn effectively avoids missed and false detections.
In this embodiment, detecting a defect sample on a GTX 1660S takes about 20 ms, whereas classical YOLOv3 based on DarkNet needs about 200 ms. This result demonstrates that the preferred embodiment effectively solves the problem of "poor real-time performance" existing in the background art.
The preferred embodiment of the present invention differs from the prior-art solutions as follows.
(1) Existing technical schemes for detecting texture surface defects can detect the positions of defects, and their calculation speed may be high enough; however, their generalization performance is very low, and they cannot detect surface defects on different types of textures.
(2) Other existing technical schemes for detecting texture surface defects can classify whether an image has defects, can learn effectively even with a small number of defective image samples, and have stronger generalization performance and higher real-time performance, for example classification models improved on the basis of a twin network (VGG16, ResNet50, MobileNetv3, etc.). However, these schemes can only classify images: they either cannot obtain the specific position of the defect, or they obtain it by a method, such as a sliding window, whose performance is outdated and whose computational complexity is high.
Although the preferred embodiments of the present invention have been disclosed in the foregoing description, the present invention should not be construed as limited to the particular embodiments set forth herein, and it will be understood by those skilled in the art that various changes and modifications can be made without departing from the scope of the invention.

Claims (12)

1. A target detection method, characterized by comprising the following steps:

obtaining a support sample x_s and a query sample x_q as two input image samples, wherein the support sample x_s is a defect-free normal sample and the query sample x_q is the sample to be tested;

extracting features from the support sample x_s and the query sample x_q respectively by two backbone networks with the same structure and shared weights to obtain the corresponding feature maps G_w(x_s) and G_w(x_q);

inputting the feature maps G_w(x_s) and G_w(x_q) into the feature enhancement network to obtain the respectively enhanced and/or suppressed feature maps v_s and v_q;

performing similarity measurement on the respectively enhanced and/or suppressed feature maps v_s and v_q based on a feature matching network, and outputting the measurement result H(v_s, v_q);

inputting the measurement result H(v_s, v_q) into a YOLO layer module for regression calculation, and predicting the defect location and/or confidence in the query sample x_q based on the regression calculation result.
2. The method of claim 1, wherein extracting features from the support sample x_s and the query sample x_q respectively by the two backbone networks with the same structure and shared weights to obtain the corresponding feature maps G_w(x_s) and G_w(x_q) comprises:

forming the backbone network from YOLO-Fastest, denoted G_w;

inputting the support sample x_s and the query sample x_q respectively into the input end of the backbone network for feature extraction;

obtaining the corresponding feature maps G_w(x_s) and G_w(x_q).
3. The object detection method according to claim 1, wherein inputting the feature maps G_w(x_s) and G_w(x_q) into the feature enhancement network to obtain the respectively enhanced and/or suppressed feature maps v_s and v_q comprises:

constructing the feature enhancement network based on an improved Non-Local attention mechanism;

inputting the two feature maps G_w(x_s) and G_w(x_q) into the input of the feature enhancement network;

mutually enhancing the features with stronger correlation, and mutually suppressing the features with weaker similarity;

outputting the respectively enhanced and/or suppressed feature maps v_s and v_q through the feature enhancement network;

wherein the feature map G_w(x_s) has a dimension of w_s*h_s*c and G_w(x_q) has a dimension of w_q*h_q*c; h_s and h_q denote the height dimensions of the feature maps G_w(x_s) and G_w(x_q), w_s and w_q denote their width dimensions, and c is the number of channels of the feature maps G_w(x_s) and G_w(x_q).
4. The method of claim 3, wherein the mutually enhancing the features with stronger correlation and the mutually suppressing the features with weaker similarity comprises:

convolving the feature map G_w(x_s) by two point-by-point convolution networks so that the spatial size of G_w(x_s) is unchanged and the number of channels is halved, the results being denoted g(G_w(x_s)) and φ(G_w(x_s)) respectively;

convolving the feature map G_w(x_q) by two point-by-point convolution networks so that the spatial size of G_w(x_q) is unchanged and the number of channels is halved, the results being denoted g(G_w(x_q)) and θ(G_w(x_q)) respectively;

reconstructing φ(G_w(x_s)) and θ(G_w(x_q)) respectively into two-dimensional matrices; performing matrix multiplication on the two matrices to obtain a matrix of dimension w_q h_q × w_s h_s; inputting the matrix of dimension w_q h_q × w_s h_s into a network layer formed by a softmax function to complete the similarity calculation, and outputting a matrix of dimension w_q h_q × w_s h_s;

reconstructing g(G_w(x_s)) and g(G_w(x_q)) into matrices; performing matrix multiplication of each with the output of the softmax function to obtain two matrices of dimension w_q h_q × (c/2) and w_s h_s × (c/2); reconstructing the two matrices into two feature maps of dimension w_q*h_q*(c/2) and w_s*h_s*(c/2) respectively;

inputting the two feature maps of dimension w_q*h_q*(c/2) and w_s*h_s*(c/2) respectively into two point-by-point convolution networks for convolution, raising the channel dimension back to c; adding the channel-raised results to G_w(x_s) and G_w(x_q) respectively to obtain v_s and v_q:

v_s,i = G_w(x_s)_i + W_ψ Σ_j f(W_θ G_w(x_s)_i, W_φ G_w(x_q)_j) · g(G_w(x_q))_j    (1)

v_q,j = G_w(x_q)_j + W_ψ Σ_i f(W_θ G_w(x_s)_i, W_φ G_w(x_q)_j) · g(G_w(x_s))_i    (2)

wherein W_θ, W_φ, W_ψ and the weights of g(·) are all coefficients of linear transformations; i and j denote the i-th element of W_θ G_w(x_s) and the j-th element of W_φ G_w(x_q); f(·,·) denotes the similarity calculation function of the feature maps; v_s has a dimension of w_s*h_s*c and v_q has a dimension of w_q*h_q*c.
5. The object detection method of claim 4, wherein the similarity calculation function f(·,·) calculates the similarity between two vectors using the radial basis function, as follows:

f(W_θ G_w(x_s)_i, W_φ G_w(x_q)_j) = exp(-||W_θ G_w(x_s)_i - W_φ G_w(x_q)_j||^2)    (4)

wherein W_θ G_w(x_s)_i and W_φ G_w(x_q)_j each represent a column vector.
6. The method according to claim 5, wherein performing similarity measurement on the respectively enhanced and/or suppressed feature maps v_s and v_q based on the feature matching network and outputting the measurement result H(v_s, v_q) comprises:

inputting the respectively enhanced and/or suppressed feature map v_s of dimension w_s*h_s*c and the respectively enhanced and/or suppressed feature map v_q of dimension w_q*h_q*c into the feature matching network;

combining the w_s*h_s vectors of dimension c×1 in v_s and the w_q*h_q vectors of dimension c×1 in v_q in pairs and calculating the similarity according to the similarity calculation formula, obtaining a similarity feature map of dimension w_q*h_q*(w_s h_s c); the similarity calculation formula being:

similarity(v_s,i, v_q,j) = (v_s,i - v_q,j)^2

wherein the indices i and j denote the i-th vector of v_s and the j-th vector of v_q;

subjecting the similarity feature map of dimension w_q*h_q*(w_s h_s c) to grouped convolution to obtain a similarity feature map of dimension w_q*h_q*c, wherein the convolution kernel size of the grouped convolution is 1×1 and the stride is 1;

splicing the similarity feature map of dimension w_q*h_q*c with the respectively enhanced and/or suppressed feature map v_q along the channel dimension;

the final output of the feature matching network being the measurement result H(v_s, v_q) of dimension w_q*h_q*(2c).
7. The method of claim 6, wherein inputting the measurement result H(v_s, v_q) into the YOLO layer for regression calculation and predicting the defect location and/or confidence in the query sample x_q based on the regression calculation result comprises:

inputting the measurement result H(v_s, v_q) into the YOLO layer for regression calculation;

obtaining a first error loss between the predicted defect position and the truth label based on the loss function CIOU, and predicting the defect location in the query sample x_q based on the first error loss; and/or

when predicting the confidence, obtaining a second error loss between the predicted confidence of the defect position and the truth label based on a ternary loss function in the twin network, and predicting the confidence of the defect location in the query sample x_q based on the second error loss;

the ternary loss function being:

TripleLoss = Σ max((1 - y)·y′ + y·(m - y′), 0)

wherein y is the label, 0 indicating no defect and 1 indicating a defect; y′ is the confidence output by the YOLO layer, with value range [0, 1]; and m is the margin, with m = 1.
8. The object detection method of claim 7, wherein in predicting the defect location, the detection method further comprises: filtering out overlapping identification results of defect positions using the non-maximum suppression algorithm (NMS), and outputting the location and confidence of the defect in the query sample x_q.
9. An object detection device, comprising the following modules:

a sample acquisition module for acquiring a support sample x_s and a query sample x_q as two input image samples, wherein the support sample x_s is a defect-free normal sample and the query sample x_q is the sample to be tested;

a backbone network module composed of two backbone networks with the same structure and shared weights, the backbone networks extracting features from the support sample x_s and the query sample x_q respectively to obtain the corresponding feature maps G_w(x_s) and G_w(x_q);

a feature enhancement module for receiving the feature maps G_w(x_s) and G_w(x_q) and obtaining the respectively enhanced and/or suppressed feature maps v_s and v_q;

a feature matching module for performing similarity measurement on the respectively enhanced and/or suppressed feature maps v_s and v_q and outputting the measurement result H(v_s, v_q);

a YOLO layer module for receiving the measurement result H(v_s, v_q), performing regression calculation, and predicting the defect location and/or confidence in the query sample x_q based on the regression calculation result.
10. An electronic device, comprising:
a memory for storing non-transitory computer readable instructions; and
a processor for executing the computer readable instructions such that the computer readable instructions, when executed by the processor, implement the object detection method of any one of claims 1 to 8.
11. A computer readable storage medium comprising computer instructions which, when run on a device, cause the device to perform the object detection method of any one of claims 1 to 8.
12. Use of the object detection method of any one of claims 1 to 8 in texture-like surface defect detection.
CN202210555245.5A 2022-05-20 2022-05-20 Target detection method and device, electronic equipment, storage medium and application thereof Pending CN115564983A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210555245.5A CN115564983A (en) 2022-05-20 2022-05-20 Target detection method and device, electronic equipment, storage medium and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210555245.5A CN115564983A (en) 2022-05-20 2022-05-20 Target detection method and device, electronic equipment, storage medium and application thereof

Publications (1)

Publication Number Publication Date
CN115564983A true CN115564983A (en) 2023-01-03

Family

ID=84736524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210555245.5A Pending CN115564983A (en) 2022-05-20 2022-05-20 Target detection method and device, electronic equipment, storage medium and application thereof

Country Status (1)

Country Link
CN (1) CN115564983A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797788A (en) * 2023-02-17 2023-03-14 武汉大学 Multimodal railway design element remote sensing feature extraction method based on deep learning
CN115797788B (en) * 2023-02-17 2023-04-14 武汉大学 Multimodal railway design element remote sensing feature extraction method based on deep learning
CN116991459A (en) * 2023-08-18 2023-11-03 中南大学 Software multi-defect information prediction method and system
CN116991459B (en) * 2023-08-18 2024-04-26 中南大学 Software multi-defect information prediction method and system
CN117670882A (en) * 2024-01-31 2024-03-08 国网江西省电力有限公司电力科学研究院 Unmanned aerial vehicle infrared automatic focusing method and system for porcelain insulator string

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN107529650B (en) Closed loop detection method and device and computer equipment
CN115564983A (en) Target detection method and device, electronic equipment, storage medium and application thereof
CN112364931B (en) Few-sample target detection method and network system based on meta-feature and weight adjustment
US11585918B2 (en) Generative adversarial network-based target identification
CN114972213A (en) Two-stage mainboard image defect detection and positioning method based on machine vision
CN111783841A (en) Garbage classification method, system and medium based on transfer learning and model fusion
CN113628211B (en) Parameter prediction recommendation method, device and computer readable storage medium
WO2024032010A1 (en) Transfer learning strategy-based real-time few-shot object detection method
WO2024078112A1 (en) Method for intelligent recognition of ship outfitting items, and computer device
CN113160117A (en) Three-dimensional point cloud target detection method under automatic driving scene
CN116597275A (en) High-speed moving target recognition method based on data enhancement
CN116543433A (en) Mask wearing detection method and device based on improved YOLOv7 model
Liu et al. Tiny electronic component detection based on deep learning
Reis et al. Dense crowd counting with capsule networks
Hu et al. Learning to detect saliency with deep structure
CN116503603B (en) Training method of inter-class shielding target detection network model based on weak supervision semantic segmentation and feature compensation
CN113609904B (en) Single-target tracking algorithm based on dynamic global information modeling and twin network
CN117576381B (en) Target detection training method, electronic device and computer readable storage medium
Ma et al. Research on apple target detection algorithm based on improved YOLOv3
CN116721295A (en) Small sample image classification method based on deep measurement learning
WO2024102565A1 (en) System and method for joint detection, localization, segmentation and classification of anomalies in images
Hu et al. Minority Clothing Recognition based on Improved DenseNet
Liu et al. Advance of Deep Learning
CN116844066A (en) Remote sensing image classification recognition method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination