
CN112967251B - Picture detection method, training method and device of picture detection model

Info

Publication number: CN112967251B
Application number: CN202110239211.0A
Authority: CN
Inventors: 唐吉霖; 袁燚; 范长杰; 胡志鹏
Original assignee: Netease Hangzhou Network Co Ltd
Current assignee: Netease Hangzhou Network Co Ltd
Priority date: 2021-03-03
Filing date: 2021-03-03
Publication date: 2024-06-04
Anticipated expiration: 2041-03-03
Also published as: CN112967251A

Abstract

The invention provides a picture detection method, a training method for a picture detection model, and devices thereof, relating to the technical field of picture detection. The picture detection method comprises the following steps: acquiring a picture to be detected; inputting the picture to be detected into a pre-established picture detection model so as to output, through the picture detection model, a reconstructed picture corresponding to the picture to be detected; calculating the reconstruction error between the picture to be detected and the reconstructed picture; and detecting the picture based on the reconstruction error. Because the picture detection model is configured with a self-encoder network structure having a multi-level feature space and a distribution constraint, the complexity of the mapping from an input picture to the feature space is increased, so that an abnormal sample is more likely to fall outside the feature space of normal samples, confusion between normal and abnormal samples is reduced, and the accuracy of anomaly detection is improved.

Description

Picture detection method, training method and device of picture detection model

Technical Field

The present invention relates to the field of image detection technologies, and in particular, to an image detection method, a training method for an image detection model, and an apparatus thereof.

Background

Abnormal picture detection aims at identifying and detecting abnormal picture samples in a given picture set that do not conform to an expected pattern or are different from other normal pictures in the dataset. In practical application, there are countless different situations in abnormal pictures, and each possible situation cannot be accurately and comprehensively predicted in advance.

In addition, most samples in a data set are normal, abnormal samples are comparatively rare, the ratio of positive to negative samples is severely imbalanced, and abnormal samples are costly to obtain. Thus, unlike typical computer vision tasks that can be solved in a supervised manner with a large amount of labeled training data, abnormal picture detection tasks are usually solved in an unsupervised manner.

In recent years, unsupervised abnormal picture detection has mainly relied on picture reconstruction-based methods. However, such methods generally consider the feature representation of a given picture sample at only a single level and do not fully consider the information at different levels of the same picture. As a result, it is difficult to effectively distinguish normal samples from abnormal samples in a single-level feature space; the two are easily confused, misjudgments occur, and the accuracy of anomaly detection is reduced.

Disclosure of Invention

Accordingly, the present invention is directed to a picture detection method, a training method for a picture detection model, and corresponding devices, so as to alleviate the above technical problems.

In a first aspect, an embodiment of the present invention provides a method for detecting a picture, including: acquiring a picture to be detected; inputting the picture to be detected into a pre-established picture detection model so as to output a reconstructed picture corresponding to the picture to be detected through the picture detection model; the picture detection model is configured with a self-encoder network structure with multi-level characteristic space and distribution constraint; calculating the reconstruction errors of the picture to be detected and the reconstructed picture; and detecting the picture based on the reconstruction error.

Preferably, in one possible implementation manner, the self-encoder network structure with a multi-level feature space and a distribution constraint includes a multi-level self-encoder structure and a multi-level decoder structure, wherein the multi-level self-encoder structure includes the multi-level feature space; the step of inputting the picture to be detected into a pre-established picture detection model to output a reconstructed picture corresponding to the picture to be detected through the picture detection model comprises the following steps: inputting the picture to be detected into the multi-level self-encoder structure, and embedding the picture to be detected into the multi-level feature space through the multi-level self-encoder structure to obtain the feature distribution of the picture to be detected; performing feature fitting between the feature distribution of the picture to be detected and the prior distribution pre-configured in the multi-level feature space; and inputting the feature fitting result into the multi-level decoder structure for picture reconstruction so as to generate a reconstructed picture corresponding to the picture to be detected.

Preferably, in one possible implementation manner, the multi-level self-encoder structure is an encoder structure based on a convolutional neural network, and the multi-level self-encoder structure includes at least one convolution structure, wherein the convolution structure includes a preset convolution layer and an activation layer, and the convolution layer is provided with a preset convolution kernel and a preset step size (stride); the step of inputting the picture to be detected into the multi-level self-encoder structure and embedding the picture to be detected into the multi-level feature space through the multi-level self-encoder structure to obtain the feature distribution of the picture to be detected comprises the following steps: inputting the picture to be detected into the multi-level self-encoder structure, and obtaining the hierarchical feature information of the picture to be detected through the at least one convolution structure; and embedding the hierarchical feature information into the multi-level feature space to obtain the feature distribution of the picture to be detected, wherein the feature distribution comprises a low-order representation of the picture to be detected and a high-order representation of the picture to be detected.

Preferably, in a possible implementation manner, the multi-level feature space further comprises a high-order feature discriminator and a low-order feature discriminator; the step of performing feature fitting between the feature distribution of the picture to be detected and the prior distribution pre-configured in the multi-level feature space comprises the following steps: inputting the low-order representation included in the feature distribution of the picture to be detected to the low-order feature discriminator, and inputting the high-order representation included in the feature distribution of the picture to be detected to the high-order feature discriminator; and fitting the feature distribution of the picture to be detected through the high-order feature discriminator, the low-order feature discriminator, and the prior distribution pre-configured in the multi-level feature space.

Preferably, in one possible implementation manner, the step of calculating the reconstruction error between the picture to be detected and the reconstructed picture includes: calculating a first norm distance between the picture to be detected and the reconstructed picture, and a second norm distance between the picture to be detected and the reconstructed picture in the multi-level feature space; and calculating the reconstruction error between the picture to be detected and the reconstructed picture based on the first norm distance and the second norm distance.

Preferably, in one possible implementation manner, the first norm distance and the second norm distance are each, for example, an L2 norm; the reconstruction error is expressed as:

Wherein E_i(·) represents the feature representation obtained from the feature map of the i-th layer of the multi-level self-encoder structure in the self-encoder network structure, and α and β represent weight hyperparameters.
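The expression itself is not reproduced in this text. As a minimal sketch only, assuming the error is a weighted sum of the pixel-space L2 distance and the per-layer feature-space L2 distances, i.e. α·‖x − x'‖ + β·Σ_i ‖E_i(x) − E_i(x')‖ (an assumption consistent with the definitions of E_i, α and β above, not a confirmed form of the claimed expression), the computation could look as follows. The names `l2_distance`, `reconstruction_error` and `feature_extractors` are hypothetical.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean (L2) distance between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def reconstruction_error(
    x: Vector,
    x_rec: Vector,
    feature_extractors: Sequence[Callable[[Vector], Vector]],
    alpha: float = 1.0,
    beta: float = 1.0,
) -> float:
    """Assumed form: alpha*||x - x_rec|| + beta*sum_i ||E_i(x) - E_i(x_rec)||."""
    # First norm distance: picture vs. reconstruction in pixel space.
    pixel_term = l2_distance(x, x_rec)
    # Second norm distance: the same comparison in each feature level E_i.
    feature_term = sum(
        l2_distance(extract(x), extract(x_rec)) for extract in feature_extractors
    )
    return alpha * pixel_term + beta * feature_term


def double(v: Vector) -> list:
    """Hypothetical stand-in for one E_i: doubles every value."""
    return [2.0 * t for t in v]


score = reconstruction_error([0.0, 0.0], [3.0, 4.0], [double], alpha=1.0, beta=0.5)
```

In this toy call the pixel term is 5 and the single feature term is 10, so the weights α = 1 and β = 0.5 simply balance the two contributions.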

Preferably, in one possible implementation manner, the step of detecting the picture based on the reconstruction error includes: judging whether the reconstruction error is smaller than a preset error threshold value or not; if yes, determining the picture to be detected as a normal picture; and if not, determining the picture to be detected as an abnormal picture.

In a second aspect, the embodiment of the present invention further provides a training method for a picture detection model, where the picture detection model is configured with a self-encoder network structure with a multi-level feature space and a distribution constraint; the self-encoder network structure comprises a multi-level self-encoder structure and a multi-level decoder structure, wherein the multi-level self-encoder structure comprises a multi-level feature space; the method comprises the following steps: acquiring a pre-established picture set, wherein the picture set comprises a preset number of normal image samples; inputting the normal image sample into a picture detection model to be trained, and carrying out picture reconstruction on the normal image sample to obtain a reconstructed image sample corresponding to the normal image sample; calculating a loss function between the normal image sample and the reconstructed image sample; and carrying out iterative optimization on the model parameters of the picture detection model to be trained based on the loss function so as to train the picture detection model to be trained.

Preferably, in one possible embodiment, the method further comprises: recording the number of iterations of the iterative optimization; ending the iterative optimization if the number of iterations reaches a preset iteration threshold; and saving the current model parameters of the picture detection model to be trained.
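The iteration-count stopping rule and parameter saving described above can be sketched as follows. This is an illustration under simplifying assumptions, not the training procedure of the embodiment: the "model" is a single hypothetical scalar weight `w` that reconstructs a sample `x` as `w * x`, the loss is the mean squared reconstruction error, and the JSON file name is made up for the example.

```python
import json
import os
import tempfile
from typing import Dict, List


def train(samples: List[float], max_iterations: int = 200, lr: float = 0.1) -> Dict[str, float]:
    """Gradient descent that stops once the preset iteration threshold is reached."""
    w = 0.0          # hypothetical single model parameter
    iterations = 0   # recorded number of optimization iterations
    while iterations < max_iterations:
        # Gradient of mean((w*x - x)^2) with respect to w.
        grad = sum(2.0 * (w * x - x) * x for x in samples) / len(samples)
        w -= lr * grad
        iterations += 1
    return {"w": w, "iterations": iterations}


def save_parameters(params: Dict[str, float], path: str) -> None:
    """Persist the current model parameters as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(params, handle)


params = train([0.5, 1.0, 1.5])
checkpoint_path = os.path.join(tempfile.gettempdir(), "picture_model_params.json")
save_parameters(params, checkpoint_path)
```

A fixed iteration budget is the simplest stopping rule; it makes training time predictable but does not check whether the loss has actually converged.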

In a third aspect, an embodiment of the present invention further provides a picture detection apparatus, including: the first acquisition module is used for acquiring a picture to be detected; the reconstruction module is used for inputting the picture to be detected into a pre-established picture detection model so as to output a reconstructed picture corresponding to the picture to be detected through the picture detection model; the picture detection model is configured with a self-encoder network structure with multi-level characteristic space and distribution constraint; the first calculation module is used for calculating the reconstruction errors of the picture to be detected and the reconstructed picture; and the detection module is used for detecting the picture based on the reconstruction error.

In a fourth aspect, the embodiment of the present invention further provides a training device for a picture detection model, where the picture detection model is configured with a self-encoder network structure with multi-level feature space and distribution constraints; the multi-level self-encoder network structure comprises a multi-level self-encoder structure and a multi-level decoder structure, wherein the multi-level self-encoder structure comprises a multi-level feature space; the device comprises: the second acquisition module is used for acquiring a pre-established picture set, wherein the picture set comprises a preset number of normal image samples; the input module is used for inputting the normal image sample into a picture detection model to be trained, carrying out picture reconstruction on the normal image sample, and obtaining a reconstructed image sample corresponding to the normal image sample; a second calculation module for calculating a loss function of the normal image sample and the reconstructed image sample; and the optimization module is used for carrying out iterative optimization on the model parameters of the picture detection model to be trained based on the loss function so as to train the picture detection model to be trained.

In a fifth aspect, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the methods described in the first and second aspects when the processor executes the computer program.

In a sixth aspect, embodiments of the present invention also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the methods of the first and second aspects described above.

The embodiment of the invention has the following beneficial effects:

According to the picture detection method, the training method for the picture detection model, and the devices provided by the embodiments of the invention, when picture detection is performed, a picture to be detected is acquired and input into a pre-established picture detection model, a reconstructed picture corresponding to the picture to be detected is output through the picture detection model, and the reconstruction error between the picture to be detected and the reconstructed picture is calculated. Because the picture detection model is configured with a self-encoder network structure having a multi-level feature space and a distribution constraint, an input picture sample can be embedded into the multi-level feature space during detection, so that different picture samples are encoded and distinguished more accurately. This increases the complexity of the mapping from the input picture to the feature space, makes abnormal samples more likely to fall outside the feature space of normal samples, reduces confusion between normal and abnormal samples, and improves the accuracy of anomaly detection.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are some embodiments of the invention and that other drawings may be obtained from these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of unsupervised abnormal picture detection;

FIG. 2 is a flowchart of a picture detection method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a process of detecting a picture according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a self-encoder network structure according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a structure of a discriminator according to an embodiment of the present invention;

FIG. 6 is a flowchart of a training method of a picture detection model according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a picture detection apparatus according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a training device for a picture detection model according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In recent years, unsupervised abnormal picture detection has mainly relied on picture reconstruction-based methods. Such methods use an overall encoder-decoder structure (such as a self-encoder, i.e. an autoencoder, or a GAN (Generative Adversarial Network)) to learn a parameterized feature embedding and picture reconstruction of normal picture samples, and detect abnormal samples by calculating and comparing the reconstruction errors between input pictures and reconstructed pictures.

For ease of understanding, fig. 1 shows a schematic diagram of unsupervised abnormal picture detection, which mainly comprises two parts, an encoder network and a decoder network, namely the encoder and decoder shown in fig. 1.

Specifically, the encoder network mainly receives a given picture sample to be detected (i.e., an input picture in fig. 1) as input, and calculates to obtain a characteristic representation of the sample through a convolutional neural network; the decoder network takes the characteristic representation of the picture sample output by the encoder as input, and uses a convolutional neural network to reconstruct and generate the input sample to be detected. In the training process, the model shown in fig. 1 can only observe normal samples, and the weight parameters of the model are iteratively updated and optimized through a gradient descent algorithm to minimize the loss function between the input normal samples and the reconstructed normal samples, so that the reconstruction quality of the normal samples is continuously improved, the reconstruction error of the normal samples is reduced, and the expected reconstruction model is finally obtained.

Specifically, the loss function of the model defines the degree of similarity between the reconstructed normal picture and the input normal picture, and the smaller the loss function, the more similar the two images are. Given a trained model, in the test process, because the model is trained only with normal samples as input, the parameters of the model cannot accurately reconstruct abnormal samples, so that the abnormal samples will generate higher reconstruction errors, and the normal samples will have lower reconstruction errors. Thus, by utilizing the difference, the calculated reconstruction error can be compared with a preset threshold value to detect and distinguish abnormal samples in the image collection.
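The effect described above (a model trained only on normal samples reconstructs abnormal samples poorly) can be illustrated with a deliberately tiny example. The sketch below is not the network of fig. 1: it assumes a hypothetical tied-weight linear self-encoder on 2-D points, where the encoder maps a point `x` to the scalar code `s = w·x` and the decoder maps it back to `s * w`, trained by plain gradient descent on normal points that all lie on the line y = x.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def reconstruct(w: List[float], x: Point) -> List[float]:
    """Encode x to the scalar code s = w.x, then decode it back to s * w."""
    s = w[0] * x[0] + w[1] * x[1]
    return [s * w[0], s * w[1]]


def error(w: List[float], x: Point) -> float:
    """L2 reconstruction error of a single point."""
    x_hat = reconstruct(w, x)
    return math.hypot(x[0] - x_hat[0], x[1] - x_hat[1])


def train(normal: List[Point], steps: int = 500, lr: float = 0.05) -> List[float]:
    """Minimize the mean squared reconstruction error over normal samples only."""
    w = [1.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x in normal:
            s = w[0] * x[0] + w[1] * x[1]
            r = [x[0] - s * w[0], x[1] - s * w[1]]      # residual x - x_hat
            rw = r[0] * w[0] + r[1] * w[1]
            for k in range(2):
                # d/dw_k of ||x - (w.x) w||^2
                grad[k] += -2.0 * (s * r[k] + rw * x[k]) / len(normal)
        w = [w[k] - lr * grad[k] for k in range(2)]
    return w


normal_samples = [(t, t) for t in (-1.0, -0.5, 0.5, 1.0)]
w = train(normal_samples)
normal_error = error(w, (0.8, 0.8))      # lies on the learned line
abnormal_error = error(w, (0.8, -0.8))   # orthogonal to the learned line
```

Because the code has only one dimension, the trained model can represent points along the direction of the normal data and nothing else, so the off-line point is reconstructed with a large error while the on-line point is reconstructed almost exactly.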

However, the above method generally considers the feature representation of a given picture sample at only a single level and does not fully consider the information at different levels of the same picture, such as low-level structural information and high-level category semantic information. Consequently, normal and abnormal samples cannot be effectively distinguished in a single-level feature space; they are easily confused, misjudgments occur, and the accuracy of anomaly detection is reduced. Moreover, the method places no explicit prior constraint on the distribution of the feature representations of normal samples, so the representations have too many degrees of freedom, a compact feature space is difficult to encode, and the generalization capability of the decoder is difficult to limit effectively.

Based on the above, the image detection method, the training method of the image detection model and the training device of the image detection model provided by the embodiment of the invention can effectively alleviate the problems.

For the sake of understanding the present embodiment, first, a detailed description is given of a picture detection method disclosed in the present embodiment.

In a possible implementation manner, the embodiment of the present invention provides a picture detection method, and fig. 2 shows a flowchart of the picture detection method, specifically, as shown in fig. 2, the method includes the following steps:

Step S202, obtaining a picture to be detected;

Step S204, inputting the picture to be detected into a pre-established picture detection model so as to output a reconstructed picture corresponding to the picture to be detected through the picture detection model;

Wherein, the picture detection model is configured with a self-encoder network structure with multi-level characteristic space and distribution constraint;

Specifically, the above picture detection model is usually a detection model implemented based on a neural network model. Because the picture detection model used in the embodiment of the present invention is configured with a self-encoder network structure having a multi-level feature space and a distribution constraint, constructing the model requires constructing both the encoder network and the decoder network of the self-encoder network structure, together with the corresponding multi-level feature space. In this way, when a picture is detected, the input picture to be detected can be embedded into a feature space with a complementary multi-level structure, and the reconstructed picture corresponding to the picture to be detected is then obtained.

Step S206, calculating the reconstruction errors of the picture to be detected and the reconstructed picture;

Step S208, detecting the picture based on the reconstruction error.

In practical use, the above reconstruction error may also be referred to as the anomaly score of the picture to be detected. In general, the calculated reconstruction error characterizes the similarity between the picture to be detected and the reconstructed picture in both the original pixel space and the multi-level feature space, which helps to better distinguish normal samples from abnormal samples and thereby improves anomaly detection performance; the picture can therefore be detected using the reconstruction error. Specifically, in step S208, when the picture is detected, it may be determined whether the reconstruction error is smaller than a preset error threshold; if yes, the picture to be detected is determined to be a normal picture; if not, the picture to be detected is determined to be an abnormal picture.
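Steps S202 to S208 can be sketched as a small function. This is only an outline: `reconstruct` and `error_fn` are hypothetical callables standing in for the trained picture detection model and the reconstruction-error computation, and the clamping "model" and the threshold of 0.5 in the usage lines are made up for illustration.

```python
import math
from typing import Callable, List, Sequence

Picture = Sequence[float]


def detect_picture(
    picture: Picture,
    reconstruct: Callable[[Picture], Picture],
    error_fn: Callable[[Picture, Picture], float],
    error_threshold: float,
) -> str:
    """Return "normal" if the reconstruction error is below the threshold."""
    reconstructed = reconstruct(picture)             # step S204
    err = error_fn(picture, reconstructed)           # step S206
    # Step S208: compare the error with the preset error threshold.
    return "normal" if err < error_threshold else "abnormal"


def l2_error(a: Picture, b: Picture) -> float:
    """Pixel-space L2 distance, used here as a stand-in error function."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def clamp_model(picture: Picture) -> List[float]:
    """Hypothetical stand-in model: can only reproduce values inside [0, 1]."""
    return [min(max(v, 0.0), 1.0) for v in picture]


in_range = detect_picture([0.2, 0.9], clamp_model, l2_error, error_threshold=0.5)
out_of_range = detect_picture([0.2, 3.0], clamp_model, l2_error, error_threshold=0.5)
```

The comparison is strict, matching the text: an error exactly equal to the threshold is treated as abnormal.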

Therefore, with the picture detection method provided by the embodiment of the invention, when picture detection is performed, a picture to be detected is acquired and input into the pre-established picture detection model, a reconstructed picture corresponding to the picture to be detected is output through the picture detection model, and the reconstruction error between the picture to be detected and the reconstructed picture is calculated. Because the picture detection model is configured with a self-encoder network structure having a multi-level feature space and a distribution constraint, an input picture sample can be embedded into the multi-level feature space during detection, so that different picture samples are encoded and distinguished more accurately. This increases the complexity of the mapping from the input picture to the feature space, makes abnormal samples more likely to fall outside the feature space of normal samples, reduces confusion between normal and abnormal samples, and improves the accuracy of anomaly detection.

In practical use, the self-encoder network structure with a multi-level feature space and a distribution constraint in the embodiment of the invention comprises a multi-level self-encoder structure and a multi-level decoder structure, wherein the multi-level self-encoder structure further comprises the multi-level feature space.

Specifically, a convolutional neural network (CNN) has a pyramidal feature hierarchy, and the feature maps of different layers in the network contain information at different levels of the input picture. In general, shallower feature maps contain more low-level structural information, such as edges, lines, and corners, whereas deeper feature maps capture more high-level semantic information about category labels. The features of different layers of the network therefore encode representations of the input picture at different levels, and combining the complementary low-level and high-level features can effectively improve the performance of identifying and distinguishing different picture samples. For this reason, the multi-level self-encoder structure included in the self-encoder network structure in the embodiment of the invention embeds the input picture samples into a feature space with a complementary hierarchical structure, so as to encode and distinguish different picture samples more accurately. In other words, the multi-level feature space increases the complexity of the mapping from the input picture to the feature space and makes abnormal samples more likely to fall outside the feature space of normal samples.
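The idea of reading features from several depths of one encoder can be sketched on a 1-D signal. This is a toy stand-in, not a convolutional network: each stage simply averages neighbouring pairs (a stride-2 downsampling), and the choice of which stage counts as "shallow" or "deep" is an assumption made for illustration.

```python
from typing import List


def downsample(signal: List[float]) -> List[float]:
    """Stand-in for one stride-2 stage: average each pair of neighbours."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal) - 1, 2)]


def multi_level_features(signal: List[float], num_stages: int = 4) -> List[List[float]]:
    """Run all stages and keep every intermediate output, shallow to deep."""
    features = []
    current = list(signal)
    for _ in range(num_stages):
        current = downsample(current)
        features.append(current)
    return features


levels = multi_level_features([float(v) for v in range(16)])
low_level = levels[1]    # shallower stage: still keeps local structure
high_level = levels[3]   # deepest stage: one global summary value
```

The shallow output still has four values that follow the local shape of the input, while the deepest output collapses the whole signal into a single summary, which mirrors the low-level versus high-level distinction drawn above.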

Based on the above, in step S204, the step of outputting the reconstructed picture through the picture detection model may include the following steps: inputting the picture to be detected into the multi-level self-encoder structure, and embedding the picture to be detected into the multi-level feature space through the multi-level self-encoder structure to obtain the feature distribution of the picture to be detected; performing feature fitting between the feature distribution of the picture to be detected and the prior distribution pre-configured in the multi-level feature space; and inputting the feature fitting result into the multi-level decoder structure for picture reconstruction, so as to generate a reconstructed picture corresponding to the picture to be detected.

For ease of understanding, fig. 3 shows a schematic diagram of a picture detection process provided by an embodiment of the present invention. Specifically, in fig. 3, the self-encoder network structure includes a multi-level self-encoder structure, which is an encoder structure based on a convolutional neural network and includes at least one convolution structure, where the convolution structure includes a preset convolution layer and an activation layer, and the convolution layer is provided with a preset convolution kernel and a preset step size (stride).

Specifically, since the self-encoder network structure includes a multi-level self-encoder structure and a multi-level decoder structure, the multi-level self-encoder structure and the multi-level decoder structure are shown in fig. 3, wherein the multi-level self-encoder structure is simply represented as an encoder in fig. 3 for convenience of explanation, and only a limited number of convolution structures including preset convolution layers and activation layers are simply shown in fig. 3.

Further, the multi-level decoder structure in fig. 3 is likewise simply represented as a decoder, and for ease of illustration only a limited number of the convolution structures included in the decoder, i.e., a limited number of convolution layers and activation layers, are shown in fig. 3. The middle part of fig. 3 is the multi-level feature space.

Specifically, fig. 4 shows a schematic diagram of a possible self-encoder network structure, including a multi-level self-encoder structure and a multi-level decoder structure. Fig. 4 shows an encoder including four convolution structures, each denoted as a ConvDown module; that is, the encoder in fig. 4 includes four ConvDown modules, and the number following each ConvDown module name (for example, 256) indicates the number of convolution kernels of that module. In general, the ConvDown modules spatially downsample the feature maps, gradually increasing the number of feature maps, reducing the spatial resolution, and increasing the depth. In practical use, each ConvDown module may include a convolution layer with a 3×3 convolution kernel and a step size (stride) of 2, followed by a ReLU activation layer; the specific ConvDown module may be set according to the actual use condition, which is not limited in this embodiment of the present invention.
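To make the downsampling concrete, the spatial size after each 3×3, stride-2 convolution can be computed with the standard convolution output-size formula. The padding of 1 and the 64×64 input size are assumptions made for this sketch; neither value is given in the text.

```python
from typing import List


def conv_output_size(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Standard convolution output size: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_sizes(input_size: int, num_modules: int = 4) -> List[int]:
    """Spatial size after each of the stacked stride-2 downsampling modules."""
    sizes = []
    current = input_size
    for _ in range(num_modules):
        current = conv_output_size(current)
        sizes.append(current)
    return sizes


sizes = encoder_sizes(64)   # assumed 64x64 input picture
```

Under these assumptions each module halves the height and width, so four modules reduce the spatial resolution by a factor of 16 while the number of feature maps grows.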

Further, fig. 4 also shows a decoder including a plurality of convolution structures, namely a plurality of ConvUp modules. The ConvUp modules upsample in space while gradually reducing the number of feature maps, and finally output a reconstructed picture. Each ConvUp module included in the decoder in fig. 4 may include a deconvolution layer with a 3×3 convolution kernel and a step size (stride) of 2, followed by a ReLU activation layer; the specific ConvUp module may also be set according to the actual use condition, which is not limited in this embodiment of the present invention.
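The corresponding upsampling arithmetic for a 3×3, stride-2 deconvolution (transposed convolution) can be sketched with the usual transposed-convolution output-size formula. The padding of 1, the output padding of 1, and the 4×4 starting size are assumptions chosen so that each module exactly doubles the spatial size; the text does not specify them.

```python
from typing import List


def deconv_output_size(
    size: int, kernel: int = 3, stride: int = 2, padding: int = 1, output_padding: int = 1
) -> int:
    """Transposed convolution output size: (size - 1)*s - 2p + k + output_padding."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def decoder_sizes(input_size: int, num_modules: int = 4) -> List[int]:
    """Spatial size after each of the stacked stride-2 upsampling modules."""
    sizes = []
    current = input_size
    for _ in range(num_modules):
        current = deconv_output_size(current)
        sizes.append(current)
    return sizes


sizes_up = decoder_sizes(4)   # assumed 4x4 feature map entering the decoder
```

With these settings four upsampling modules take a 4×4 map back to 64×64, mirroring an encoder that halves the resolution four times.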

The middle part of fig. 4 is denoted by FC, namely FC128, FC1024, FC256, and FC2048. This fully connected part is also referred to as the feature embedding part. It generally takes the feature maps of a shallower layer and a deeper layer of the encoder network as input and uses fully connected (FC) layers to generate the low-level and high-level feature representations of the input picture sample, respectively. FC128 and FC1024 correspond to the low-level feature representation, and FC256 and FC2048 correspond to the high-level feature representation; the number following FC is a parameter of the fully connected layer, which may be set according to the actual use condition and is not limited in this embodiment of the present invention.
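The feature embedding step (flatten a feature map, then apply a fully connected layer) can be sketched as below. All sizes here are small hypothetical values chosen for readability and do not correspond to FC128, FC1024, FC256 or FC2048; the weights are random placeholders rather than trained parameters.

```python
import random
from typing import List, Tuple


def flatten(feature_map: List[List[float]]) -> List[float]:
    """Turn a 2-D feature map into a 1-D vector, row by row."""
    return [v for row in feature_map for v in row]


def fully_connected(inputs: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """One FC layer: output[j] = sum_i weights[j][i] * inputs[i] + bias[j]."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, bias)]


def make_layer(in_dim: int, out_dim: int, rng: random.Random) -> Tuple[List[List[float]], List[float]]:
    """Random placeholder weights and zero bias for an in_dim -> out_dim layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


rng = random.Random(0)
shallow_map = [[rng.random() for _ in range(8)] for _ in range(8)]  # hypothetical 8x8 shallow map
deep_map = [[rng.random() for _ in range(2)] for _ in range(2)]     # hypothetical 2x2 deep map

low_w, low_b = make_layer(64, 16, rng)
high_w, high_b = make_layer(4, 8, rng)
low_level = fully_connected(flatten(shallow_map), low_w, low_b)     # low-level representation
high_level = fully_connected(flatten(deep_map), high_w, high_b)     # high-level representation
```

Each branch has its own fully connected weights, so the low-level and high-level representations are separate vectors that can be constrained independently.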

Therefore, based on the network structure shown in fig. 4, the step of obtaining the feature distribution of the picture to be measured by using the self-encoder network structure includes the following steps:

Inputting a picture to be measured into a multi-level self-encoder structure, namely, into a multi-level self-encoder structure (encoder) at the left side in fig. 3 and 4, and obtaining level characteristic information of the picture to be measured through at least one convolution structure; embedding the hierarchical feature information into a multi-hierarchical feature space to obtain feature distribution of the picture to be detected, wherein the feature distribution comprises low-order representation of the picture to be detected and high-order representation of the picture to be detected.

Typically, the low-order representation includes low-level structural information of the picture to be measured, such as edges, lines, and corners, and the high-order representation includes high-level structural information of the picture to be measured, such as semantic information about category labels. When the encoder embeds the picture to be measured into the multi-level feature space to obtain the feature distribution of the picture to be measured, the feature distribution can be fitted to the prior distribution that is pre-configured in the multi-level feature space.

Specifically, in order to facilitate fitting the feature distribution of the picture to be measured to the prior distribution pre-configured in the multi-level feature space, the multi-level feature space in the embodiment of the invention further comprises a high-order feature discriminator and a low-order feature discriminator, as shown in fig. 3. When feature fitting is performed, the low-order representation included in the feature distribution of the picture to be measured is input to the low-order feature discriminator, and the high-order representation included in the feature distribution is input to the high-order feature discriminator; the feature distribution of the picture to be measured is then fitted through the high-order feature discriminator, the low-order feature discriminator, and the prior distribution pre-configured in the multi-level feature space.

In actual use, the preconfigured prior distribution may be a normal distribution, for example the standard normal distribution N(0, 1), so as to explicitly constrain and control the feature distribution of normal samples. In other embodiments, the preconfigured prior distribution may take other forms depending on actual use, and the embodiments of the present invention are not limited thereto.
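Drawing "real" feature vectors from a standard normal prior can be sketched as follows. The vector dimensions of 128 and 256 are hypothetical choices for the low-order and high-order spaces, and the helper name sample_prior is introduced only for illustration.

```python
import random


def sample_prior(dim, rng=None):
    """Draw one feature vector whose components are i.i.d. N(0, 1)."""
    rng = rng or random.Random()
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


rng = random.Random(0)            # fixed seed for a reproducible sketch
z_low = sample_prior(128, rng)    # assumed low-order feature dimension
z_high = sample_prior(256, rng)   # assumed high-order feature dimension
print(len(z_low), len(z_high))
```

Samples of this kind would play the role of the "real" inputs that the feature discriminators compare against the encoder outputs.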

The above feature fitting is implemented by introducing a series of feature discriminators, such as the high-order and low-order feature discriminators described above, into the embedded multi-level feature space. The discriminators aim to correctly distinguish the feature representations that the encoder network produces from normal sample inputs from the feature representations obtained by random sampling from the prior normal distribution. At the same time, the encoder network should make its encoding results mimic, as closely as possible, samples drawn from the prior normal distribution so as to fool the feature discriminators. The two networks thus oppose each other, iterating and adjusting their respective parameters, until the feature discriminator network can no longer accurately determine the real source of the encoder network's output.
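The opposing objectives of the two networks can be sketched with a binary cross-entropy loss of the kind commonly used in adversarial training. The embodiment does not state the exact loss, so the form below, the EPS constant, and the function names are assumptions.

```python
import math

EPS = 1e-12  # guards against log(0); an implementation detail assumed here


def discriminator_loss(p_prior, p_encoded):
    """Discriminator objective: score prior samples as real (1) and
    encoder outputs as fake (0). Inputs are lists of probabilities."""
    real_term = -sum(math.log(p + EPS) for p in p_prior) / len(p_prior)
    fake_term = -sum(math.log(1.0 - p + EPS) for p in p_encoded) / len(p_encoded)
    return real_term + fake_term


def encoder_adversarial_loss(p_encoded):
    """Encoder objective: have its outputs scored as real by the discriminator."""
    return -sum(math.log(p + EPS) for p in p_encoded) / len(p_encoded)


# A confident, correct discriminator has a low loss; a fooled one does not.
print(discriminator_loss([0.9, 0.8], [0.1, 0.2]))
print(encoder_adversarial_loss([0.1, 0.2]))
```

When the discriminator outputs 0.5 for every input it can no longer tell the two sources apart, which corresponds to the equilibrium described above.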

In summary, based on the schematic diagram of the self-encoder network structure shown in fig. 3, the self-encoder network structure can be used to learn parameterized feature embedding and feature reconstruction of normal picture samples, respectively, and detect and distinguish abnormal samples by calculating and comparing reconstruction errors between input pictures and reconstructed pictures. Specifically, the self-encoder network structure comprises two parts, namely a multi-level self-encoder structure and a multi-level decoder structure. The multi-level self-encoder structure is responsible for embedding an input picture to be detected into a feature space with a complementary hierarchical structure to obtain feature distribution of the picture to be detected, and simultaneously, performing feature fitting on the feature distribution of the picture to be detected and a priori distribution which is pre-configured in the multi-level feature space; the multi-level decoder structure, i.e. the above decoder, aims to learn the specific mapping from the prior feature space to the normal picture space accordingly, generate a reconstructed picture, and calculate the reconstruction error to realize the anomaly detection based on the threshold comparison.

In addition, in the embodiment of the invention, the characteristics of different layers of the convolutional neural network of the self-encoder network structure are considered to actually encode the identity information of different layers of the input picture (such as the picture to be detected), and the complementation of the characteristics of the lower layer and the higher layer can effectively improve the performance of identifying and distinguishing different picture samples. Therefore, the self-encoder network structure shown in fig. 3 can embed the input picture samples into a feature space with a complementary multi-level structure, and the multi-level feature space improves the complexity of mapping the input picture to the feature space, so that the abnormal samples are more likely to fall outside the feature space of the normal picture samples.

In addition, the prior distribution pre-configured in the multi-level feature space effectively improves the compactness of the encoded normal-sample feature space and reduces the possibility of confusion caused by an abnormal sample happening to be mapped into the normal-sample feature space. Furthermore, once such a parameterized prior feature space is defined, the multi-level decoder structure (i.e., the decoder network) will accordingly learn the specific mapping from the parameterized prior feature space to the normal picture space, instead of learning from all other possible unspecified mixed feature spaces. This limits the generalization capability of the decoder to a bounded manifold range, effectively suppresses the generalization performance of the network on abnormal samples, and makes abnormal samples more prone to higher reconstruction errors.

For ease of understanding, fig. 5 shows a schematic structural diagram of the discriminators, namely the high-order feature discriminator and the low-order feature discriminator. As shown in fig. 5, the two are similar in structure and differ only in the parameters of the fully connected (FC) layers. In actual use, the high-order and low-order feature discriminators mainly serve to determine whether a given feature representation is "true", that is, to distinguish a feature representation obtained by the encoder network (fake) from a feature representation obtained by random sampling from the prior normal distribution (real). Typically, a discriminator is made up of multiple fully connected layers, i.e., FC layers; its input is a particular feature representation, and its output is a scalar representing the probability that the feature representation is "true" (i.e., not produced by the encoder). The discriminators are embedded into the multi-level feature space shown in fig. 3 so as to correctly distinguish the feature representations obtained by encoding input normal samples with the encoder network from those obtained by random sampling from the prior normal distribution; the corresponding reconstructed picture is then output, and the reconstruction error is subsequently calculated.
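A feature discriminator of this kind, a stack of fully connected layers that maps a feature representation to a scalar probability, might be sketched as below. The layer sizes, the ReLU hidden activation, the sigmoid output, and the random initial weights are all assumptions for illustration; fig. 5 defines the actual layer parameters.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """One fully connected layer with small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, layer):
    """Apply one fully connected layer to the vector x."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def discriminator(x, layers):
    """Return the probability that feature vector x was sampled from the prior."""
    h = x
    for layer in layers[:-1]:
        h = [max(0.0, v) for v in linear(h, layer)]  # ReLU on hidden layers
    logit = linear(h, layers[-1])[0]                 # final layer has one output
    return 1.0 / (1.0 + math.exp(-logit))            # sigmoid to a probability


rng = random.Random(0)
sizes = [16, 32, 1]  # hypothetical: 16-d feature, one hidden layer, scalar output
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
feature = [rng.gauss(0.0, 1.0) for _ in range(16)]
p_real = discriminator(feature, layers)
print(p_real)
```

Under these assumptions the high-order and low-order discriminators would share this code and differ only in the sizes list.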

Further, in the embodiment of the present invention, when the reconstruction error is calculated, a first norm distance between the picture to be measured and the reconstructed picture and a second norm distance between the picture to be measured and the reconstructed picture in the multi-level feature space are generally calculated respectively; and then calculating the reconstruction error of the picture to be detected and the reconstructed picture based on the first norm distance and the second norm distance.

Specifically, the first norm distance and the second norm distance are each L2 norm distances; the reconstruction error is expressed as:

$$A(X) = \alpha \lVert X - \hat{X} \rVert_2 + \beta \sum_i \lVert E_i(X) - E_i(\hat{X}) \rVert_2$$

wherein $E_i(\cdot)$ represents the feature representation obtained from the feature map of the i-th layer of the multi-level self-encoder structure in the self-encoder network structure, and $\alpha$ and $\beta$ represent the weight hyperparameters. $X$ represents the picture to be measured, $\hat{X}$ represents the reconstructed picture, and $A(\cdot)$ represents the above reconstruction error, also called the anomaly score.
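Assuming the reconstruction error is a weighted sum of the pixel-space L2 distance and the per-layer feature-space L2 distances, as the definitions of E_i, α and β suggest, the anomaly score could be computed as in the following sketch. Pictures are represented as flat lists of numbers, and the two feature extractors are hypothetical stand-ins for the encoder's feature embeddings.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def anomaly_score(x, x_hat, feature_extractors, alpha=1.0, beta=1.0):
    """Weighted sum of pixel-space and feature-space L2 distances.

    feature_extractors stands in for E_i; alpha and beta are the weights.
    """
    pixel_term = l2_distance(x, x_hat)
    feature_term = sum(l2_distance(e(x), e(x_hat)) for e in feature_extractors)
    return alpha * pixel_term + beta * feature_term


# Hypothetical stand-ins for a low-order and a high-order embedding.
extractors = [lambda v: [sum(v)], lambda v: [max(v) - min(v)]]

picture = [0.0, 0.0]
reconstruction = [3.0, 4.0]
print(anomaly_score(picture, reconstruction, extractors, alpha=2.0, beta=0.5))
```

A perfectly reconstructed picture scores zero, and the score grows as the reconstruction departs from the input in either pixel space or feature space.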

Based on the self-encoder network structure, the picture detection method provided by the embodiment of the invention can embed the input picture sample into the feature space with the complementary hierarchical structure by utilizing the multi-level self-encoder structure so as to more accurately encode and distinguish different picture samples, thereby improving the complexity of mapping from the input picture to the feature space, enabling the abnormal sample to more easily fall outside the feature space of the normal sample, reducing confusion between the normal sample and the abnormal sample, and improving the accuracy of abnormal detection.

In addition, in the picture detection process, the feature distribution of the normal samples can be restricted explicitly, the compactness of the feature space of the normal samples obtained by encoding is improved effectively, the possibility of mapping the abnormal samples to the feature space of the normal samples is reduced, the generalization capability of a decoder is limited, and the performance of abnormal detection is further improved.

In practical use, the picture detection model can be obtained by an adversarial training method, so the embodiment of the invention also provides a training method of the picture detection model, wherein the picture detection model is configured with a self-encoder network structure with multi-level feature space and distribution constraint.

The multi-level self-encoder network structure comprises a multi-level self-encoder structure and a multi-level decoder structure, and the multi-level self-encoder structure comprises a multi-level feature space.

Specifically, as shown in fig. 6, a flowchart of a training method of a picture detection model includes the following steps:

Step S602, a pre-established picture set is obtained, wherein the picture set comprises a preset number of normal image samples;

Step S604, inputting a normal image sample into a picture detection model to be trained, and carrying out picture reconstruction on the normal image sample to obtain a reconstructed image sample corresponding to the normal image sample;

Step S606, calculating a loss function of the normal image sample and the reconstructed image sample;

Step S608, performing iterative optimization on model parameters of the picture detection model to be trained based on the loss function so as to train the picture detection model to be trained.

Further, the iteration times of the iteration optimization are required to be recorded in the iteration optimization process; if the iteration times reach a preset iteration threshold, ending the iteration optimization; and saving the model parameters of the current picture detection model to be trained.
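The iteration bookkeeping described above can be sketched as a loop that counts optimization steps, stops at a preset threshold, and returns the parameters to be saved. The step function here is a toy gradient-descent update on a single scalar, used only so that the sketch runs; the names train and toy_step are hypothetical.

```python
def train(step_fn, params, max_iterations):
    """Run step_fn until the iteration count reaches max_iterations.

    Returns the final parameters (the ones to be saved) and the count.
    """
    iteration = 0
    while iteration < max_iterations:
        params = step_fn(params)   # one iterative-optimization update
        iteration += 1             # record the number of iterations
    return params, iteration


def toy_step(params, lr=0.1):
    """Toy update: one gradient-descent step on the loss (w - 3)^2."""
    w = params["w"]
    grad = 2.0 * (w - 3.0)
    return {"w": w - lr * grad}


saved_params, iterations = train(toy_step, {"w": 0.0}, max_iterations=100)
print(iterations, saved_params)
```

In an actual training run, step_fn would compute the loss between the normal image samples and their reconstructions and update the model parameters accordingly.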

In practical use, the picture detection model trained by the above method can be called during picture detection. Specifically, the saved model parameters of the picture detection model can be loaded, and a normal or abnormal picture to be detected is then input. The picture detection model generates a corresponding reconstructed picture from the input picture, the reconstruction error is then calculated, and the reconstruction error is compared with a preset error threshold to detect abnormal pictures.
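The threshold comparison at the end of detection can be sketched as follows, with a reconstruction error below the preset threshold treated as normal and anything else as abnormal. The threshold value and the error values are made-up numbers for illustration only.

```python
def classify(reconstruction_error, threshold):
    """Label a picture from its reconstruction error and a preset threshold."""
    return "normal" if reconstruction_error < threshold else "abnormal"


ERROR_THRESHOLD = 0.5           # hypothetical preset error threshold
errors = [0.12, 0.49, 0.5, 1.7]  # hypothetical reconstruction errors
labels = [classify(e, ERROR_THRESHOLD) for e in errors]
print(labels)  # ['normal', 'normal', 'abnormal', 'abnormal']
```

An error exactly equal to the threshold is labeled abnormal in this sketch, because only errors strictly smaller than the threshold count as normal.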

On the basis of the foregoing embodiment, the embodiment of the present invention further provides a picture detection apparatus, as shown in fig. 7, which includes:

A first obtaining module 70, configured to obtain a picture to be measured;

A reconstruction module 72, configured to input the picture to be detected to a pre-established picture detection model, so as to output a reconstructed picture corresponding to the picture to be detected through the picture detection model; the picture detection model is configured with a self-encoder network structure with multi-level characteristic space and distribution constraint;

A first calculation module 74, configured to calculate a reconstruction error between the picture to be measured and the reconstructed picture;

A detection module 76 for detecting the picture based on the reconstruction error.

Further, the embodiment of the invention also provides a training device for the picture detection model, wherein the picture detection model is configured with a self-encoder network structure with multi-level feature space and distribution constraint; this self-encoder network structure comprises a multi-level self-encoder structure and a multi-level decoder structure, and the multi-level self-encoder structure comprises a multi-level feature space. Specifically, fig. 8 shows a schematic structural diagram of the training device for the picture detection model, and the training device includes:

A second obtaining module 80, configured to obtain a pre-established picture set, where the picture set includes a preset number of normal image samples;

The input module 82 is configured to input the normal image sample to a picture detection model to be trained, and reconstruct a picture of the normal image sample to obtain a reconstructed image sample corresponding to the normal image sample;

a second calculation module 84 for calculating a loss function of the normal image samples and the reconstructed image samples;

And an optimization module 86, configured to iteratively optimize model parameters of the to-be-trained picture detection model based on the loss function, so as to train the to-be-trained picture detection model.

The image detection device and the image detection model training device provided by the embodiment of the invention have the same technical characteristics as the image detection method and the image detection model training method provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.

The embodiment of the invention also provides electronic equipment, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the steps of the method when executing the computer program.

Embodiments of the present invention also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above method.

Further, an embodiment of the present invention further provides a schematic structural diagram of an electronic device, as shown in fig. 9, where the electronic device includes a processor 91 and a memory 90, where the memory 90 stores computer executable instructions that can be executed by the processor 91, and the processor 91 executes the computer executable instructions to implement the above-mentioned picture detection method or the training method of the picture detection model.

In the embodiment shown in fig. 9, the electronic device further comprises a bus 92 and a communication interface 93, wherein the processor 91, the communication interface 93 and the memory 90 are connected by means of the bus 92.

The memory 90 may include a high-speed random access memory (RAM) and may further include a non-volatile memory, such as at least one disk memory. The communication connection between the system network element and at least one other network element is implemented via at least one communication interface 93 (which may be wired or wireless), and may use the internet, a wide area network, a local area network, a metropolitan area network, etc. The bus 92 may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 92 may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one bi-directional arrow is shown in fig. 9, but this does not mean that there is only one bus or one type of bus.

The processor 91 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 91 or by instructions in the form of software. The processor 91 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory, and the processor 91 reads the information in the memory and, in combination with its hardware, completes the picture detection method or the training method of the picture detection model in the foregoing embodiment.

The computer program product of the image detection method, the training method of the image detection model and the device provided by the embodiment of the invention comprises a computer readable storage medium storing program codes, wherein the instructions included in the program codes can be used for executing the method described in the method embodiment, and specific implementation can be referred to the method embodiment and will not be repeated here.

It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding process in the foregoing method embodiment for the specific working process of the apparatus described above, which is not described herein again.

In addition, in the description of embodiments of the present invention, unless explicitly stated and limited otherwise, the terms "mounted," "coupled," and "connected" are to be construed broadly; for example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct, or indirect through an intermediate medium, or it may be internal communication between two elements. The specific meaning of the above terms in the present invention will be understood by those skilled in the art in specific cases.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.

In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

Finally, it should be noted that: the above examples are only specific embodiments of the present invention for illustrating the technical solution of the present invention, but not for limiting the scope of the present invention, and although the present invention has been described in detail with reference to the foregoing examples, it will be understood by those skilled in the art that the present invention is not limited thereto: any person skilled in the art may modify or easily conceive of the technical solution described in the foregoing embodiments, or perform equivalent substitution of some of the technical features, while remaining within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims

1. A picture detection method, comprising:

Acquiring a picture to be detected;

Inputting the picture to be measured into a multi-level self-encoder structure, and embedding the picture to be measured into a multi-level characteristic space through the multi-level self-encoder structure to obtain characteristic distribution of the picture to be measured; the multi-level characteristic space and distribution constraint self-encoder network structure comprises a multi-level self-encoder structure and a multi-level decoder structure, wherein the multi-level self-encoder structure comprises a multi-level characteristic space; performing feature fitting on the feature distribution of the picture to be detected and the priori distribution pre-configured in the multi-level feature space; inputting the characteristic fitting result into the multi-level decoder structure for picture recombination so as to generate a reconstructed picture corresponding to the picture to be detected; the multi-level feature space comprises a high-order feature discriminator and a low-order feature discriminator;

Calculating the reconstruction errors of the picture to be detected and the reconstructed picture;

and detecting the picture based on the reconstruction error.

2. The method of claim 1, wherein the multi-level self-encoder structure is a convolutional neural network-based encoder structure, and wherein the multi-level self-encoder structure comprises at least one convolutional structure comprising a preset convolutional layer and an active layer, and wherein the convolutional layer is provided with a preset convolutional kernel and step size;

Inputting the picture to be detected into the multi-level self-encoder structure, embedding the picture to be detected into the multi-level feature space through the multi-level self-encoder structure, and obtaining the feature distribution of the picture to be detected comprises the following steps:

Inputting the picture to be detected into the multi-level self-encoder structure, and obtaining the hierarchical characteristic information of the picture to be detected through at least one convolution structure;

and embedding the hierarchical feature information into the multi-level feature space to obtain feature distribution of the picture to be detected, wherein the feature distribution comprises low-order representation of the picture to be detected and high-order representation of the picture to be detected.

3. The method according to claim 2, wherein the step of fitting features of the feature distribution of the picture to be measured to a pre-configured prior distribution in the multi-level feature space comprises:

Inputting a low-order representation included in the feature distribution of the picture to be detected to the low-order feature discriminator, and inputting a high-order representation included in the feature distribution of the picture to be detected to the high-order feature discriminator;

Fitting the feature distribution of the picture to be measured through the high-order feature discriminator, the low-order feature discriminator and the prior distribution which is preconfigured in the multi-level feature space.

4. The method according to claim 1, wherein the step of calculating a reconstruction error of the picture to be measured and the reconstructed picture comprises:

Respectively calculating first norm distances of the picture to be detected and the reconstructed picture, and second norm distances of the picture to be detected and the reconstructed picture in the multi-level feature space;

and calculating the reconstruction error of the picture to be detected and the reconstructed picture based on the first norm distance and the second norm distance.

5. The method of claim 4, wherein the first and second norm distances are each L2 norm distances;

the reconstruction error is expressed as:

$$A(X) = \alpha \lVert X - \hat{X} \rVert_2 + \beta \sum_i \lVert E_i(X) - E_i(\hat{X}) \rVert_2$$

wherein $E_i(\cdot)$ represents a feature representation obtained from a feature map of an i-th layer of the multi-level self-encoder structure in the self-encoder network structure, $\alpha$ and $\beta$ represent weight hyperparameters, $X$ represents the picture to be measured, $\hat{X}$ represents the reconstructed picture, and $A(\cdot)$ represents the reconstruction error.

6. The method of claim 1, wherein detecting the picture based on the reconstruction error comprises:

Judging whether the reconstruction error is smaller than a preset error threshold value or not;

if yes, determining the picture to be detected as a normal picture;

and if not, determining the picture to be detected as an abnormal picture.

7. The training method of the picture detection model is characterized in that the picture detection model is configured with a self-encoder network structure with multi-level characteristic space and distribution constraint;

the multi-level self-encoder network structure comprises a multi-level self-encoder structure and a multi-level decoder structure, wherein the multi-level self-encoder structure comprises a multi-level feature space; the multi-level feature space comprises a high-order feature discriminator and a low-order feature discriminator;

The method comprises the following steps:

acquiring a pre-established picture set, wherein the picture set comprises a preset number of normal image samples;

Inputting the normal image sample into a picture detection model to be trained, and carrying out picture reconstruction on the normal image sample to obtain a reconstructed image sample corresponding to the normal image sample; calculating a loss function of the normal image sample and the reconstructed image sample;

Performing iterative optimization on model parameters of the picture detection model to be trained based on the loss function so as to train the picture detection model to be trained;

The picture detection model embeds a picture to be detected, which is input into a multi-level self-encoder structure, into a multi-level feature space through the multi-level self-encoder structure to obtain feature distribution of the picture to be detected; the multi-level characteristic space and distribution constraint self-encoder network structure comprises a multi-level self-encoder structure and a multi-level decoder structure, wherein the multi-level self-encoder structure comprises a multi-level characteristic space; performing feature fitting on the feature distribution of the picture to be detected and the priori distribution pre-configured in the multi-level feature space; and inputting the characteristic fitting result into the multi-level decoder structure for picture recombination so as to generate a reconstructed picture corresponding to the picture to be detected.

8. The method of claim 7, wherein the method further comprises:

Recording the iteration times of the iteration optimization;

Ending the iterative optimization if the iterative times reach a preset iterative threshold value;

and saving the model parameters of the current picture detection model to be trained.

9. A picture detection apparatus, comprising:

the first acquisition module is used for acquiring a picture to be detected;

The reconstruction module is used for inputting the picture to be detected into a multi-level self-encoder structure, and embedding the picture to be detected into a multi-level feature space through the multi-level self-encoder structure to obtain feature distribution of the picture to be detected; the multi-level characteristic space and distribution constraint self-encoder network structure comprises a multi-level self-encoder structure and a multi-level decoder structure, wherein the multi-level self-encoder structure comprises a multi-level characteristic space; performing feature fitting on the feature distribution of the picture to be detected and the priori distribution pre-configured in the multi-level feature space; inputting the characteristic fitting result into the multi-level decoder structure for picture recombination so as to generate a reconstructed picture corresponding to the picture to be detected; the multi-level feature space comprises a high-order feature discriminator and a low-order feature discriminator;

the first calculation module is used for calculating the reconstruction errors of the picture to be detected and the reconstructed picture;

and the detection module is used for detecting the picture based on the reconstruction error.

10. The training device of the picture detection model is characterized in that the picture detection model is provided with a self-encoder network structure with multi-level characteristic space and distribution constraint;

The device comprises:

the second acquisition module is used for acquiring a pre-established picture set, wherein the picture set comprises a preset number of normal image samples;

The input module is used for inputting the normal image sample into a picture detection model to be trained, carrying out picture reconstruction on the normal image sample, and obtaining a reconstructed image sample corresponding to the normal image sample;

a second calculation module for calculating a loss function of the normal image sample and the reconstructed image sample;

the optimization module is used for carrying out iterative optimization on the model parameters of the picture detection model to be trained based on the loss function so as to train the picture detection model to be trained.

11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any of the preceding claims 1-8 when the computer program is executed by the processor.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the method according to any of the preceding claims 1-8.