CN117409285A - Image detection method and device and electronic equipment

Image detection method and device and electronic equipment

Info

Publication number
CN117409285A
Authority
CN
China
Prior art keywords
image
current frame
sample
local feature
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311719012.5A
Other languages
Chinese (zh)
Other versions
CN117409285B (en)
Inventor
李晨
江腾飞
王嘉磊
皮成祥
张健
王珊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shining 3D Technology Co Ltd
Original Assignee
Shining 3D Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shining 3D Technology Co Ltd filed Critical Shining 3D Technology Co Ltd
Priority to CN202311719012.5A priority Critical patent/CN117409285B/en
Publication of CN117409285A publication Critical patent/CN117409285A/en
Application granted granted Critical
Publication of CN117409285B publication Critical patent/CN117409285B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 - Proximity, similarity or dissimilarity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present disclosure provide an image detection method, an image detection apparatus, and an electronic device. A current frame image acquired by a lens module and multiple reference frame images acquired before and/or after the current frame image are obtained, and a shallow local feature map is extracted from the current frame image and from each reference frame image. The shallow local feature maps of the reference frames are then fused into the shallow local feature map of the current frame image, using a fusion manner that strengthens the feature values at positions where the current-frame map and the reference-frame maps are highly similar, so as to obtain a fused shallow local feature map. Whether the lens module is dirty is then determined based on the fused shallow local feature map.

Description

Image detection method and device and electronic equipment
Technical Field
The disclosure relates to the technical field of image processing, and in particular relates to an image detection method, an image detection device and electronic equipment.
Background
When images are acquired with a lens module, the lens module often becomes dirty. Dirt on the lens module seriously degrades the quality of the acquired images and therefore affects their subsequent use. It is thus necessary to detect lens dirt in time so that the user can be prompted to clean the lens. However, existing approaches are easily disturbed by various conditions and may mistake dirt-like regions in an image for actual dirt, making the detection result inaccurate. A more accurate solution for detecting lens dirt is therefore desirable.
Disclosure of Invention
The disclosure provides an image detection method, an image detection device and electronic equipment.
According to a first aspect of embodiments of the present disclosure, there is provided an image detection method, the method including:
acquiring a current frame image and a reference image acquired by a lens module, wherein the reference image comprises a plurality of frame images acquired before and/or after the current frame image;
respectively carrying out feature extraction on the current frame image and the reference image to obtain a shallow local feature map of the current frame image and a shallow local feature map of the reference image;
fusing the shallow local feature map of the reference image into the shallow local feature map of the current frame image to obtain a fused shallow local feature map, wherein in the fusion process, the feature value at a target pixel position in the shallow local feature map of the current frame image is enhanced, and the similarity between the feature value at the target pixel position and the feature value at the corresponding pixel position in the shallow local feature map of the reference image is higher than a preset similarity;
and determining whether dirt exists in the lens module based on the fused shallow local feature map.
According to a second aspect of embodiments of the present disclosure, there is provided an image detection method, the method including:
acquiring a to-be-detected image currently acquired by a lens module on three-dimensional scanning equipment;
inputting the image to be detected into a pre-trained detection model, and judging the current scanning environment type of the three-dimensional scanning equipment through the detection model; wherein the detection model is trained based on the following:
acquiring at least two frames of sample images, wherein the scanning environment types corresponding to the at least two frames of sample images are the same;
respectively extracting features of the at least two frames of sample images by using a preset initial model to obtain global features of each of the at least two frames of sample images;
and determining a target loss based on the difference between the similarity of the global features of every two sample images in the at least two frames of sample images and a preset similarity threshold, and adjusting model parameters of the initial model based on the target loss so as to train and obtain the detection model.
According to a third aspect of embodiments of the present disclosure, there is provided an image detection method, the method including:
acquiring a to-be-detected image currently acquired by a lens module on three-dimensional scanning equipment;
inputting the image to be detected into a pre-trained detection model, and judging the current scanning environment type of the three-dimensional scanning equipment through the detection model; wherein the detection model is trained based on the following:
acquiring sample image triplets, wherein each sample image triplet comprises a first sample image, a second sample image with the same scanning environment type as the first sample image, and a third sample image with a scanning environment type different from the scanning environment type of the first sample image;
respectively extracting features of the first sample image, the second sample image and the third sample image through a preset initial model to obtain respective global features;
and determining a target loss based on the similarity of the global features of the second sample image and the global features of the first sample image and the similarity of the global features of the third sample image and the global features of the first sample image, and adjusting model parameters of the initial model based on the target loss so as to train and obtain the detection model.
According to a fourth aspect of embodiments of the present disclosure, there is provided an image detection method, the method including:
acquiring a to-be-detected image currently acquired by a lens module on three-dimensional scanning equipment;
extracting features of the image to be detected to obtain global features;
judging whether the current scanning environment type of the three-dimensional scanning equipment is a target environment or not according to the proximity degree of the global feature and a preset feature clustering center; the feature cluster center is a cluster center of global features of a plurality of frames of sample images, and the sample images are images acquired by three-dimensional scanning equipment in a target environment.
According to a fifth aspect of embodiments of the present disclosure, there is provided an image detection apparatus including:
the acquisition module is used for acquiring a current frame image and a reference image acquired by the lens module, wherein the reference image comprises a plurality of frame images acquired before and/or after the current frame image;
the feature extraction module is used for carrying out feature extraction on the current frame image and the reference image respectively to obtain a shallow local feature map of the current frame image and a shallow local feature map of the reference image;
the fusion module is used for fusing the shallow local feature map of the reference image into the shallow local feature map of the current frame image to obtain a fused shallow local feature map, wherein in the fusion process, the feature value at a target pixel position in the shallow local feature map of the current frame image is enhanced, and the similarity between the feature value at the target pixel position and the feature value at the corresponding pixel position in the shallow local feature map of the reference image is higher than the preset similarity;
and the prediction module is used for determining whether dirt exists in the lens module based on the fused shallow local feature map.
According to a sixth aspect of embodiments of the present disclosure, there is provided an electronic device comprising a processor, a memory, and computer instructions stored in the memory and executable by the processor, wherein the processor, when executing the computer instructions, implements the method mentioned in the first aspect, the second aspect, the third aspect and/or the fourth aspect.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed, implement the method mentioned in the first, second, third and/or fourth aspects above.
In the embodiments of the present disclosure, when lens dirt is detected based on images acquired by the lens module, the fact that dirt stays at a fixed position across the multiple frames acquired by the lens module can be built into the detection mechanism, improving the accuracy of the detection result. Specifically, a current frame image acquired by the lens module and multiple reference frame images acquired by the lens module before and/or after the current frame image are obtained, and feature extraction is performed on the current frame image and the reference frames to obtain a shallow local feature map of the current frame image and a shallow local feature map of each reference frame. Shallow local features mainly contain information such as texture, edges, and corners in the image, and therefore cover the dirt features in the image. To highlight the dirt features in the shallow local feature map, the shallow local feature map of each reference frame can be fused into the shallow local feature map of the current frame image, and a fusion manner can be chosen such that, during fusion, the feature values at positions of the current-frame map that are highly similar to the reference-frame maps are strengthened, yielding a fused shallow local feature map; whether dirt exists in the lens module is then determined based on the fused shallow local feature map. Because the position of the dirt is fixed in every frame, the positions of high similarity across the shallow local feature maps of the frames are exactly where the dirt is located, and strengthening the feature values at those positions makes the dirt features more prominent in the resulting fused shallow local feature map, so that the dirt detection result obtained from it is more accurate.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
Fig. 1 is a schematic diagram of an image detection method according to an embodiment of the present disclosure.
Fig. 2 is a flowchart of an image detection method according to an embodiment of the present disclosure.
Fig. 3 is a schematic diagram of a method for detecting whether a lens module is dirty or not according to an embodiment of the disclosure.
Fig. 4 is a schematic diagram of a method for detecting whether a lens module is dirty or not according to an embodiment of the disclosure.
Fig. 5 is a schematic diagram of a method for detecting lens module dirt, lens fog, and the scanning environment type according to an embodiment of the disclosure.
Fig. 6 is an architecture diagram of a detection model of an embodiment of the present disclosure.
Fig. 7 is a schematic diagram of an image detection apparatus according to an embodiment of the present disclosure.
Fig. 8 is a schematic diagram of a logic structure of an electronic device according to an embodiment of the disclosure.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
In order to better understand the technical solutions in the embodiments of the present disclosure and make the above objects, features and advantages of the embodiments of the present disclosure more comprehensible, the technical solutions in the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings.
When images are acquired with a lens module, the lens module often becomes dirty. Dirt on the lens module seriously degrades the quality of the acquired images and therefore affects their subsequent use. It is thus necessary to detect lens dirt in time so that the user can be prompted to clean the lens.
For example, consider scanning a patient's teeth with an intraoral three-dimensional scanning device in order to reconstruct a tooth model of the patient. If the lens module of the scanning device is dirty, the dirt appears in every frame the device acquires, which seriously degrades the three-dimensional model reconstructed from those images and forces the user to rescan, harming the user experience. It is therefore necessary to detect the dirt condition of the lens module in time and promptly prompt the user to clean the lens.
The applicant has found that, at present, lens-module dirt detection is typically performed by training a detection model in advance on a large number of sample images labeled as dirty or clean, and then using this model to detect an image under test. In this approach the model predicts the result from the features of the current frame alone, so it is easily disturbed by various situations; for example, regions in some images that merely resemble dirt are misidentified as dirt. The applicant observes, however, that images collected by a dirty lens module have a distinctive characteristic: the dirt appears at the same, fixed position in every frame the module collects. Existing model-based detection extracts features from a single frame and predicts the result from those features alone, so this characteristic is not exploited and the accuracy of the detection result is low.
On this basis, the embodiments of the present application provide an image detection method in which, when lens dirt is detected from images acquired by the lens module, the fact that dirt stays at a fixed position across the multiple acquired frames is built into the detection mechanism, improving the accuracy of the detection result. For example, as shown in Fig. 1, a current frame image acquired by the lens module and multiple reference frame images acquired before and/or after it can be obtained, and feature extraction is performed on them to obtain a shallow local feature map of the current frame image and of each reference frame. Shallow local features mainly contain information such as texture, edges, and corners, and therefore cover the dirt features in the image. To highlight the dirt features, the shallow local feature map of each reference frame can be fused into that of the current frame image, using a fusion manner that strengthens, during fusion, the feature values at positions of the current-frame map that are highly similar to the reference-frame maps; a fused shallow local feature map is thus obtained, and whether dirt exists in the lens module is determined based on it. Because the dirt position is fixed in every frame, the positions of high similarity across the frames' shallow local feature maps are exactly where the dirt lies, and strengthening the feature values at these positions makes the dirt features more prominent in the fused map, so the dirt detection result derived from it is more accurate.
The image detection method provided by the embodiment of the application can be executed by various electronic devices with detection capability, for example, mobile phones, computers, cloud servers and the like. The electronic device may be a device for capturing an image, or may be another device communicatively connected to the device for capturing an image, which is not limited in the embodiments of the present application.
As shown in fig. 2, a flowchart of an image detection method provided in an embodiment of the present application specifically includes the following steps:
s202, acquiring a current frame image and a reference image acquired by a lens module, wherein the reference image comprises a plurality of frame images acquired before and/or after the current frame image;
In step S202, a current frame image acquired by the lens module and reference images acquired by the lens module may be obtained, where the reference images are multiple frames acquired before and/or after the current frame image. For example, 2N+1 consecutively collected frames may be obtained, with the (N+1)-th frame taken as the current frame image and the preceding N frames and following N frames taken as reference images; of course, the 2N+1 frames may also be collected non-consecutively.
S204, respectively extracting the characteristics of the current frame image and the reference image to obtain a shallow local characteristic image of the current frame image and a shallow local characteristic image of the reference image;
in step S204, feature extraction may be performed on the current frame image and each frame of reference image, to obtain a shallow local feature map of the current frame image and a shallow local feature map of each frame of reference image. The shallow local features are some low-level features of the image, such as color, texture, edges, etc., of the image, and the shallow local features include features of more pixels, i.e., include more details. The features such as dirt in the image are shallow local features, namely can be reflected by a shallow local feature map.
Extracting the shallow local feature maps of the current frame image and the reference images may be implemented with common feature extraction networks, for example a MobileNetV2 network, a ResNet network, a VGG network, a Transformer network, and so on.
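As a minimal sketch of this step (not the patent's reference implementation), a shallow local feature map can be taken from an early stage of a standard backbone such as MobileNetV2. The cut-off layer, input sizes and use of torchvision are assumptions made for illustration.

```python
import torch
from torchvision.models import mobilenet_v2

# Minimal sketch: take the early layers of MobileNetV2 as the shallow local
# feature extractor. Cutting at features[:4] is an assumption; the patent does
# not specify from which layer the shallow local feature map is taken.
backbone = mobilenet_v2(weights=None).features
shallow_extractor = torch.nn.Sequential(*backbone[:4])  # early, high-resolution layers

frames = torch.randn(5, 3, 224, 224)      # e.g. current frame plus 4 reference frames
shallow_maps = shallow_extractor(frames)  # [5, C, H, W] shallow local feature maps
print(shallow_maps.shape)
```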
S206, fusing the shallow local feature map of the reference image into the shallow local feature map of the current frame image to obtain a fused shallow local feature map, wherein in the fusion process, the feature value at a target pixel position in the shallow local feature map of the current frame image is enhanced, and the similarity between the feature value at the target pixel position and the feature value at the corresponding pixel position in the shallow local feature map of the reference image is higher than a preset similarity;
Lens dirt usually stays at a fixed position across the multiple frames acquired by the lens module; that is, the dirt features appear at the same position in the shallow local feature map of every frame, and the dirt itself does not change, so the feature values at the dirt positions in these maps are close to one another. Based on this, in step S206 the shallow local feature map of each reference frame may be fused into the shallow local feature map of the current frame image one by one, and an appropriate fusion manner may be selected so that, during fusion, the feature value at a target pixel position of the current frame's map is enhanced, where the similarity between the feature value at the target pixel position and the feature value at the corresponding position in the reference frame's map is higher than a preset similarity. A position corresponding to a dirt feature in the current frame's map is very likely to be such a high-similarity position, so enhancing the feature values at the target pixel positions where the two maps are highly similar causes the dirt features in the resulting fused shallow local feature map to be strengthened, i.e. to become more obvious.
To enhance the dirt features in the shallow local feature map of the current frame image during fusion and make them more obvious, the fusion of the reference image's shallow local feature map with the current frame's shallow local feature map may use an attention-based fusion mode, a Pyramid Pooling Module (PPM), an Atrous Spatial Pyramid Pooling (ASPP) module, or a Non-Local module.
S208, determining whether dirt exists in the lens module or not based on the fused shallow local feature map.
In step S208, after the fused shallow local feature map is obtained, whether the lens module has dirt may be determined based on it. Since the dirt features in the fused shallow local feature map have been enhanced and become more prominent, the dirt prediction obtained from the fused map is more accurate. The embodiments of the present application are not limited in this respect: dirt detection may be performed directly on the fused shallow local feature map, or by combining the fused shallow local feature map with other types of features of the current frame image. If the detection result indicates that the lens module is dirty, the user can be prompted to clean it. Prompting modes include, but are not limited to, voice prompts, pop-up window prompts, bell sounds, vibration, etc.
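The following hypothetical sketch shows one way such a dirt decision and prompt could look; the head sizes, feature dimension and 0.5 threshold are assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

# Hypothetical detection head: a small classifier over the fused feature vector,
# followed by a user prompt when dirt is predicted.
class DirtHead(nn.Module):
    def __init__(self, feat_dim: int = 1280):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, fused_feature: torch.Tensor) -> torch.Tensor:
        # probability that the lens module is dirty
        return torch.sigmoid(self.classifier(fused_feature))

head = DirtHead()
prob_dirty = head(torch.randn(1, 1280))
if prob_dirty.item() > 0.5:
    # any of the prompting modes mentioned above (pop-up, voice, vibration, ...)
    print("Lens dirt detected - please clean the lens module.")
```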
Considering that the shallow local feature map is only a few low-level features of the reflected image, the detection result may not be accurate enough by performing the dirt detection of the lens module based on the shallow local features only. Therefore, in some embodiments, as shown in fig. 3, when determining whether there is a dirt in the lens module based on the fused shallow local feature map, feature extraction may be further performed on the current frame image to obtain a global feature of the current frame image, where the global feature is a higher-level feature extracted from the image, and includes some higher-level semantic information in the image, and then the global feature and the shallow local feature map may be fused to obtain a fused feature, and then whether there is a dirt in the lens module is determined based on the fused feature. The global features may be represented in the form of feature graphs or feature vectors, which are not limited in the embodiments of the present application.
In some embodiments, if the global feature is represented by a global feature vector, when the global feature and the shallow local feature map are fused to obtain a fused feature, the shallow local feature map may be pooled to obtain a shallow local feature vector, and then the shallow local feature vector and the global feature vector are fused. The fusion method includes various modes, for example, the shallow local feature vector and the global feature vector can be directly spliced to obtain the fusion feature. Alternatively, the shallow local feature vector and the global feature vector may be superimposed to obtain the fusion feature, where the superimposition refers to adding the values of the corresponding positions of the shallow local feature vector and the global feature vector to obtain a new feature vector.
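A minimal sketch of these two fusion options is given below, assuming the global feature is already a vector and the fused shallow local feature map has shape [B, C, H, W]; the tensor sizes are illustrative only.

```python
import torch
import torch.nn.functional as F

# Sketch of the two fusion manners described above: pool the shallow local
# feature map into a vector, then either splice (concatenate) it with the
# global feature vector or superimpose (element-wise add) the two vectors.
def fuse_global_and_local(global_vec: torch.Tensor, local_map: torch.Tensor,
                          mode: str = "concat") -> torch.Tensor:
    local_vec = F.adaptive_avg_pool2d(local_map, 1).flatten(1)  # pool map -> vector
    if mode == "concat":
        return torch.cat([global_vec, local_vec], dim=1)        # splicing
    return global_vec + local_vec                                # superposition (dims must match)

fused = fuse_global_and_local(torch.randn(2, 64), torch.randn(2, 64, 28, 28))
print(fused.shape)  # torch.Size([2, 128])
```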
In some embodiments, as shown in Fig. 4, before the shallow local feature map of the reference image is fused into the shallow local feature map of the current frame image, both maps may first undergo max pooling, and the max-pooled reference map is then fused into the max-pooled current-frame map. Max pooling extracts the salient features of local areas in the shallow local feature map, i.e. features at the level of a dirt region rather than a single pixel, which makes the dirt features more prominent. In addition, max pooling reduces the size of the shallow local feature maps, which speeds up the subsequent fusion operation.
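A brief sketch of this optional step, assuming a 2x2 pooling window (the patent does not fix the kernel size):

```python
import torch
import torch.nn.functional as F

# Downsample both shallow local feature maps before fusion so that region-level
# (rather than pixel-level) responses are kept and the later fusion runs on
# smaller maps.
cur_map = torch.randn(1, 24, 56, 56)
ref_map = torch.randn(1, 24, 56, 56)
cur_pooled = F.max_pool2d(cur_map, kernel_size=2)  # -> [1, 24, 28, 28]
ref_pooled = F.max_pool2d(ref_map, kernel_size=2)  # -> [1, 24, 28, 28]
```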
In some embodiments, when the shallow local feature map of the reference image is fused to the shallow local feature map of the current frame image to obtain the fused shallow local feature map, the shallow local feature map of each frame of the reference image and the shallow local feature map of the current frame image may be fused respectively, where, in order to make the feature value of the target pixel position in the shallow local feature map of the current frame image be enhanced after the fusion, for each pixel position in the shallow local feature map of the current frame image, if the similarity between the feature value of the pixel position and the feature value of the corresponding pixel position in the shallow local feature map of the reference image is higher, the fusion weight of the feature value of the pixel position is greater. In this way, the feature value at the pixel position with higher similarity with the shallow local feature map of the reference image in the shallow local feature map of the current frame image can be enhanced. After the shallow local feature map of each frame of reference image is fused with the shallow local feature map of the current frame of image, the shallow local feature maps of each frame obtained by fusion can be subjected to superposition processing to obtain the fused shallow local feature map, wherein the superposition processing is to add the feature values of the corresponding pixel positions on the shallow local feature maps of each frame.
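The sketch below illustrates one way this similarity-weighted fusion and final superposition could be realized; the use of channel-wise cosine similarity and the particular weighting formula are assumptions, since the patent only requires that higher similarity yield a larger fusion weight.

```python
import torch
import torch.nn.functional as F

# Positions where the current-frame map closely matches a reference map receive a
# larger weight on the current-frame value; the per-reference fused maps are then
# superposed (summed) to form the fused shallow local feature map.
def fuse_with_references(cur_map: torch.Tensor, ref_maps: list) -> torch.Tensor:
    fused_maps = []
    for ref_map in ref_maps:
        sim = F.cosine_similarity(cur_map, ref_map, dim=1)  # [B, H, W] per-pixel similarity
        weight = sim.clamp(min=0.0).unsqueeze(1)            # higher similarity -> larger weight
        fused_maps.append(weight * cur_map + (1.0 - weight) * ref_map)
    return torch.stack(fused_maps, dim=0).sum(dim=0)        # superpose per-reference results

fused = fuse_with_references(torch.randn(1, 24, 28, 28),
                             [torch.randn(1, 24, 28, 28) for _ in range(4)])
```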
In some embodiments, the lens module may be a lens module on a three-dimensional scanning device. The three-dimensional scanning device can be an oral cavity scanner, a facial scanner, an industrial scanner and a professional scanner, and can be used for three-dimensional reconstruction of teeth, faces, human bodies, industrial products, industrial equipment, cultural relics, artworks, artificial limbs, medical appliances, buildings and other articles.
For a three-dimensional scanning device, there is often a temperature difference between the scanning environment and the external environment during use, so fog frequently forms on the lens module. For example, with an intraoral three-dimensional scanning device, the temperature difference between the oral environment and the outside causes fog on the lens module, which interferes with intraoral scanning. Fog detection can therefore be performed on the lens module: when fog is present, a defogging function can first be started to remove it before images are collected again, preventing the fog from degrading the acquired images and, in turn, the subsequent three-dimensional reconstruction.
In addition, a three-dimensional scanning device generally has multiple scanning scenes, and across these scenes the amount of clutter around the target object to be reconstructed, the brightness of the scanning environment, and so on can differ greatly. The device therefore needs appropriate operation parameters when acquiring images, and appropriate processing modes when the acquired images are processed for reconstruction.
For example, an oral three-dimensional scanning device has two use scenes, intraoral scanning and extraoral scanning, and the two environments differ greatly: the intraoral environment contains interfering objects such as gums, the tongue, and buccal soft tissue, which the extraoral environment does not, and the intraoral environment is also darker than the extraoral environment. The images from the two scanning environments are therefore handled differently, whether during image acquisition or later when the acquired images are used to reconstruct the teeth.
In view of the foregoing, different modes of operation may be set for different usage scenarios, and appropriate processing modes, algorithm parameters, and/or device operation modes may be configured for each mode of operation. In one embodiment, the operation mode of the three-dimensional scanning device may include an operation mode of each device in the three-dimensional scanning device and a processing mode of three-dimensional scanning software matched with the three-dimensional scanning device.
Before the three-dimensional scanning equipment is used for acquiring the image, the current scanning environment type can be detected, and after a proper working mode is selected based on the current scanning environment type, the image acquisition, the three-dimensional reconstruction and other works can be performed.
It should be noted that, for the oral three-dimensional scanning device, the two usage scenarios of intraoral scanning and extraoral scanning may be further subdivided according to specific application scenarios; different working modes may then be set for each subdivided scenario, and appropriate processing modes, algorithm parameters, and/or device operation modes may be configured for each working mode.
In some embodiments, as shown in fig. 5, the image detection method not only can perform dirt detection on the lens module based on the image, but also can perform one or more of detection of lens fog and detection of scanning environment type at the same time, thereby improving detection efficiency. For example, feature extraction can be performed on the current frame image to obtain global features of the current frame image, feature extraction is performed on the reference image to obtain global features of each frame of reference image, the global features of the reference image are fused into the global features of the current frame image to obtain first fused global features, and then the current scanning environment type of the three-dimensional scanning device is judged based on the first fused global features. For example, using an oral three-dimensional scanning device as an example, it may be determined whether the type of scanning environment is an intraoral environment or an extraoral environment.
Or in some embodiments, feature extraction may be performed on the current frame image to obtain global features of the current frame image, feature extraction may be performed on the reference image to obtain global features of each frame of reference image, the global features of the reference image are fused into the global features of the current frame image to obtain second fused global features, and then whether fog exists on the lens module is determined based on the second fused global features.
Extracting the global features of the current frame image and the reference images may likewise be implemented with feature extraction networks such as a MobileNetV2 network, a ResNet network, a VGG network, a Transformer network, and so on.
In some embodiments, the image detection method may be implemented by a pre-trained detection model, for example, the detection of the lens module dirt may be implemented by the detection model, or the detection of the lens module dirt and the detection of the lens fog may be implemented by the detection model simultaneously, or the detection of the lens dirt and the detection of the scanning environment type may be implemented by the detection model simultaneously, or the detection of the lens dirt, the detection of the lens fog and the detection of the scanning environment type may be implemented by the detection model simultaneously.
For example, taking two tasks of detecting lens dirt and detecting scanning environment type as an example, the detection model can be trained by the following ways: the current frame sample image and the reference frame sample image may be acquired, where the current frame sample image and the reference frame sample image are acquired by the same three-dimensional scanning device, and the reference sample image includes multiple frame images acquired before and/or after the current frame sample image. The current frame sample image carries a label, and the label is used for indicating whether the lens module for collecting the current frame sample image comprises dirt and a scanning environment type corresponding to the current frame sample image, namely the scanning environment type of the three-dimensional scanning device when collecting the current frame sample image.
Then the current frame sample image and the reference sample image can be input into a preset initial model, the initial model respectively performs feature extraction on the current frame sample image and the reference sample image to obtain a shallow local feature image and global features of the current frame sample image, a shallow local feature image and global features of the reference sample image, then the shallow local feature image of the reference sample image is fused into the shallow local feature image of the current frame sample image to obtain a fused shallow local feature image (for distinguishing the fused shallow local feature image from the inference stage, hereinafter referred to as a sample fused shallow local feature), the sample fused shallow local feature image and the global features of the current frame sample image are fused to obtain fused features (for distinguishing the fused features from the inference stage, hereinafter referred to as sample fused features), then whether the lens module comprises dirt or not can be judged based on the sample fused features, and the first loss is determined based on the judgment result and the difference of the labels.
Meanwhile, the initial model may fuse the global feature of the current frame sample image with the global feature of the reference sample image to obtain a first sample fused global feature (for distinguishing from the first fused global feature of the inference stage, hereinafter referred to as a first sample fused global feature), determine a scanning environment type corresponding to the current frame sample image based on the first sample fused global feature, and determine a second loss based on the difference between the determination result and the tag. Model parameters of the initial model may then be adjusted based on the first loss and the second loss to train to obtain the detection model.
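A minimal training-step sketch for the joint optimization of the first and second losses is shown below; the two-head model interface, optimizer, and loss weights are placeholders/assumptions rather than the patent's exact formulation.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # first loss: dirt present / absent
ce = nn.CrossEntropyLoss()    # second loss: scanning environment type

def training_step(model, optimizer, cur_frame, ref_frames, dirt_label, env_label,
                  w_dirt: float = 1.0, w_env: float = 1.0) -> float:
    # hypothetical two-head forward pass over the current frame and its references
    dirt_logit, env_logits = model(cur_frame, ref_frames)
    loss = w_dirt * bce(dirt_logit, dirt_label) + w_env * ce(env_logits, env_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```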
In some embodiments, as shown in fig. 6, the model of the initial model may include the following subnetworks: the system comprises a feature extraction sub-network, a shallow local feature fusion sub-network, a first global fusion sub-network, a second global fusion sub-network and a loss joint optimization sub-network.
And the feature extraction sub-network is used for respectively carrying out feature extraction on the current frame sample image and the reference sample image to obtain a shallow local feature image and global features of the current frame sample image and shallow local feature images and global features of the reference sample image. The feature extraction sub-network may be a mobilenet v2 network, a res net network, a VGG network, a Transform network, or the like.
And the shallow local feature fusion sub-network is used for fusing the shallow local feature map of the reference sample image into the shallow local feature map of the current frame sample image to obtain a sample fused shallow local feature map.
And the first global fusion sub-network is used for fusing the sample fusion shallow local feature map and the global features of the sample image of the current frame to obtain sample fusion features.
And the second global fusion sub-network is used for fusing the global features of the sample image of the current frame and the global features of the reference sample image to obtain the first sample fusion global features.
And the loss joint optimization sub-network is used for judging whether the lens module includes dirt based on the sample fusion features and determining a first loss based on the difference between the judgment result and the label, judging the scanning environment type based on the first sample fusion global features and determining a second loss based on the difference between the judgment result and the label, and then adjusting the model parameters of the initial model based on the first loss and the second loss so as to train and obtain the detection model.
In some embodiments, as shown in Fig. 6, if the detection model is to implement the three tasks of detecting lens dirt, detecting lens fog, and detecting the scanning environment type simultaneously, the label carried by the current frame sample image further indicates whether the lens module that collected it has fog, and the initial model further includes a third global fusion sub-network, which is used for fusing the global features of the current frame sample image with the global features of the reference sample images to obtain second sample fusion global features.
In addition to the functions above, the loss joint optimization sub-network can judge whether fog exists on the lens module based on the second sample fusion global features, determine a third loss based on the difference between the judgment result and the label, and adjust the model parameters of the initial model based on the first loss, the second loss and the third loss so as to train and obtain the detection model. For example, different weights may be set for the three losses, the weighted sum of the three losses taken as the target loss, and the model parameters of the initial model adjusted based on the target loss to train and obtain the detection model.
In the related art, three separate detection models have to be trained in advance, one for lens dirt detection, one for lens fog detection, and one for scanning environment type detection, and used one by one, which is inefficient. In the embodiments of the present application, it is considered that all three detections are usually needed before the three-dimensional scanning device scans the target object to be reconstructed; only after confirming that the lens has neither dirt nor fog and determining the scanning environment type can the corresponding working mode be selected for the subsequent scan. The embodiments of the present application therefore combine the three detection tasks and implement them with a single detection model. To this end, a new detection model architecture is proposed, as shown in Fig. 6: the detection model includes a feature extraction sub-network, a shallow local feature fusion sub-network, a first global fusion sub-network, a second global fusion sub-network, a third global fusion sub-network, and a loss joint optimization sub-network. The features extracted by the feature extraction sub-network are shared by the three detection tasks, which avoids extracting features separately for each task and improves detection efficiency. The shallow local feature fusion sub-network and the first global fusion sub-network implement dirt detection, the second global fusion sub-network implements scanning environment type detection, and the third global fusion sub-network implements lens fog detection. The loss joint optimization sub-network determines the total loss based on the differences between the detection results of the three tasks and the ground truth (i.e. the results indicated by the labels), and the model parameters are adjusted based on this loss to train the model. In this way, three tasks can be completed by one model while detection efficiency is improved.
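An illustrative sketch of this shared-backbone, three-branch layout is given below; all sizes are assumptions, and the fusion sub-networks are abbreviated to single linear layers purely for illustration.

```python
import torch
import torch.nn as nn

# One shared feature extraction sub-network feeding a dirt branch, a scanning
# environment branch and a fog branch.
class MultiTaskDetector(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int = 1280, num_env_types: int = 2):
        super().__init__()
        self.backbone = backbone                            # shared feature extraction sub-network
        self.dirt_head = nn.Linear(feat_dim, 1)             # stands in for local + first global fusion
        self.env_head = nn.Linear(feat_dim, num_env_types)  # stands in for second global fusion
        self.fog_head = nn.Linear(feat_dim, 1)              # stands in for third global fusion

    def forward(self, x: torch.Tensor):
        feat = self.backbone(x)                             # features shared by the three tasks
        return self.dirt_head(feat), self.env_head(feat), self.fog_head(feat)
```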
In some embodiments, after determining the current scanning environment type of the three-dimensional scanning device based on the first fused global feature, the current working mode of the three-dimensional scanning device may be switched to a target working mode matched with the scanning environment type, or the user may be prompted to switch the current working mode of the three-dimensional scanning device to the target working mode matched with the scanning environment type; when the three-dimensional scanning equipment is in different working modes, the operation modes of devices on the three-dimensional scanning equipment are different when the three-dimensional scanning equipment is used for acquiring images, and/or the processing modes of processing the images are different in the process of three-dimensional reconstruction of a target object by using the images acquired by the three-dimensional scanning equipment. The processing mode comprises processing steps, adopted processing algorithms, processing parameters used in processing and the like for processing the image. The operation mode of the device may be an operation parameter or state of various devices in the three-dimensional scanning apparatus, for example, an on state of a light compensating lamp in the three-dimensional scanning apparatus, or a brightness parameter of the light compensating lamp, or an on state of an anti-fog module, etc. In some embodiments, consider that a three-dimensional scanning device, when scanning a target object to reconstruct a three-dimensional model of the target object, has the following two scenarios: one is a scene in which there are more impurities around the target object to be reconstructed, and the other is a scene in which there are fewer or no impurities around the target object to be reconstructed. For example, taking an oral three-dimensional scanning device as an example, in a scene of scanning real teeth of a user in an oral cavity, the user includes a tongue, soft tissues on the buccal side, gums and other numerous interferents around the teeth in the oral cavity, so that the acquired image includes more impurities. In the case of scanning a tooth model outside the oral cavity, for example, a tooth model made of paraffin, metal, resin, or the like is scanned, and the periphery of the tooth does not include interference, so that the acquired image has fewer impurities.
Therefore, the working modes of the three-dimensional scanning device may include a first working mode corresponding to the first type of scene and a second working mode corresponding to the second type of scene, and when the working mode is the first working mode, the processing steps of the image acquired by the three-dimensional scanning device include the target steps in the three-dimensional reconstruction process. When the working mode is the second working mode, the processing step does not comprise the target step. Wherein the target step is for determining data in the image that are not relevant for a three-dimensional reconstruction of the target object to be reconstructed, such that these irrelevant data are ignored when generating the three-dimensional model of the target object. For example, taking oral scanning as an example, irrelevant data can refer to tongue, buccal soft tissue, gum and other areas in an image, image data corresponding to the areas in the image can be identified first, the data are not used for participating in three-dimensional reconstruction, and interference of the data on a subsequent three-dimensional reconstruction process can be eliminated. In addition, the processing method can be adaptively adjusted for each sub-mode by subdividing a certain operation mode into a plurality of sub-modes. Such as when determining data that is not relevant for the three-dimensional reconstruction of a target object to be reconstructed, such as a tooth, the decision criteria may be adjusted.
In some embodiments, if it is determined that there is fog on the lens module based on the second fused global feature, the three-dimensional scanning device may be controlled to start a defogging function, by automatically detecting whether there is fog on the lens module, and automatically starting the defogging function of the three-dimensional scanning device when there is fog, so as to remove the fog of the lens, so that the quality of the acquired image is prevented from being affected by the fog of the lens module, and further the subsequent three-dimensional reconstruction is prevented from being affected.
Some scanning environment types cover diverse scenes, i.e. they can be subdivided into multiple subtypes. Taking the extraoral scanning environment of an oral three-dimensional scanning device as an example, extraoral environments vary widely: when a tooth model is scanned outside the mouth, the model may be made of many materials, such as metal, resin, or paraffin, so the sample images used to train the detection model can hardly cover every extraoral scene, and a model trained on such samples has difficulty predicting accurately for scenes the samples do not cover, which degrades the accuracy of the pre-trained scanning environment detection model. To improve the accuracy of the model's predictions of the scanning environment type, in some embodiments a contrastive learning mechanism may be introduced when training the initial model to obtain the detection model, i.e. a contrastive loss may be added when determining the loss of the initial model. The contrastive loss constrains the initial model so that, when it extracts features from images, the features of images of the same class (i.e. images with the same scanning environment type) are pulled as close together as possible, while the features of images of different classes are pushed as far apart as possible.
For example, in some embodiments, after the current frame sample image and the reference sample image are input into the initial model, the initial model may determine at least one set of sample image pairs from the current frame sample image and the reference sample image, where two of the sample image pairs correspond to the same type of scanning environment. Then, the similarity of the global features of the two frames of sample images in at least one set of sample image pairs can be determined, a fourth loss is determined based on the difference between the similarity and a preset similarity threshold, and then model parameters of an initial model can be adjusted based on the first loss, the second loss and the fourth loss so as to train to obtain the detection model. In general, if the types of the scanning environments corresponding to the two frames of sample images are the same, the features of the two frames of sample images should be similar, so a similarity threshold may be preset, if the types of the scanning environments of the two frames of sample images are the same, the similarity of the features of the two frames of sample images should be very close to the similarity threshold, so a fourth loss may be determined based on the proximity degree of the two frames of sample images, and the fourth loss may be used as a constraint condition of the model to train the initial model, so that when the trained model performs feature extraction on the images, the extracted features of the images with the same type of the scanning environments are also closer.
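A minimal sketch of such a pair-based fourth loss is shown below; the use of cosine similarity and the threshold value are assumptions about how the "proximity to a preset similarity threshold" could be penalized.

```python
import torch
import torch.nn.functional as F

# The similarity between the global features of two same-environment sample images
# is pushed toward a preset similarity threshold.
def pair_contrastive_loss(feat_a: torch.Tensor, feat_b: torch.Tensor,
                          sim_threshold: float = 1.0) -> torch.Tensor:
    sim = F.cosine_similarity(feat_a, feat_b, dim=1)  # similarity of the two global features
    return ((sim_threshold - sim) ** 2).mean()        # penalize the gap to the threshold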
Of course, in some embodiments, if the detection model further has a lens fog detection function, the model parameters of the initial model may be adjusted in combination with the first loss, the second loss, the third loss, and the fourth loss to train to obtain the detection model.
For example, different weights may be set for the four losses, the four losses are weighted and summed based on the weights to obtain the target loss, and then model parameters of the initial model are adjusted based on the target loss to train to obtain the detection model. For example, the target loss may be determined based on the following equation (1):
Loss = a1 * L_type + a2 * L_contrast + a3 * L_dirt + a4 * L_fog        (1)

wherein L_type is the loss of the scan environment type and a1 is the weight of the loss of the scan environment type; L_contrast is the contrast loss of the scan environment type and a2 is the weight of the contrast loss of the scan environment type; L_dirt is the loss of lens dirt and a3 is the weight of the loss of lens dirt; L_fog is the loss of lens fog and a4 is the weight of the loss of lens fog.
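For illustration only, the weighted combination in equation (1) could be implemented as a plain weighted sum of the four scalar losses; the weight values and the function name below are placeholders rather than values given in this embodiment.

```python
def target_loss(loss_type, loss_contrast, loss_dirt, loss_fog,
                a1=1.0, a2=0.5, a3=1.0, a4=1.0):
    # Weighted sum corresponding to equation (1); a1..a4 are hyperparameters
    # chosen per task (placeholder values here).
    return a1 * loss_type + a2 * loss_contrast + a3 * loss_dirt + a4 * loss_fog
```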
In some embodiments, after the current frame sample image and the reference sample image are input into the initial model, the initial model may determine at least one set of sample image triplets from the current frame sample image and the reference sample image, where each set of sample image triplets includes a first sample image, a second sample image whose scanning environment type is the same as that of the first sample image, and a third sample image whose scanning environment type is different from that of the first sample image. Feature extraction may then be performed on the first sample image, the second sample image and the third sample image respectively to obtain their respective global features, a fifth loss may be determined based on the similarity of the global features of the second sample image to the global features of the first sample image and the similarity of the global features of the third sample image to the global features of the first sample image, and model parameters of the initial model may then be adjusted based on the first loss, the second loss and the fifth loss to train to obtain the detection model. For two frames of sample images with the same scanning environment type, the similarity of their global features should be higher than the similarity of the global features of two frames of sample images with different scanning environment types; the fifth loss is determined based on this principle and is used to constrain the initial model when adjusting its model parameters, so that the feature extraction performed by the trained detection model is more accurate, and the detection result for the scanning environment type is correspondingly more accurate.
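One common way to realize such a fifth loss is a margin-based triplet formulation, sketched below under the assumption that cosine similarity is used and that a fixed margin separates the anchor-positive and anchor-negative similarities; the margin value and all names are illustrative, since this embodiment only states that the loss is derived from the two similarities.

```python
import torch
import torch.nn.functional as F

def triplet_contrast_loss(anchor, positive, negative, margin=0.2):
    # anchor and positive share the same scanning environment type; negative differs.
    sim_pos = F.cosine_similarity(anchor, positive, dim=-1)
    sim_neg = F.cosine_similarity(anchor, negative, dim=-1)
    # Require the anchor-positive similarity to exceed the anchor-negative
    # similarity by at least `margin`; otherwise incur a loss.
    return torch.clamp(sim_neg - sim_pos + margin, min=0.0).mean()
```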
Of course, in some embodiments, if the detection model further has a lens fog detection function, the model parameters of the initial model may be adjusted in combination with the first loss, the second loss, the third loss, and the fifth loss, so as to train to obtain the detection model.
In some embodiments, it is considered that the scenes corresponding to certain scan environment types tend to be similar and do not vary much. For example, taking the intraoral scanning environment type of an oral three-dimensional scanning device as an example, intraoral environments are often similar, so for images acquired in intraoral environments the image features are often similar as well, i.e. the feature vectors of these images are distributed within an approximate range of the feature space. Therefore, for the detection of a target environment whose corresponding scenes are often similar (such as the intraoral environment), a plurality of sample images collected by the three-dimensional scanning device in the target environment can be obtained in advance, feature extraction can be performed on these sample images to obtain their global features, and a feature cluster center of the global features of the sample images can be determined. For example, feature vectors characterizing the global features of the sample images may be obtained, and the cluster center of these global feature vectors determined. Then, the current scanning environment type of the three-dimensional scanning device can be determined based on the current frame image acquired by the three-dimensional scanning device: feature extraction is performed on the current frame image to obtain its global feature, and whether the current scanning environment type of the three-dimensional scanning device is the target environment is judged according to the proximity of the feature of the current frame image to the feature cluster center. If the feature of the current frame image is very close to the feature cluster center, the scanning environment type corresponding to the current frame image is the target environment. The proximity may be represented by the distance between the feature of the current frame image and the feature cluster center, for example a Euclidean distance or a Manhattan distance; if this distance is smaller than a preset distance threshold, the two are considered to be close, i.e. the scanning environment type corresponding to the current frame image is determined to be the target environment.
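A minimal sketch of this cluster-center check is given below, assuming the cluster center is simply the mean of the sample global feature vectors and that Euclidean distance with a preset threshold is used; these choices and all names are assumptions for illustration.

```python
import numpy as np

def build_cluster_center(sample_global_features):
    # sample_global_features: (num_samples, dim) global features extracted from
    # images collected by the scanning device in the target environment.
    return sample_global_features.mean(axis=0)

def is_target_environment(current_feature, cluster_center, dist_threshold=10.0):
    # Euclidean distance between the current frame's global feature and the
    # feature cluster center; close enough means the target environment.
    distance = np.linalg.norm(current_feature - cluster_center)
    return distance < dist_threshold
```

In practice, several cluster centers could be kept per target environment if its scenes form multiple tight clusters; a single mean center is the simplest assumption.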
Further, an embodiment of the present application further provides an image detection method, where the method includes:
acquiring a to-be-detected image currently acquired by a lens module on three-dimensional scanning equipment;
inputting the image to be detected into a pre-trained detection model, and judging the current scanning environment type of the three-dimensional scanning equipment through the detection model; wherein the detection model is trained based on the following:
acquiring at least two frames of sample images, wherein the scanning environment types corresponding to the at least two frames of sample images are the same;
respectively extracting features of the at least two frames of sample images by using a preset initial model to obtain global features of each of the at least two frames of sample images;
and determining target loss based on the difference between the similarity of the global features of every two sample images in the at least two frame sample images and a preset similarity threshold, and adjusting model parameters of the initial model based on the target loss so as to train and obtain the detection model.
Of course, the initial model may also be used to predict the scanning environment type corresponding to the sample image; a total loss is then obtained based on the difference between the predicted result and the real result together with the target loss, and the model parameters of the initial model are adjusted based on the total loss so as to train and obtain the detection model.
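As a hedged sketch of the total loss described above, the example below adds a cross-entropy classification loss on the predicted scanning environment type to the contrast-based target loss; the cross-entropy choice and the names are assumptions, not details fixed by this embodiment.

```python
import torch.nn.functional as F

def total_loss(type_logits, type_labels, contrast_target_loss):
    # Classification loss on the predicted scanning environment type plus the
    # contrast-based target loss described above.
    classification_loss = F.cross_entropy(type_logits, type_labels)
    return classification_loss + contrast_target_loss
```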
The specific training manner of the detection model may refer to the descriptions in the foregoing embodiments, which are not repeated herein.
Further, an embodiment of the present application further provides an image detection method, where the method includes:
acquiring a to-be-detected image currently acquired by a lens module on three-dimensional scanning equipment;
inputting the image to be detected into a pre-trained detection model, and judging the current scanning environment type of the three-dimensional scanning equipment through the detection model; wherein the detection model is trained based on the following:
acquiring sample image triplets, wherein each sample image triplet comprises a first sample image, a second sample image with the same scanning environment type as the first sample image, and a third sample image with a scanning environment type different from the scanning environment type of the first sample image;
respectively extracting features of the first sample image, the second sample image and the third sample image through a preset initial model to obtain respective global features;
and determining a target loss based on the similarity of the global features of the second sample image and the global features of the first sample image and the similarity of the global features of the third sample image and the global features of the first sample image, and adjusting model parameters of the initial model based on the target loss so as to train and obtain the detection model.
Of course, the initial model may also be used to predict the scanning environment type corresponding to the sample image; a total loss is then obtained based on the difference between the predicted result and the real result together with the target loss, and the model parameters of the initial model are adjusted based on the total loss so as to train and obtain the detection model.
The specific training manner of the detection model may refer to the descriptions in the foregoing embodiments, which are not repeated herein.
Further, an embodiment of the present application further provides an image detection method, where the method includes:
acquiring a to-be-detected image currently acquired by a lens module on three-dimensional scanning equipment;
extracting features of the image to be detected to obtain global features;
judging whether the current scanning environment type of the three-dimensional scanning equipment is a target environment or not according to the proximity degree of the global feature and a preset feature clustering center; the feature cluster center is a cluster center of global features of a plurality of frames of sample images, and the sample images are images acquired by three-dimensional scanning equipment in a target environment.
The specific implementation details of the detection method may refer to the descriptions in the above embodiments, and are not repeated here.
It will be appreciated that, in the absence of any conflict, the solutions described in the above embodiments may be freely combined to obtain new solutions, which, for reasons of brevity, are not enumerated one by one in the embodiments of the present disclosure.
Accordingly, an embodiment of the present disclosure further provides an image detection apparatus, as shown in fig. 7, including:
an obtaining module 71, configured to obtain a current frame image and a reference image collected by a lens module, where the reference image includes a plurality of frame images collected before and/or after the current frame image;
a feature extraction module 72, configured to perform feature extraction on the current frame image and the reference image, to obtain a shallow local feature map of the current frame image and a shallow local feature map of the reference image;
a fusion module 73, configured to fuse the shallow local feature map of the reference image to a shallow local feature map of the current frame image to obtain a fused shallow local feature map, where in the fusion process, a feature value at a target pixel position in the shallow local feature map of the current frame image is enhanced, and a similarity between a feature value at the target pixel position and a feature value at a corresponding pixel position in the shallow local feature map of the reference image is higher than a preset similarity;
And a prediction module 74, configured to determine whether there is dirt in the lens module based on the fused shallow local feature map.
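To make the fusion performed by the fusion module 73 concrete, the sketch below weights each pixel of the current frame's shallow local feature map by its per-pixel similarity to the corresponding pixel of a reference frame's map, so that positions where the two maps agree are enhanced, and then superposes the per-frame results; the exact weighting rule is an assumption, since the embodiments only require that more similar positions receive larger fusion weights.

```python
import torch
import torch.nn.functional as F

def fuse_shallow_features(cur_map, ref_maps):
    # cur_map: (C, H, W) shallow local feature map of the current frame.
    # ref_maps: iterable of (C, H, W) maps from the reference frames.
    fused = torch.zeros_like(cur_map)
    for ref_map in ref_maps:
        # Per-pixel cosine similarity along the channel dimension: (H, W).
        sim = F.cosine_similarity(cur_map, ref_map, dim=0)
        w = sim.clamp(min=0.0).unsqueeze(0)                 # larger where the maps agree
        # Give the current frame's feature value a larger weight where it is
        # more similar to the reference, then accumulate (superposition).
        fused = fused + w * cur_map + (1.0 - w) * ref_map
    return fused
```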
The specific steps of the image detection method performed by the apparatus may refer to the descriptions in the method embodiments, and are not repeated herein.
Further, an embodiment of the disclosure further provides an electronic device, as shown in fig. 8, where the device includes a processor 81, a memory 82, and computer instructions stored in the memory 82 and executable by the processor 81; when the processor 81 executes the computer instructions, the method according to any one of the foregoing embodiments is implemented.
The disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the previous embodiments.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
From the foregoing description of the embodiments, it will be apparent to those skilled in the art that the disclosed embodiments may be implemented by software plus a necessary general purpose hardware platform. Based on such understanding, the technical solutions of the embodiments of the present disclosure, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and which includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the embodiments, or in some parts of the embodiments, of the present disclosure.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
In this specification, each embodiment is described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to the description of the method embodiments for the relevant points. The apparatus embodiments described above are merely illustrative: the modules illustrated as separate components may or may not be physically separate, and the functions of the modules may be implemented in the same one or more pieces of software and/or hardware when implementing the embodiments of the present disclosure. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present disclosure without undue burden.
The foregoing is merely a specific implementation of the embodiments of this disclosure, and it should be noted that, for a person skilled in the art, several improvements and modifications may be made without departing from the principles of the embodiments of this disclosure, which should also be considered as the protection scope of the embodiments of this disclosure.

Claims (17)

1. An image detection method, the method comprising:
acquiring a current frame image and a reference image acquired by a lens module, wherein the reference image comprises a plurality of frame images acquired before and/or after the current frame image;
respectively carrying out feature extraction on the current frame image and the reference image to obtain a shallow local feature map of the current frame image and a shallow local feature map of the reference image;
fusing the shallow local feature map of the reference image into the shallow local feature map of the current frame image to obtain a fused shallow local feature map, wherein in the fusion process, the feature value at a target pixel position in the shallow local feature map of the current frame image is enhanced, and the similarity between the feature value at the target pixel position and the feature value at the corresponding pixel position in the shallow local feature map of the reference image is higher than a preset similarity;
and determining whether dirt exists in the lens module based on the fused shallow local feature map.
2. The method of claim 1, wherein the determining whether there is dirt in the lens module based on the fused shallow local feature map comprises: extracting features of the current frame image to obtain a global feature of the current frame image; fusing the global feature and the shallow local feature map to obtain a fusion feature; and determining whether dirt exists in the lens module based on the fusion feature; and/or
the global feature is represented by a global feature vector, and the fusing the global feature and the shallow local feature map to obtain a fusion feature comprises: performing pooling processing on the shallow local feature map to obtain a shallow local feature vector; and splicing the shallow local feature vector with the global feature vector to obtain the fusion feature, or superposing the shallow local feature vector and the global feature vector to obtain the fusion feature.
3. The method of claim 1, wherein fusing the shallow local feature map of the reference image into the shallow local feature map of the current frame image comprises: respectively carrying out maximum pooling treatment on the shallow local feature map of the reference image and the shallow local feature map of the current frame image; fusing the shallow local feature map of the reference image after the maximum pooling treatment into the shallow local feature map of the current frame image after the maximum pooling treatment; and/or
Fusing the shallow local feature map of the reference image into the shallow local feature map of the current frame image to obtain a fused shallow local feature map, including: respectively fusing the shallow local feature map of each frame of reference image with the shallow local feature map of the current frame image, wherein for each pixel position in the shallow local feature map of the current frame image, if the similarity between the feature value of the pixel position and the feature value of the corresponding pixel position in the shallow local feature map of the reference image is higher, the fusion weight of the feature value of the pixel position is larger; and carrying out superposition processing on the shallow local feature images of each frame obtained by fusion to obtain the fused shallow local feature image.
4. The method of claim 1, wherein the lens module is a lens module on a three-dimensional scanning device, the method further comprising:
extracting features of the current frame image to obtain global features of the current frame image;
extracting features of the reference image to obtain global features of the reference image;
fusing the global features of the reference image into the global features of the current frame image to obtain a first fused global feature; judging the current scanning environment type of the three-dimensional scanning equipment based on the first fusion global features; and/or fusing the global features of the reference image into the global features of the current frame image to obtain a second fused global feature; and judging whether fog exists on the lens module based on the second fusion global feature.
5. The method of claim 4, wherein after determining the current scanning environment type of the three-dimensional scanning device based on the first fused global feature, the method further comprises:
switching the current working mode of the three-dimensional scanning equipment to a target working mode matched with the scanning environment type, or prompting a user to switch the current working mode of the three-dimensional scanning equipment to the target working mode matched with the scanning environment type;
When the three-dimensional scanning equipment is in different working modes, the processing modes of processing the images are different in the process of three-dimensional reconstruction of the target object by utilizing the images acquired by the three-dimensional scanning equipment, and/or the operation modes of devices in the three-dimensional scanning equipment are different in the process of acquiring the images by utilizing the three-dimensional scanning equipment.
6. The method of claim 4, wherein if it is determined that fog is present on the lens module based on the second fused global feature, controlling the three-dimensional scanning device to turn on a defogging function.
7. The method of claim 4, wherein the method is performed by a pre-trained detection model that is trained based on:
acquiring a current frame sample image and a reference sample image acquired by three-dimensional scanning equipment, wherein the reference sample image comprises a plurality of frame images acquired before and/or after the current frame sample image; the current frame sample image carries a label, and the label is used for indicating whether a lens module for collecting the current frame sample image comprises dirt and a scanning environment type corresponding to the current frame sample image;
Inputting the current frame sample image and the reference sample image into a preset initial model, and executing the following operations by the initial model:
respectively extracting features of the current frame sample image and the reference sample image to obtain a shallow local feature image and global features of the current frame sample image, and shallow local feature images and global features of the reference sample image;
fusing the shallow local feature map of the reference sample image into the shallow local feature map of the current frame sample image to obtain a sample fused shallow local feature map, fusing the sample fused shallow local feature map with the global feature of the current frame sample image to obtain a sample fused feature, judging whether the lens module comprises dirt or not based on the sample fused feature, and determining a first loss based on the judgment result and the difference of the labels;
fusing the global features of the current frame sample image and the global features of the reference sample image to obtain a first sample fusion global feature, judging the type of the scanning environment based on the first sample fusion global feature, and determining a second loss based on the judgment result and the difference of the labels;
Model parameters of the initial model are adjusted based on the first loss and the second loss to train to obtain the detection model.
8. The method of claim 7, wherein the initial model comprises:
a feature extraction sub-network, used for respectively performing feature extraction on the current frame sample image and the reference sample image to obtain a shallow local feature map and a global feature of the current frame sample image, and a shallow local feature map and a global feature of the reference sample image;
a shallow local feature fusion sub-network, used for fusing the shallow local feature map of the reference sample image into the shallow local feature map of the current frame sample image to obtain a sample fused shallow local feature map;
a first global fusion sub-network, used for fusing the sample fused shallow local feature map with the global feature of the current frame sample image to obtain a sample fusion feature;
a second global fusion sub-network, used for fusing the global feature of the current frame sample image with the global feature of the reference sample image to obtain a first sample fusion global feature;
a loss joint optimization sub-network, used for judging whether the lens module comprises dirt based on the sample fusion feature and determining a first loss based on a difference between the judgment result and the label, and for judging the scanning environment type based on the first sample fusion global feature, determining a second loss based on a difference between the judgment result and the label information, and adjusting model parameters of the initial model based on the first loss and the second loss so as to train and obtain the detection model.
9. The method of claim 8, wherein the label is further used to indicate whether fog is present on the lens module, and the initial model further comprises:
a third global fusion sub-network, used for fusing the global feature of the current frame sample image with the global feature of the reference sample image to obtain a second sample fusion global feature;
wherein the loss joint optimization sub-network is further used for judging whether fog exists on the lens module based on the second sample fusion global feature, determining a third loss based on a difference between the judgment result and the label information, and adjusting model parameters of the initial model based on the first loss, the second loss and the third loss so as to train and obtain the detection model.
10. The method of claim 7, wherein after inputting the current frame sample image and the reference sample image into the initial model, the initial model is further configured to: determining at least one group of sample image pairs from the current frame sample image and the reference sample image, wherein the scanning environment types corresponding to two frames of sample images in the sample image pairs are the same; determining the similarity of global features of two frames of sample images in the at least one group of sample images, and determining a fourth loss based on the difference between the similarity and a preset similarity threshold; adjusting model parameters of the initial model based on the first loss, the second loss and the fourth loss to train to obtain the detection model; and/or
After the current frame sample image and the reference sample image are input into the initial model, the initial model is further configured to perform the following operations: determining at least one set of sample image triples from the current frame sample image and the reference sample image; each sample image triplet comprises a first sample image, a second sample image with the same scanning environment type as the first sample image, and a third sample image with a scanning environment type different from the scanning environment type of the first sample image; determining a fifth penalty based on a similarity of global features of the second sample image to global features of the first sample image and a similarity of global features of the third sample image to global features of the first sample image; model parameters of the initial model are adjusted based on the first loss, the second loss, and the fifth loss to train to obtain the detection model.
11. The method of claim 1, wherein the lens module is a lens module on a three-dimensional scanning device, the method further comprising:
extracting features of the current frame image to obtain global features of the current frame image;
Judging whether the scanning environment type of the three-dimensional scanning equipment when the current frame image is acquired is a target environment or not according to the proximity degree of the global feature and a preset feature clustering center; the feature cluster center is a cluster center of global features of a plurality of frames of sample images, and the sample images are images acquired by three-dimensional scanning equipment in a target environment.
12. An image detection method, the method comprising:
acquiring a to-be-detected image currently acquired by a lens module on three-dimensional scanning equipment;
inputting the image to be detected into a pre-trained detection model, and judging the current scanning environment type of the three-dimensional scanning equipment through the detection model; wherein the detection model is trained based on the following:
acquiring at least two frames of sample images, wherein the scanning environment types corresponding to the at least two frames of sample images are the same;
respectively extracting features of the at least two frames of sample images by using a preset initial model to obtain global features of each of the at least two frames of sample images;
and determining target loss based on the difference between the similarity of the global features of every two sample images in the at least two frame sample images and a preset similarity threshold, and adjusting model parameters of the initial model based on the target loss so as to train and obtain the detection model.
13. An image detection method, the method comprising:
acquiring a to-be-detected image currently acquired by a lens module on three-dimensional scanning equipment;
inputting the image to be detected into a pre-trained detection model, and judging the current scanning environment type of the three-dimensional scanning equipment through the detection model; wherein the detection model is trained based on the following:
acquiring sample image triplets, wherein each sample image triplet comprises a first sample image, a second sample image with the same scanning environment type as the first sample image, and a third sample image with a scanning environment type different from the scanning environment type of the first sample image;
respectively extracting features of the first sample image, the second sample image and the third sample image through a preset initial model to obtain respective global features;
and determining a target loss based on the similarity of the global features of the second sample image and the global features of the first sample image and the similarity of the global features of the third sample image and the global features of the first sample image, and adjusting model parameters of the initial model based on the target loss so as to train and obtain the detection model.
14. An image detection method, the method comprising:
acquiring a to-be-detected image currently acquired by a lens module on three-dimensional scanning equipment;
extracting features of the image to be detected to obtain global features;
judging whether the current scanning environment type of the three-dimensional scanning equipment is a target environment or not according to the proximity degree of the global feature and a preset feature clustering center; the feature cluster center is a cluster center of global features of a plurality of frames of sample images, and the sample images are images acquired by three-dimensional scanning equipment in a target environment.
15. An image detection apparatus, the apparatus comprising:
the acquisition module is used for acquiring a current frame image and a reference image acquired by the lens module, wherein the reference image comprises a plurality of frame images acquired before and/or after the current frame image;
the feature extraction module is used for carrying out feature extraction on the current frame image and the reference image respectively to obtain a shallow local feature map of the current frame image and a shallow local feature map of the reference image;
the fusion module is used for fusing the shallow local feature map of the reference image into the shallow local feature map of the current frame image to obtain a fused shallow local feature map, wherein in the fusion process, the feature value at a target pixel position in the shallow local feature map of the current frame image is enhanced, and the similarity between the feature value at the target pixel position and the feature value at the corresponding pixel position in the shallow local feature map of the reference image is higher than the preset similarity;
And the prediction module is used for determining whether dirt exists in the lens module or not based on the fused shallow local feature map.
16. An electronic device, comprising a processor, a memory, and computer instructions stored in the memory and executable by the processor, wherein the processor, when executing the computer instructions, implements the method of any one of claims 1-14.
17. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method according to any of claims 1-14.
CN202311719012.5A 2023-12-14 2023-12-14 Image detection method and device and electronic equipment Active CN117409285B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311719012.5A CN117409285B (en) 2023-12-14 2023-12-14 Image detection method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN117409285A true CN117409285A (en) 2024-01-16
CN117409285B CN117409285B (en) 2024-04-05

Family

ID=89498354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311719012.5A Active CN117409285B (en) 2023-12-14 2023-12-14 Image detection method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN117409285B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210256266A1 (en) * 2019-03-18 2021-08-19 Tencent Technology (Shenzhen) Company Limited Target detection method, system, and apparatus, storage medium, and computer device
WO2021035812A1 (en) * 2019-08-30 2021-03-04 深圳市商汤科技有限公司 Image processing method and apparatus, electronic device and storage medium
CN110738211A (en) * 2019-10-17 2020-01-31 腾讯科技(深圳)有限公司 object detection method, related device and equipment
CN113140005A (en) * 2021-04-29 2021-07-20 上海商汤科技开发有限公司 Target object positioning method, device, equipment and storage medium
CN114283316A (en) * 2021-09-16 2022-04-05 腾讯科技(深圳)有限公司 Image identification method and device, electronic equipment and storage medium
CN114581456A (en) * 2022-05-09 2022-06-03 深圳市华汉伟业科技有限公司 Multi-image segmentation model construction method, image detection method and device
CN116188361A (en) * 2022-11-16 2023-05-30 贵州民族大学 Deep learning-based aluminum profile surface defect classification method and device
CN116994022A (en) * 2022-11-23 2023-11-03 腾讯科技(深圳)有限公司 Object detection method, model training method, device, electronic equipment and medium
CN115994900A (en) * 2023-01-18 2023-04-21 深圳市华汉伟业科技有限公司 Unsupervised defect detection method and system based on transfer learning and storage medium
CN117197904A (en) * 2023-03-31 2023-12-08 北京百度网讯科技有限公司 Training method of human face living body detection model, human face living body detection method and human face living body detection device
CN117095019A (en) * 2023-10-18 2023-11-21 腾讯科技(深圳)有限公司 Image segmentation method and related device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MENGTING ZHANG,XIUXIA TIAN: "Transformer architecture based on mutual attention for image-anomaly detection", 《VIRTUAL REALITY & INTELLIGENT HARDWARE》, 28 February 2023 (2023-02-28), pages 57 - 67 *
徐佳宇; 张冬明; 靳国庆; 包秀国; 袁庆升; 张勇东: "PNET: Pixel-level TV Station Logo Recognition Network", Journal of Computer-Aided Design & Computer Graphics, no. 10, 15 October 2018 (2018-10-15), pages 97-108 *
陈前; 刘骊; 付晓东; 刘利军; 黄青松: "Fine-grained Shoe Image Retrieval Based on Part Detection and Semantic Network", Journal of Image and Graphics, no. 08, 12 August 2020 (2020-08-12), pages 70-82 *

Also Published As

Publication number Publication date
CN117409285B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
Li et al. Fusing images with different focuses using support vector machines
EP3496383A1 (en) Image processing method, apparatus and device
CN105933589B (en) A kind of image processing method and terminal
JP4708909B2 (en) Method, apparatus and program for detecting object of digital image
US20130004082A1 (en) Image processing device, method of controlling image processing device, and program for enabling computer to execute same method
CN108024707A (en) Automatic eye fundus image capture systems
JP2021106044A (en) Image processor, method for processing image, and program
CN109993824B (en) Image processing method, intelligent terminal and device with storage function
CN109035147B (en) Image processing method and device, electronic device, storage medium and computer equipment
JP2017211939A (en) Generation device, generation method, and generation program
CN112927209B (en) CNN-based significance detection system and method
CN113724155B (en) Self-lifting learning method, device and equipment for self-supervision monocular depth estimation
CN111179333B (en) Defocus blur kernel estimation method based on binocular stereo vision
JP2019125204A (en) Target recognition device, target recognition method, program and convolution neural network
CN114375466A (en) Video scoring method and device, storage medium and electronic equipment
CN114612605A (en) Three-dimensional human body reconstruction method and device
CN111161136A (en) Image blurring method, image blurring device, image blurring equipment and storage device
CN117409285B (en) Image detection method and device and electronic equipment
KR20210048736A (en) Deblur engine based on machine learning
CN114360015A (en) Living body detection method, living body detection device, living body detection equipment and storage medium
CN113920023A (en) Image processing method and device, computer readable medium and electronic device
JP4831344B2 (en) Eye position detection method
Tiwari et al. Blur Classification Using Wavelet Transform and Feed Forward Neural Network
KR102358355B1 (en) Method and apparatus for progressive deblurring of face image
CN111966219B (en) Eye movement tracking method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant