CN117934478A - Defect detection method, device, equipment and medium

Defect detection method, device, equipment and medium

Info

Publication number: CN117934478A
Application number: CN202410335832.2A
Authority: CN (China)
Prior art keywords: feature, features, target, detection, fusion
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN117934478B
Inventor: 赖锦祥
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202410335832.2A
Publication of CN117934478A
Application granted
Publication of CN117934478B

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a defect detection method, device, equipment and medium, which can be applied to scenes such as cloud technology, artificial intelligence, intelligent traffic and assisted driving. The method comprises the following steps: acquiring detection data of a target object; if the detection data comprises a color image and target modal data, performing feature extraction on the color image and the target modal data to obtain target detection features, wherein the target detection features fuse the features of the color image and the features of the target modal data; performing feature extraction at different scales on the color image through a backbone network of a pre-trained target detection model to obtain multi-scale color map features, and performing feature fusion on the target detection features and the multi-scale color map features to obtain fusion features; and performing defect detection on the target object according to the fusion features to obtain a defect detection result of the target object. The technical scheme of the embodiment of the application improves the robustness of defect detection, avoids missed detection, and ensures the accuracy and reliability of defect detection.

Description

Defect detection method, device, equipment and medium
Technical Field
The present application relates to the field of defect detection technology, and in particular, to a defect detection method, a defect detection apparatus, an electronic device, and a computer readable storage medium.
Background
At present, articles can develop various defects during actual production, and losses are caused if defective articles flow into the market. Factories generally rely on manual spot inspection, which has low detection efficiency, poor results, and cannot achieve full inspection. To improve the accuracy and efficiency of article defect detection, automatic visual defect detection equipment has gradually appeared. Such equipment performs defect detection based on deep learning object detection algorithms: by learning from a large amount of RGB color-image defect data, good recognition accuracy can be achieved. However, some defects are imaged only faintly under ordinary RGB color imaging, which makes them difficult for the detection algorithm to find and easily leads to missed defects.
Disclosure of Invention
The embodiment of the application provides a defect detection method, a defect detection device, electronic equipment, a computer readable storage medium and a computer program product, which improve the robustness of defect detection, avoid missed detections and ensure the accuracy and reliability of defect detection.
Other features and advantages of the application will be apparent from the following detailed description, or may be learned by the practice of the application.
According to an aspect of an embodiment of the present application, there is provided a defect detection method including: acquiring detection data of a target object; if the detection data comprises a color image and target modal data, extracting features of the color image and the target modal data to obtain target detection features, wherein the target detection features fuse the features of the color image and the features of the target modal data; extracting features of different scales from the color image through a backbone network of a pre-trained target detection model to obtain multi-scale color map features, and carrying out feature fusion on the target detection features and the multi-scale color map features to obtain fusion features; and performing defect detection on the target object according to the fusion features to obtain a defect detection result of the target object.
According to an aspect of an embodiment of the present application, there is provided a defect detecting apparatus including: an acquisition module, used for acquiring detection data of a target object; a feature processing module, used for extracting features of the color image and the target modal data to obtain target detection features if the detection data comprises the color image and the target modal data; a feature fusion module, used for extracting features of different scales from the color image through a backbone network of a pre-trained target detection model to obtain multi-scale color map features, and carrying out feature fusion on the target detection features and the multi-scale color map features to obtain fusion features; and a defect detection module, used for carrying out defect detection on the target object according to the fusion features to obtain a defect detection result of the target object.
In an embodiment of the present application, the feature processing module is further configured to: if the detection data includes the color image and the target modal data, and the data type of the target modal data includes a vector diagram type, select a graph-type modal branch of the target detection model to perform feature extraction on the color image and the target modal data respectively, so as to obtain color map features and vector map features respectively; perform feature fusion on the color map features and the vector map features to obtain target graph features; and carry out convolution processing on the target graph features to obtain the target detection features.
In an embodiment of the present application, the object detection model further includes at least one feature fusion module; the feature processing module is further used for inputting the target graph features to a first convolution layer of the feature fusion module to carry out convolution processing to obtain first features; performing nonlinear transformation processing on the first feature according to an activation function to obtain a second feature; and inputting the second feature to a second convolution layer of the feature fusion module to carry out convolution processing to obtain the target detection feature.
In an embodiment of the application, the graph-type modal branch includes a first convolution layer and a second convolution layer; the feature processing module is further used for inputting the color image and the target modal data to a first convolution layer of the graph type modal branch to perform feature extraction, so as to obtain color graph features and vector graph features; performing feature stitching processing on the color map features and the vector map features to obtain stitching features; and inputting the spliced features to a second convolution layer of the graph type modal branch to perform feature extraction to obtain the target graph features.
In an embodiment of the application, the graph-type modal branch further comprises a self-attention layer; the model selection module is further used for inputting the color map features and the vector map features to the self-attention layer so as to calculate the similarity of the color map features and the vector map features and generate attention weights according to the similarity; performing feature alignment processing on the color map features and the vector map features according to the attention weight; and performing feature stitching on the color map features and the vector map features after feature alignment processing to obtain the stitching features.
In an embodiment of the present application, the feature processing module is further configured to input the color image to a first convolution layer of a single-mode branch of the target detection model for feature processing if the detection data includes the color image; inputting the characteristics output by the first convolution layer of the single-mode branch into a second convolution layer of the single-mode branch for characteristic processing to obtain convolution graph characteristics; and carrying out convolution processing on the convolution graph characteristics to obtain the target detection characteristics.
In an embodiment of the present application, the backbone network includes a plurality of backbone network layers connected in sequence; the feature fusion module is further used for inputting the color image to a first backbone network layer for first-scale feature extraction to obtain first-scale color image features, and carrying out feature fusion processing on the first-scale color image features and the target detection features to obtain first fusion features; inputting the first fusion feature to a next backbone network layer for second-scale feature extraction to obtain a second-scale color map feature, and carrying out feature fusion processing according to the second-scale color map feature and the target detection feature to obtain a second fusion feature; and inputting the second fusion feature to a next backbone network layer, and carrying out feature fusion processing again until a final backbone network layer outputs a color map feature with a target scale, and obtaining the fusion feature according to the color map feature with the target scale and the target detection feature.
In an embodiment of the present application, the target detection model further includes a plurality of feature fusion modules connected in sequence, where the number of feature fusion modules is the same as the number of backbone network layers; the feature fusion module is further used for carrying out convolution processing on the target detection feature through a next feature fusion module connected with the first feature fusion module to obtain a depth detection feature corresponding to the color map feature of the second scale; the target detection feature is the feature output by the first feature fusion module; and carrying out feature addition processing on the color map features of the second scale and the depth detection features to obtain the second fusion features.
In an embodiment of the present application, the feature fusion module is further configured to generate a first weight of the second-scale color map feature according to the second-scale color map feature; generating a second weight corresponding to the depth detection feature according to the first weight; and carrying out feature weighted summation on the color map features of the second scale and the depth detection features according to the first weight and the second weight to obtain the second fusion features.
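The weighted summation described above can be sketched as follows. This is a hypothetical illustration, assuming a PyTorch-style framework: the first weight is predicted from the second-scale color map feature by a 1x1 convolution followed by a sigmoid, and the second weight is derived from the first as its complement; these layer choices are assumptions, not the patent's concrete implementation.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Illustrative sketch of the weighted-sum fusion of a color map feature
    and a depth detection feature (names and layers are assumptions)."""

    def __init__(self, channels: int):
        super().__init__()
        self.weight_head = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, color_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        w1 = torch.sigmoid(self.weight_head(color_feat))  # first weight, generated from the color map feature
        w2 = 1.0 - w1                                      # second weight, derived from the first weight
        return w1 * color_feat + w2 * depth_feat           # feature weighted summation -> second fusion feature
```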
In an embodiment of the present application, the defect detection module is further configured to input the fusion feature to a detection head module of the target detection model for defect detection; and obtaining the defect type and the defect frame of the target object output by the detection head module, and obtaining the defect detection result according to the defect type and the defect frame.
In an embodiment of the present application, the acquisition module is further configured to: image the front part of the target object with a plurality of photographing devices under a plurality of illumination conditions to obtain a plurality of images, where each image corresponds to one illumination condition; acquire the photometric information corresponding to each image, calculate the normal vector of each pixel point according to the photometric information of each image so as to generate a normal vector diagram corresponding to the front part of the target object, and take the normal vector diagram as the target modal data; and select a color image corresponding to the front part from the plurality of images, and generate the detection data for the front part of the target object based on the color image and the target modal data.
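The patent does not fix a specific algorithm for deriving the per-pixel normal vectors; the following sketch uses classical least-squares photometric stereo purely as an illustration of how a normal vector diagram could be built from images taken under several known illumination directions.

```python
import numpy as np

def normal_map_from_photometric_stereo(images: np.ndarray, light_dirs: np.ndarray) -> np.ndarray:
    """Illustrative photometric-stereo sketch (an assumption, not the patent's method).

    images: grayscale intensities of shape (k, H, W), one image per illumination condition.
    light_dirs: known illumination directions of shape (k, 3).
    Returns a unit normal vector per pixel, shape (H, W, 3).
    """
    k, h, w = images.shape
    intensities = images.reshape(k, -1)                                  # (k, H*W)
    # Solve light_dirs @ n = intensity for every pixel in one least-squares call.
    normals, *_ = np.linalg.lstsq(light_dirs, intensities, rcond=None)  # (3, H*W)
    normals = normals.T.reshape(h, w, 3)
    norm = np.linalg.norm(normals, axis=-1, keepdims=True)
    return normals / np.clip(norm, 1e-8, None)                           # per-pixel unit normal vector
```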
In an embodiment of the present application, the apparatus further includes a training module, where the training module is configured to obtain sample detection data including a sample color image and sample modal data of a sample object, and a marked sample defect result carried by the sample object; inputting the sample detection data into a multi-modal branch of a model to be trained, and inputting the sample color image into a single-modal branch of the model to be trained to obtain corresponding sample detection characteristics; carrying out feature processing on the sample color images in different scales through a backbone network of the model to be trained to obtain multi-scale sample color map features, and carrying out feature fusion on the sample detection features and the multi-scale sample color map features to obtain sample fusion features; performing defect detection on the sample fusion characteristics through a detection head of the model to be trained to obtain a sample defect detection result; and constructing model loss according to the difference of the sample defect detection result and the marked sample defect result, and adjusting model parameters of the model to be trained according to the model loss to obtain the target detection model.
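A minimal single-step training sketch of the above procedure is shown below, assuming a PyTorch-style model that accepts the sample color image plus sample modal data and returns defect predictions; the model, optimizer and criterion names are illustrative assumptions, and the criterion stands in for the unspecified loss built from the difference between the predicted and labelled defect results.

```python
import torch

def train_step(model, optimizer, criterion, sample_color, sample_modal, labels):
    """One illustrative parameter update for the model to be trained (names are assumptions)."""
    model.train()
    optimizer.zero_grad()
    predictions = model(sample_color, sample_modal)  # multi-modal/single-modal branch + backbone + detection head
    loss = criterion(predictions, labels)            # model loss from the prediction vs. labelled defect result
    loss.backward()                                  # back-propagate the model loss
    optimizer.step()                                 # adjust the model parameters
    return loss.item()
```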
According to one aspect of an embodiment of the present application, an electronic device is provided, including one or more processors; and storage means for storing one or more computer programs which, when executed by the one or more processors, cause the electronic device to implement the defect detection method as described above.
According to an aspect of an embodiment of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor of an electronic device, causes the electronic device to perform the defect detection method as described above.
According to an aspect of the embodiments of the present application, there is provided a computer program product including a computer program stored in a computer-readable storage medium, a processor of an electronic device reading and executing the computer program from the computer-readable storage medium, causing the electronic device to execute the defect detection method as above.
In the technical scheme provided by the embodiment of the application, detection data of a target object is obtained. If the detection data comprises a color image and target modal data, feature extraction is performed on the color image and the target modal data to obtain target detection features, and the target detection features fuse the features of the color image and the features of the target modal data. The color image is then subjected to feature extraction at different scales through a backbone network of a target detection model to obtain multi-scale color map features; that is, rich feature representations at different scales are extracted from the color image to comprehensively capture the visual information of the image. Feature fusion is performed on the target detection features and the multi-scale color map features to obtain fusion features, so that multi-modal, multi-scale features are adaptively fused and the color image and the target modal data are comprehensively utilized at different scales. Defect detection is then carried out according to the fusion features. By comprehensively utilizing information from different modalities and different scales, the target object can be understood more comprehensively, which improves the robustness of defect detection, avoids missed detection, and ensures the accuracy and reliability of defect detection.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is evident that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a schematic diagram of an implementation environment in which the present application is directed.
Fig. 2 is a flowchart illustrating a defect detection method according to an exemplary embodiment of the present application.
Fig. 3 is a schematic diagram illustrating another defect detection method according to an exemplary embodiment of the present application.
Fig. 4 is a schematic diagram illustrating another defect detection method according to an exemplary embodiment of the present application.
Fig. 5 is a flowchart illustrating another defect detection method according to an exemplary embodiment of the present application.
Fig. 6 is a flowchart illustrating another defect detection method according to an exemplary embodiment of the present application.
Fig. 7 is a flowchart illustrating another defect detection method according to an exemplary embodiment of the present application.
Fig. 8 is a flowchart illustrating another defect detection method according to an exemplary embodiment of the present application.
Fig. 9 is a flowchart illustrating another defect detection method according to an exemplary embodiment of the present application.
Fig. 10 is a flowchart illustrating another defect detection method according to an exemplary embodiment of the present application.
FIG. 11 is a schematic flow diagram of the overall structure of defect detection according to an exemplary embodiment of the present application.
Fig. 12 is a block diagram of a backbone network of an object detection model according to an exemplary embodiment of the present application.
Fig. 13 is a block diagram showing a structure of a defect detecting apparatus according to an exemplary embodiment of the present application.
Fig. 14 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations may be decomposed and some operations may be combined or partially combined, so that the actual execution order may change according to the actual situation.
It should also be noted that in the present application the term "plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may represent: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
The technical scheme of the embodiment of the application also relates to the field of Cloud technology, and before the technical scheme of the embodiment of the application is introduced, the Cloud technology is introduced briefly.
Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like that are applied based on the cloud computing business model. It can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support.
Cloud computing is a computing model that distributes computing tasks across a resource pool formed by a large number of computers, enabling various application systems to acquire computing power, storage space and information services as needed. The network that provides the resources is referred to as the "cloud". From the user's point of view, the resources in the cloud can be expanded without limit, acquired at any time, used on demand, expanded at any time and paid for according to use.
The training process of the target detection model in the embodiment of the application can be carried out using the computing resources provided by cloud computing. The technical scheme of the embodiment of the application also relates to the field of Artificial Intelligence (AI). AI is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, AI is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of AI and the fundamental way to endow computers with intelligence, and it is applied in all fields of AI. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and teaching learning.
The technical scheme of the embodiment of the application relates to a machine learning technology in AI, in particular to a target detection model obtained based on training of the machine learning technology, thereby realizing defect detection, and the technical scheme of the embodiment of the application is described in detail as follows:
Referring to fig. 1, fig. 1 is a schematic diagram of an implementation environment according to the present application. The implementation environment includes a terminal 10 and a server 20.
The terminal 10 is configured to acquire detection data of a target object and transmit the detection data to the server 20.
The server 20 is configured to extract features of the color image and the target modal data to obtain target detection features if the detection data includes the color image and the target modal data, wherein the target detection features are fused with features of the color image and features of the target modal data, then extract features of different scales of the color image through a backbone network of a pre-trained target detection model to obtain multi-scale color map features, fuse the target detection features and the multi-scale color map features to obtain fused features, and finally detect defects of the target object according to the fused features to obtain a defect detection result of the target object.
The server may also send the defect detection result of the target object to the terminal, so that the terminal may perform response processing on the target object according to the defect detection result, for example, repair the defect of the target object, or eliminate the target object.
In some embodiments, the server 20 may also obtain the detection data of the target object by itself, then obtain the target detection feature according to the detection data, further extract the features of different scales from the color image through the backbone network of the pre-trained target detection model to obtain the multi-scale color map feature, and perform feature fusion on the target detection feature and the multi-scale color map feature to obtain the fusion feature, so as to perform defect detection based on the fusion feature.
In some embodiments, the terminal 10 may also implement defect detection separately, that is, the terminal 10 acquires the detection data of the target object, so as to perform defect detection through the target detection feature corresponding to the detection data and the color map feature corresponding to the color map image.
The terminal 10 may be any electronic device capable of acquiring the modal data of the target object, such as a smart phone, a tablet, a notebook computer, a computer, an intelligent voice interaction device, an intelligent home appliance, a vehicle-mounted terminal or an aircraft. The server 20 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content distribution networks), big data and artificial intelligence platforms, which are not limited herein.
The terminal 10 and the server 20 previously establish a communication connection through a network so that the terminal 10 and the server 20 can communicate with each other through the network. The network may be a wired network or a wireless network, and is not limited in this regard.
It should be noted that: the embodiment of the invention can detect and process defects of various objects, and can be applied to various scenes, including but not limited to objects in various scenes such as cloud technology, AI (ARTIFICIAL INTELLIGENCE ), intelligent traffic, auxiliary driving and the like.
Specifically, if the technical scheme of the embodiment of the application is applied to an intelligent traffic scene, the terminal can be a vehicle-mounted terminal, the vehicle-mounted terminal runs with a target detection model, the vehicle-mounted terminal acquires detection data of a vehicle tire, under the condition that the detection data comprises a color image and target mode data, the color image and the target mode data are subjected to feature extraction to obtain target detection features, further, the color image is subjected to feature extraction of different scales through a main network of the target detection mode, then, the target detection features and the multi-scale color map features are subjected to feature fusion to obtain fusion features, and further, the vehicle tire is subjected to defect detection based on the fusion features, so that the usability of the vehicle tire is determined.
It should be noted that, in the specific embodiment of the present application, the detection data relates to the object, when the embodiment of the present application is applied to the specific product or technology, the permission or consent of the object needs to be obtained, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant country and region.
Various implementation details of the technical solutions of the embodiments of the present application are set forth in detail below.
As shown in fig. 2, fig. 2 is a flowchart illustrating a defect detection method according to an embodiment of the present application. The method may be applied to the implementation environment shown in fig. 1 and may be performed by a terminal or a server, or by the terminal and the server together. In the embodiment of the present application, the method performed by the server is taken as an example; the defect detection method may include S210 to S240, which are described in detail below.
S210, acquiring detection data of a target object.
In the embodiment of the application, the target object can be any article, such as a lithium battery, a toy car, a mobile phone accessory and the like; the detection data refer to data detected by a target object under different aspects or modes, wherein the modes can be understood as different characterization or acquisition modes of the data, such as images and texts belong to different modes; the detection data in the embodiment of the present application at least includes a color image, that is, an RGB image in which each pixel is composed of three components of red (R), green (G), and blue (B).
In an example, the color image included in the detection data may be one color image or may be a plurality of color images of different orientations of different target objects, which is not limited herein.
In an example, a target object can be shot through a camera to obtain a color image, and only the color image is used as detection data of the target object; the product description of the target object and the color image can be used as the detection data of the target object; and a normal vector diagram of the target object can be obtained, and the color image and the normal vector diagram are used as detection data.
And S220, if the detection data comprise the color image and the target modal data, extracting the characteristics of the color image and the target modal data to obtain target detection characteristics, wherein the characteristics of the color image and the characteristics of the target modal data are fused with the target detection characteristics.
In the case where the detection data includes a color image and target modality data, the detection data is represented as multi-modality data, wherein the target modality data refers to detected data of the target object in a specified modality, such as text data or normal vector diagram data, without limitation.
In the embodiment of the application, the detection data comprises a color image and target modal data, and the characteristic extraction processing is required to be carried out on both the color image and the target modal data to obtain the target detection characteristic.
In an example, feature extraction processing may be performed on the color image and the target modal data to obtain image features of the color image and data features of the target modal data, so that the target detection features are obtained by fusing or stitching the image features of the color image with the data features of the target modal data.
S230, extracting features of different scales from the color image through a backbone network of the pre-trained target detection model to obtain multi-scale color map features, and carrying out feature fusion on the target detection features and the multi-scale color map features to obtain fusion features.
It should be noted that the target detection model in the embodiment of the present application is a pre-trained model for detecting article defects, and it includes a backbone network that extracts features of the color image at different scales and fuses them with the target detection features. The color image is input into the target detection model, and the backbone network of the target detection model extracts features of different scales from the color image through convolution and other operations, thereby obtaining abstract features that represent the image content.
It should be noted that the scale of a feature generally refers to the spatial extent or granularity at which the feature is described. Changes of feature scale can also be realized through different layers of the network: image content features can be extracted at different levels and different scales through different layers of the backbone network, so that multi-scale color map features are output. Shallower layers generally extract low-level, small-scale features, while deeper layers extract higher-level, large-scale features; that is, the lower-level color map features corresponding to smaller scales focus more on the details or local information of the image, while the higher-level color map features corresponding to larger scales focus more on the global or abstract characteristics of the image. The target detection features and the multi-scale color map features are then fused to obtain fusion features, so that information at all these levels can be comprehensively utilized to realize multi-scale feature fusion, improving the understanding and detection capability for the target object.
In an example, the feature fusion of the target detection feature and the multi-scale color map feature may proceed scale by scale: the color map feature at the current scale is fused with the target detection feature to obtain the fusion feature at the current scale, the color map feature at the next scale is then obtained from the fusion feature at the current scale, and this feature is again fused with the target detection feature, finally yielding the fusion feature.
In another example, feature extraction at different scales may be performed on the target detection feature to obtain multi-scale target detection features, and feature fusion may then be performed on the multi-scale target detection features and the multi-scale color map features to obtain the fusion feature.
S240, performing defect detection on the target object according to the fusion characteristics to obtain a defect detection result of the target object.
The fusion feature fuses the color map feature and the target detection feature, and represents the model's comprehensive understanding of the target in the detection data, in which information from different sources and different scales is considered. Defect detection is then carried out on the target object according to the fusion feature, so that an accurate defect detection result for the target object is obtained; the defect detection result includes the defect type, defect size and defect position of the target object.
In an example, the target detection model further includes a detection network, and the fusion feature is input into the detection network to obtain a defect detection result.
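As an illustrative sketch of such a detection network, the head below predicts, per spatial location of the fusion feature, defect class scores and bounding-box regression values; the two-branch layout, layer sizes and anchor handling are assumptions rather than the patent's exact head design.

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """Illustrative detection-head sketch (assumed structure, not the patent's exact head)."""

    def __init__(self, channels: int, num_classes: int, num_anchors: int = 1):
        super().__init__()
        self.cls_branch = nn.Conv2d(channels, num_anchors * num_classes, kernel_size=3, padding=1)
        self.box_branch = nn.Conv2d(channels, num_anchors * 4, kernel_size=3, padding=1)

    def forward(self, fusion_feature):
        cls_logits = self.cls_branch(fusion_feature)  # defect type scores
        box_deltas = self.box_branch(fusion_feature)  # defect frame (bounding box) regression values
        return cls_logits, box_deltas
```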
In the embodiment of the application, detection data of the target object is obtained. If the detection data comprises a color image and target modal data, feature extraction is performed on the color image and the target modal data to obtain target detection features, which fuse the features of the color image and the features of the target modal data. The color image then undergoes feature extraction at different scales through the backbone network of the target detection model to obtain multi-scale color map features; that is, rich feature representations at different scales are extracted from the color image to comprehensively capture the visual information of the image. The target detection features and the multi-scale color map features are fused to obtain fusion features, realizing adaptive fusion of multi-modal, multi-scale features so that the color image and the target modal data are comprehensively utilized at different scales. Defect detection is then performed according to the fusion features. Because information from different modalities and different scales is comprehensively utilized, the model can understand the target object more comprehensively, which improves the robustness of defect detection, avoids missed detection, and ensures the accuracy and reliability of defect detection.
In an embodiment of the present application, another defect detection method is provided, which may be applied to the implementation environment shown in fig. 1. The method may be performed by a terminal or a server, or by the terminal and the server together; in this embodiment, the method performed by the server is taken as an example. As shown in fig. 5, on the basis of S210 to S240 shown in fig. 2, S220 is extended to S310 to S330, which are described in detail below.
S310, if the detection data comprises color images and target modal data, and the data types of the target modal data comprise vector diagram types, selecting diagram type modal branches of the target detection model to respectively conduct feature extraction on the color images and the target modal data so as to respectively obtain color diagram features and vector diagram features.
It should be noted that the target detection model in the embodiment of the present application is compatible with both single-modal and multi-modal data inputs. The target detection model comprises at least two branches, and different branches can be used to extract the features of different modal data. In one example, the target detection model includes a single-modal branch for extracting image features of the color image, and also includes a multi-modal branch that can extract not only image features of a color image but also data features of other modal data. In another example, the target detection model includes a plurality of branches, each branch being used to extract the data features of a particular modality; for example, a graph-type modal branch is used to extract features of image-type data, and a text modal branch is used to extract features of text-type data.
In the embodiment of the present application, if the detection data includes a color image and target modal data, and the data type of the target modal data includes a vector diagram type (a graph type of the same kind as the color image), the target modal data of the vector diagram type is a normal vector diagram, i.e. an image describing surface normal vectors that generally contains three channels (e.g., the X, Y and Z components). In this case, the graph-type modal branch of the target detection model can be selected to perform feature extraction on the color image and the normal vector diagram respectively, obtaining color map features and vector map features respectively. The color map features represent texture, color and other information of the target object, while the normal vector map features represent the normal information of the surface of the target object.
In other embodiments, if the data type of the target modal data includes a text type, selecting a graph type modal branch of the target detection model to perform feature extraction on the color image to obtain a color graph feature, and selecting a text modal branch of the target detection model to perform feature extraction on the target modal data to obtain a text feature.
It can be understood that the color map features here and the color map features obtained through the backbone network are extracted by different feature extraction methods and are therefore different features.
S320, carrying out feature fusion on the color map features and the vector map features to obtain target map features.
In an example, feature fusion may be performed on the color map features and the vector map features through a map type modal branch of the target detection model, for example, feature stitching may be performed on the color map features and the vector map features to obtain target map features, where the target map features are comprehensive representations of the color image and the normal vector map, and may include higher-level and more abstract information.
S330, carrying out convolution processing on the target graph characteristics to obtain target detection characteristics.
In an example, further convolution processing may be performed on the target graph features to extract and enhance the information related to the defect detection task, such as the position and size of the target, so as to obtain higher-level target detection features. The convolution processing also adjusts the number of channels of the features. Here, the dimension of a feature refers to the length of the feature vector or the dimension of the feature space, and the feature dimension generally corresponds to the number of channels of the feature map; for example, each pixel of an RGB image is typically composed of three channels (red, green, blue), so each pixel can be regarded as a three-dimensional feature vector. The dimension of the target graph features is therefore adjusted by a convolution operation so that the feature dimension of the target detection features matches the feature dimension of the color map features at the corresponding scale, which facilitates the subsequent feature fusion with the color map features.
If the target detection model further comprises at least one feature fusion module, the feature fusion module is different from the structure of the backbone network layer, and the feature fusion module is used for further extracting features of the target graph features.
In an example, the feature fusion module comprises a first convolution layer and a second convolution layer. The target graph features are input to the first convolution layer of the feature fusion module for convolution processing to obtain first features; the first convolution layer uses a 1x1 convolution kernel to perform the convolution operation along the channel dimension, and since a 1x1 convolution does not fuse pixels in the height-width direction, all of its computation is cross-channel computation, so fusion between channels can be completed better; because the channels represent features, higher-level features can be further extracted. Then, nonlinear transformation processing is performed on the first features according to an activation function to improve the expressive capacity of the model, i.e. the linear relationship in the first features is removed through the activation function to obtain second features, where the activation function includes, but is not limited to, ReLU, Sigmoid and the like. The second features are then input to the second convolution layer of the feature fusion module for convolution processing to obtain the target detection features; the second convolution layer uses a 1x1 convolution kernel to perform a weighted sum or nonlinear mapping of the activation-function output along the channel dimension so as to adjust the number of channels, so that the obtained target detection features match the feature dimension of the corresponding backbone network layer and are deeper.
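The structure just described (1x1 convolution, activation, 1x1 convolution that re-projects the channel count) can be sketched as follows; the concrete channel sizes and the choice of ReLU are assumptions for illustration only.

```python
import torch.nn as nn

class FeatureFusionModule(nn.Module):
    """Illustrative sketch of the feature fusion module described above (sizes are assumptions)."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, in_channels, kernel_size=1)   # first 1x1 convolution (cross-channel mixing)
        self.act = nn.ReLU(inplace=True)                                   # nonlinear transformation
        self.conv2 = nn.Conv2d(in_channels, out_channels, kernel_size=1)   # second 1x1 convolution, matches the backbone layer's channels

    def forward(self, target_graph_feature):
        first_feature = self.conv1(target_graph_feature)
        second_feature = self.act(first_feature)
        return self.conv2(second_feature)                                  # target detection feature
```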
In other embodiments of the present application, if the detection data includes a color image, the color image is input to a first convolution layer of a single-mode branch of the target detection model for feature processing; inputting the characteristics output by the first convolution layer of the single-mode branch into a second convolution layer of the single-mode branch for characteristic processing to obtain convolution graph characteristics; and carrying out convolution processing on the convolution graph characteristics to obtain target detection characteristics.
In the case where the detection data contains only a color image, the detection data is single-modal data, and the color image is input to the single-modal branch of the target detection model. The single-modal branch is used for extracting image features of the color image and comprises a first convolution layer and a second convolution layer. The color image is input to the first convolution layer for feature processing; the first convolution layer uses a 1x1 convolution kernel to perform the convolution operation along the channel dimension and further extracts higher-level features. The features output by the first convolution layer of the single-modal branch are then input to the second convolution layer of the single-modal branch for feature processing to obtain convolution graph features; the second convolution layer also uses a 1x1 convolution kernel to further extract features along the channel dimension. That is, the combination of the two 1x1 convolution layers can learn the complex relationships between the different channels of the color image and extract higher-level, more abstract channel features. The convolution graph features can then be input to at least one feature fusion module for convolution processing to obtain the target detection features.
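The single-modal path and the choice between the two branches can be sketched as below. The dictionary-style detection-data container, the channel sizes and the branch interfaces are hypothetical assumptions used only to show how the color-image-only case falls back to the single-modal branch before the fusion module produces the target detection feature.

```python
import torch
import torch.nn as nn

class SingleModalBranch(nn.Module):
    """Illustrative single-modal branch: two stacked 1x1 convolutions over the color image alone."""

    def __init__(self, in_channels: int = 3, hidden_channels: int = 32):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, hidden_channels, kernel_size=1)
        self.conv2 = nn.Conv2d(hidden_channels, hidden_channels, kernel_size=1)

    def forward(self, color_image: torch.Tensor) -> torch.Tensor:
        return self.conv2(self.conv1(color_image))  # convolution graph feature

def extract_detection_feature(detection_data: dict, single_branch, graph_branch, fusion_module):
    """Hypothetical branch selection: use the graph-type modal branch when target modal
    data is present, otherwise the single-modal branch; the fusion module then yields
    the target detection feature."""
    if detection_data.get("modal") is not None:
        feature = graph_branch(detection_data["color"], detection_data["modal"])
    else:
        feature = single_branch(detection_data["color"])
    return fusion_module(feature)
```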
It should be noted that, for other detailed descriptions of S210, S230 to S240 shown in fig. 3, please refer to S210, S230 to S240 shown in fig. 2, and the detailed descriptions are omitted here.
In the embodiment of the application, the corresponding modal branch of the target detection model can be selected for feature extraction to obtain the modal features corresponding to each kind of modal data; that is, the target detection model is compatible with inputs containing different numbers of modalities, so it adapts more flexibly to various modal scenarios. For the color image and the normal vector diagram, the respective features are obtained through the graph-type modal branch and are then fused, so that the resulting modal features can comprehensively represent the color image and the normal vector diagram.
In the embodiment of the present application, the method performed by the server is again taken as an example for description. As shown in fig. 4, on the basis of the embodiment shown in fig. 3, S310 and S320 are extended to S410 to S430. The graph-type modal branch in S320 includes a first convolution layer and a second convolution layer; S410 to S430 are described in detail below.
S410, if the detection data comprise color images and target modal data, and the data types of the target modal data comprise vector diagram types, the color images and the target modal data are input into a first convolution layer of a diagram type modal branch to perform feature extraction, and color diagram features and vector diagram features are obtained.
In the embodiment of the application, the color image and the normal vector diagram are input into a first convolution layer to carry out convolution operation, and for each input, the convolution layer uses a convolution kernel to extract features on the color image and the normal vector diagram. In an example, the convolution kernel of the first convolution layer is 1x1, so that the convolution operation does not consider the neighborhood information any more, but the pixel point of each channel is processed independently, and then the features in the channel dimension are extracted from the color image and the normal vector diagram respectively through the 1x1 convolution operation.
And S420, performing feature stitching processing on the color map features and the vector map features to obtain stitching features.
And splicing the color image features and the normal vector image features according to channels so as to integrate the two types of information into one feature representation, so that the model can consider the information of the color image and the normal vector image at the same time.
In one example, to allow features of both modalities to learn and enhance each other in order to obtain richer features, the similarity between the color map features and the vector map features may be calculated, thereby achieving alignment and enhancement of the features.
Wherein the graph-type modality branch further comprises a self-attention layer; inputting the color map features and the vector map features into a self-attention layer, wherein the self-attention layer calculates the similarity of the color map features and the vector map features through a self-attention mechanism, and generates attention weights according to the similarity; in one example, the similarity of the color map features and the vector map features is calculated by a point multiplication operation, and then the similarity is converted into a probability distribution by a Softmax function, and the probability distribution is used as the attention weight.
Then feature alignment processing is performed on the color map features and the vector map features according to the attention weights, namely: aligned color map features = color map features + vector map features × attention weights; aligned vector map features = vector map features + color map features × attention weights. Feature stitching is then performed on the aligned color map features and vector map features to obtain the stitching features.
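One possible way to realize the symmetric alignment formulas above is sketched below. It assumes both features have been flattened to shape (batch, positions, channels); the use of two directional softmaxes over the dot-product similarity is one interpretation of "attention weight" and is an assumption rather than the patent's exact definition.

```python
import torch
import torch.nn.functional as F

def align_and_concat(color_feat: torch.Tensor, vector_feat: torch.Tensor) -> torch.Tensor:
    """Illustrative alignment-and-stitching sketch for (B, N, C) color and vector map features."""
    similarity = torch.matmul(color_feat, vector_feat.transpose(1, 2))   # (B, N, N) dot-product similarity
    attn_c2v = F.softmax(similarity, dim=-1)                              # attention weights over vector-map positions
    attn_v2c = F.softmax(similarity.transpose(1, 2), dim=-1)              # attention weights over color-map positions

    aligned_color = color_feat + torch.matmul(attn_c2v, vector_feat)      # color feature + weighted vector feature
    aligned_vector = vector_feat + torch.matmul(attn_v2c, color_feat)     # vector feature + weighted color feature

    return torch.cat([aligned_color, aligned_vector], dim=-1)             # stitching along the channel dimension
```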
S430, inputting the spliced features into a second convolution layer of the graph type modal branch to perform feature extraction to obtain target graph features.
In one example, the stitching features are input to the second convolution layer, whose convolution kernel operates on the stitching features to extract higher-level target graph features; in one example, the convolution kernel of the second convolution layer is 1x1, i.e., the convolution operation is performed along the channel dimension to extract higher-level target graph features.
In an example, in addition to considering the combination of the color image and the normal vector image, the feature of the color image may be combined separately, and the feature extraction performed by inputting the spliced feature to the second convolution layer of the graph type modal branch to obtain the target graph feature includes: inputting the spliced features to a second convolution layer of the graph type branch to perform feature extraction to obtain first graph features; inputting the color map features into a third convolution layer of the target detection model to perform feature extraction to obtain second map features; and obtaining weight values respectively corresponding to the first graph feature and the second graph feature, and carrying out weighted summation on the first graph feature and the second graph feature according to the weight values to obtain the target graph feature.
The third convolution layer of the target detection model may be a convolution layer of another modal branch, the third convolution layer uses a 1x1 convolution kernel, after extracting a color image feature, the color image feature is separately input into the third convolution layer, and the second image feature of the feature in the channel dimension is further extracted through the 1x1 convolution kernel, that is, in the channel dimension, the combination of the two 1x1 convolution layers can learn the complex relationship between different channels of the color image, and extract the channel feature with higher level and more abstract.
As described above, the spliced features are input into the second convolution layer to extract the comprehensive representation of the color image and the normal vector diagram; that is, the first graph feature corresponds to the high-level features of the joint representation of the color image and the normal vector diagram, while the second graph feature corresponds to the high-level channel features of the color image alone. By obtaining the weight values corresponding to the first graph feature and the second graph feature and performing a weighted summation of the two according to these weights, the resulting target graph features can, to a certain extent, make up for any loss of color-image information in the first graph feature, making the target graph features more complete and of a higher level. The weight values corresponding to the first graph feature and the second graph feature can be preset, with the weight of the first graph feature larger than that of the second graph feature; the specific values can be flexibly adjusted according to the actual situation, for example a weight of 0.8 for the first graph feature and 0.2 for the second graph feature.
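The weighted summation itself is a one-line operation; the sketch below uses the 0.8/0.2 split mentioned above as default values, with the understanding that these weights are examples and can be tuned as long as the first (joint) feature is weighted more heavily than the color-only second feature.

```python
def combine_graph_features(first_graph_feature, second_graph_feature,
                           w_first: float = 0.8, w_second: float = 0.2):
    """Weighted summation of the joint feature and the color-only feature (example weights)."""
    return w_first * first_graph_feature + w_second * second_graph_feature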
It should be noted that having the 1x1 convolution kernel operate in the channel dimension allows the network to learn complex relationships between channels without involving spatial neighborhoods, which is effective for fusing information of different channels, especially if the spatial structure of the input data is not a prime consideration.
In other embodiments of the present application, the first convolution layer and the second convolution layer may also be of other kernel sizes, such as the use of a 3x3 convolution kernel at the first convolution layer may capture smaller local features to help the network learn more complex features; the second convolution layer captures a greater range of information using a 5x5 convolution kernel to increase the perception of more global features by the network.
It should be noted that, for the detailed description of S210, S330, S230 to S240 shown in fig. 4, please refer to S210, S330, S230 to S240 shown in fig. 3, and the detailed description is omitted here.
In the embodiment of the application, features in the channel dimension are extracted from the color image and the normal vector image through the first convolution layer, feature splicing processing is performed on the features of the color image and the features of the normal vector image to obtain the spliced feature, and the spliced feature is then processed by the convolution kernel of the second convolution layer to extract higher-level modal features, so that the modal features can contain the complex relationships among different channels and the information of the different channels is effectively fused.
In an embodiment of the present application, another defect detection method is provided, and the method is described by using the server as an example. As shown in fig. 5, S230 shown in fig. 2 is extended to S510 to S530 on the basis of the embodiment shown in fig. 2. The backbone network of the target detection model comprises a plurality of backbone network layers which are sequentially connected; S510 to S530 are described in detail below.
S510, inputting the color image into a first backbone network layer for first-scale feature extraction to obtain first-scale color map features, and carrying out feature fusion processing on the first-scale color map features and target detection features to obtain first fusion features.
In the embodiment of the application, the backbone network is used for extracting the image features of the color image. The backbone network is divided into n network layers, that is, n sequentially connected backbone network layers are obtained; different backbone network layers output features of different scales, that is, convolutions with different strides are used to reduce the height and width of the input while increasing the number of channels, so that features of different scales are output. In an example, the backbone network may be a residual network (ResNet) with a depth of 50 layers, and the backbone network layer is a convolutional neural network (CNN).
The color image is input into the first backbone network layer for first-scale feature extraction to obtain the first-scale color map feature, and the first-scale color map feature and the target detection feature are directly added to obtain the first fusion feature.
S520, inputting the first fusion feature into a next backbone network layer for second-scale feature extraction to obtain a second-scale color map feature, and carrying out feature fusion processing according to the second-scale color map feature and the target detection feature to obtain a second fusion feature.
S530, inputting the second fusion feature to the next backbone network layer, and carrying out feature fusion processing again until the last backbone network layer outputs to obtain the color map feature of the target scale, and obtaining the fusion feature according to the color map feature of the target scale and the target detection feature.
The first fusion feature is input into the next backbone network layer connected with the first backbone network layer for second-scale feature extraction to obtain the second-scale color map feature, and feature fusion is then performed according to the second-scale color map feature and the target detection feature to obtain the second fusion feature. In an example, the second-scale color map feature and the target detection feature are added to obtain the second fusion feature; in another example, the target detection feature is convolved to obtain a detection feature matching the feature dimensions of the second-scale color map feature, and the second-scale color map feature and the detection feature are then added to obtain the second fusion feature.
The second fusion feature is then input into the next backbone network layer for third-scale feature extraction to obtain the third-scale color map feature, and feature fusion processing is performed according to the third-scale color map feature and the target detection feature to obtain the third fusion feature. This continues in the same way until the (n-1)-th fusion feature is input into the last backbone network layer, that is, the n-th backbone network layer outputs the n-th-scale color map feature, and feature fusion processing is performed according to the n-th-scale color map feature and the target detection feature to obtain the fusion feature.
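For ease of understanding, a minimal sketch of this layer-by-layer fusion is given below; projecting the target detection feature with a 1x1 convolution at every layer (including the first) and assuming that spatial dimensions already match are illustrative simplifications, not limitations of the embodiment.

```python
import torch.nn as nn

# Sketch of multi-scale feature extraction with fusion at every backbone layer.
class MultiScaleFusionBackbone(nn.Module):
    def __init__(self, backbone_layers, projections):
        super().__init__()
        self.layers = nn.ModuleList(backbone_layers)    # n sequentially connected backbone layers
        self.projections = nn.ModuleList(projections)   # 1x1 convs matching the detection feature to each scale

    def forward(self, color_image, target_detection_feature):
        fused = color_image
        for layer, proj in zip(self.layers, self.projections):
            scale_feature = layer(fused)                     # color map feature at the current scale
            detect_feature = proj(target_detection_feature)  # align feature dimensions with this scale
            fused = scale_feature + detect_feature           # feature addition gives the fusion feature
        return fused                                         # fusion feature from the last backbone layer
```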
It should be noted that, for other detailed descriptions of S210 to S220 and S240 shown in fig. 5, please refer to S210 to S220 and S240 shown in fig. 2, and the detailed descriptions are omitted herein.
In the embodiment of the application, the output of different scale features is realized through a plurality of backbone network layers, and further, the feature fusion is carried out according to the output of the scale features and the target detection features, so that the deeper and effective multi-mode feature fusion is effectively realized, and the fusion effect is improved.
In an embodiment of the present application, another defect detection method is provided, where the defect detection method may be applied to the implementation environment shown in fig. 1, and the method may be performed by a terminal or a server, or may be performed by the terminal and the server together, and in an embodiment of the present application, the defect detection method is described by using the method performed by the server as an example, as shown in fig. 6, and the process of obtaining the second fusion feature in S520 in fig. 5 is extended to S610 to S620 on the basis of the process shown in fig. 5. The target detection model further comprises a plurality of feature fusion modules which are sequentially connected, the number of the feature fusion modules is the same as that of the backbone network layers, if the number of the backbone network layers is n, the number of the feature fusion modules is also n, and the steps S610-S620 are described in detail below.
S610, inputting the first fusion feature into a next backbone network layer to extract features of a second scale to obtain color map features of the second scale, and carrying out convolution processing on target detection features through a next feature fusion module connected with a first feature fusion module to obtain depth detection features corresponding to the color map features of the second scale, wherein the target detection features are the features output by the first feature fusion module.
And S620, carrying out feature addition processing on the color map features and the depth detection features of the second scale to obtain second fusion features.
In the embodiment of the present application, the feature fusion module has a structure different from that of the backbone network layer, and the plurality of feature fusion modules are sequentially connected. The target detection feature is the feature output by the first feature fusion module; that is, the target graph feature obtained through the corresponding modal branch is input to the first feature fusion module for feature extraction, so as to extract higher-level modal features and obtain the target detection feature. For details, refer to the embodiment shown in fig. 3, which will not be described herein again.
The target detection feature is input to the next feature fusion module connected with the first feature fusion module for further convolution processing to obtain the depth detection feature, whose feature dimensions match those of the corresponding backbone network layer. Thus, after the second-scale color map feature is obtained, the feature dimensions of the depth detection feature match the second-scale color map feature, and feature addition processing is performed on the dimension-matched second-scale color map feature and depth detection feature to obtain the second fusion feature, realizing deeper and more effective feature fusion.
The depth detection feature is then input to the next feature fusion module for further feature processing to obtain the depth detection feature of the next feature dimension, which is subjected to feature addition processing with the third-scale color map feature; this continues until the last feature fusion module performs its feature processing.
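For ease of understanding, a minimal sketch of the sequentially connected feature fusion modules is given below; the ReLU activation, the channel counts, and the omission of any spatial downsampling between scales are assumptions for illustration only.

```python
import torch.nn as nn

# Each fusion module: 1x1 conv -> activation -> 1x1 conv, as described for the Adaptor.
def make_fusion_module(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
        nn.ReLU(inplace=True),                 # activation function (ReLU assumed here)
        nn.Conv2d(out_ch, out_ch, kernel_size=1),
    )

channels = [64, 128, 256, 512]                 # per-backbone-layer channels (assumed)
fusion_modules = nn.ModuleList(
    [make_fusion_module(c_in, c_out)
     for c_in, c_out in zip([64] + channels[:-1], channels)]
)

def depth_detection_features(target_graph_feature):
    feats, x = [], target_graph_feature
    for module in fusion_modules:
        x = module(x)                          # deeper detection feature for the next scale
        feats.append(x)
    return feats                               # one depth detection feature per backbone layer
```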
It should be noted that, for other detailed descriptions of S210 to S220, S510, S530, S240 shown in fig. 6, please refer to S210 to S220, S510, S530, S240 shown in fig. 5, and the detailed descriptions are omitted here.
In the embodiment of the application, when the characteristics with different scales are output through the backbone network layer, the target detection characteristics are processed through the plurality of characteristic fusion modules which are sequentially connected, so that the processed characteristics are matched with the dimension of the characteristics output by the backbone network layer, and further, the characteristic addition processing can be effectively performed.
It should be noted that, in an embodiment of the present application, another defect detection method is provided, and the defect detection method may be applied to the implementation environment shown in fig. 1, and the method may be performed by a terminal or a server, or may be performed by the terminal and the server together, and in an embodiment of the present application, the defect detection method is described by using the method performed by the server as an example, as shown in fig. 7, and S620 is extended to S710 to S730 on the basis of the defect detection method shown in fig. 6. S710 to S730 are described in detail below.
S710, generating first weights of the color map features of the second scale according to the color map features of the second scale.
In the embodiment of the application, the fusion weight is dynamically generated according to the semantic information of features of different scales, and weighted fusion of the features is then performed, realizing deeper and more effective feature fusion. The first weight generated according to the second-scale color map feature may be allocated according to the output of a fully connected layer: for example, the second-scale color map feature is subjected to a linear transformation (weight and bias) through one fully connected layer, the result is input into a Sigmoid activation function, the Sigmoid function maps the linearly transformed value to the interval [0,1], and the mapped value is taken as the first weight, which expresses the contribution degree of the second-scale color map feature during fusion.
S720, generating a second weight corresponding to the depth detection feature according to the first weight.
The second weight is a complement of the first weight, i.e., subtracting the first weight from 1 yields the second weight.
And S730, carrying out feature weighted summation on the color map features and the depth detection features of the second scale according to the first weight and the second weight to obtain a second fusion feature.
In the embodiment of the application, the first weight is used for weighting the color map features of the second scale, the second weight is used for weighting the depth detection features, and the obtained second fusion features are linear combinations of the color map features of the second scale and the depth detection features, so that the two features are mixed according to the contribution of each weight.
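For ease of understanding, a minimal sketch of the weight generation and weighted summation is given below; pooling the second-scale color map feature into a per-sample descriptor before the fully connected layer is an assumption for illustration, while the Sigmoid gating and the complementary second weight follow the description above.

```python
import torch
import torch.nn as nn

# Sketch of dynamically weighted fusion of the color map feature and the depth detection feature.
class AdaptiveWeightedFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)    # summarize the second-scale color map feature
        self.fc = nn.Linear(channels, 1)       # linear transformation (weight and bias)

    def forward(self, color_feature, depth_feature):
        descriptor = self.pool(color_feature).flatten(1)
        w1 = torch.sigmoid(self.fc(descriptor)).view(-1, 1, 1, 1)  # first weight in [0, 1]
        w2 = 1.0 - w1                                              # complementary second weight
        return w1 * color_feature + w2 * depth_feature             # second fusion feature
```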
In other embodiments of the present application, when features are fused, feature adjustment may be performed according to the sizes and channel numbers of features of different dimensions, so that the features can be fused effectively. For example, the size and stride of the convolution kernel are dynamically generated according to the second scale, for example, convolution kernel size = rounded up (height or width of the input feature / height or width of the target feature); feature sampling is then performed on the second-scale color map feature according to the size and stride of the convolution kernel to obtain a sampling feature, and finally feature addition processing is performed on the sampling feature and the depth detection feature to obtain the second fusion feature.
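A minimal sketch of this dynamically sized sampling is given below; using average pooling with stride equal to the kernel size, followed by interpolation to the exact target size, is an assumption that stands in for the convolution-based sampling described above.

```python
import math
import torch.nn.functional as F

def resample_to_target(feature, target_h, target_w):
    _, _, h, w = feature.shape
    # kernel size = rounded up (input height or width / target height or width)
    k_h = max(1, math.ceil(h / target_h))
    k_w = max(1, math.ceil(w / target_w))
    sampled = F.avg_pool2d(feature, kernel_size=(k_h, k_w), stride=(k_h, k_w))
    if sampled.shape[-2:] != (target_h, target_w):
        sampled = F.interpolate(sampled, size=(target_h, target_w), mode="nearest")
    return sampled   # sampling feature, ready for feature addition with the depth detection feature
```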
It should be noted that, for other detailed descriptions of S210 to S220, S510, S610, S530, S240 shown in fig. 7, please refer to S210 to S220, S510, S610, S530, S240 shown in fig. 6, and the detailed descriptions thereof are omitted herein.
In the embodiment of the application, when the features output by the corresponding backbone network layer and the features output by the feature fusion module are fused, fusion weights are dynamically generated according to semantic information of features with different scales, and further, the weighted fusion of the features is carried out, so that deeper and effective feature fusion is realized.
In an embodiment of the present application, another defect detection method is provided, and the defect detection method may be applied to the implementation environment shown in fig. 1, where the method may be performed by a terminal or a server, or may be performed by the terminal and the server together, and in an embodiment of the present application, the defect detection method is described by using the method performed by the server as an example, as shown in fig. 8, and S240 is extended to S810 to S820 on the basis of the defect detection method shown in fig. 2. Wherein, S810-S820 are described in detail below.
S810, inputting the fusion characteristics into a detection head module of the target detection model for defect detection.
S820, obtaining the defect type and the defect frame of the target object output by the detection head module, and obtaining a defect detection result according to the defect type and the defect frame.
In the embodiment of the application, the detection head module of the target detection model is used for carrying out category prediction and box regression. The detection head comprises two sub-networks: one responsible for regression and one responsible for classification; the regression sub-network is used to predict the location and size of the bounding boxes, and the classification sub-network is used to predict the class to which each bounding box belongs.
Thus, a defect class of the target object output by the detection head module and a defect frame can be obtained, wherein the defect class is used for indicating a specific type of defect, the defect frame is used for indicating the position and the size of the defect, and in an example, the defect frame can also indicate the confidence score of the defect; and then the defect type and the defect frame are taken as defect detection results of the target object.
In an example, the fusion feature is input to the detection head, and the detection head performs convolution and activation processing on the fusion feature; the classification sub-network processes the convolution and activation output to predict the class of the target defect. For example, a Softmax function is used to map the output into a probability distribution in which each class corresponds to a probability value, thereby obtaining a probability vector of the defect and the class of the defect; the probability vector represents the probability that the target at each position is a defect, and the class indicates the specific category of the defect.
Because the probability vector represents the probability that the target at each position is a defect, the target defect at the target position whose probability is larger than a preset probability threshold can be obtained, and the target defect at the target position is processed through the regression sub-network to predict the position and size of the bounding box of the target defect, for example, the regression values of the bounding box coordinates are generated through a convolution operation with a 1x1 convolution kernel.
And integrating the position and the size of the predicted target defect and the type of the defect to obtain a final defect detection result of the target object.
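For ease of understanding, a minimal sketch of such a two-branch detection head is given below; the shared convolution, channel counts, number of defect classes and the omission of anchors and post-processing (such as non-maximum suppression) are assumptions for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

# Sketch of a detection head with a classification sub-network and a regression sub-network.
class DetectionHead(nn.Module):
    def __init__(self, in_channels=256, num_classes=5):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),                       # convolution and activation processing
        )
        self.cls_head = nn.Conv2d(in_channels, num_classes, kernel_size=1)  # class logits per position
        self.reg_head = nn.Conv2d(in_channels, 4, kernel_size=1)            # box coordinates per position

    def forward(self, fusion_feature):
        x = self.shared(fusion_feature)
        cls_prob = F.softmax(self.cls_head(x), dim=1)    # Softmax probability distribution over classes
        boxes = self.reg_head(x)                         # regression values for the bounding box
        return cls_prob, boxes
```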
It should be noted that, for other detailed descriptions of S210 to S230 shown in fig. 8, please refer to S210 to S230 shown in fig. 2, and further description is omitted herein.
In the embodiment of the application, the fusion characteristics are subjected to defect detection through the detection head of the target detection model to obtain the defect type and the defect frame of the target object, and then the defect detection result is obtained according to the defect type and the defect frame, so that the defect detection result has integrity and reliability.
In an embodiment of the present application, another defect detection method is provided, and the defect detection method may be applied to the implementation environment shown in fig. 1, where the method may be performed by a terminal or a server, or may be performed by the terminal and the server together, and in an embodiment of the present application, the defect detection method is described by using the method performed by the server as an example, as shown in fig. 9, and S210 is extended to S910 to S930 on the basis of the defect detection method shown in fig. 2. S910 to S930 are described in detail below.
S910, under a plurality of illumination conditions, respectively imaging the front part of the target object through a plurality of shooting devices to obtain a plurality of images, wherein each image corresponds to a specific illumination condition.
S920, acquiring luminosity information corresponding to each image, calculating a normal vector of each pixel point according to the luminosity information corresponding to each image, generating a normal vector diagram corresponding to the front position of the target object, and taking the normal vector diagram as target modal data.
S930, selecting a color image corresponding to the front part from the plurality of images, and generating detection data corresponding to the front part of the target object based on the color image and the target modal data.
In the embodiment of the application, the target object is a three-dimensional object and comprises a front part and a side part, wherein the area of the front part is larger than that of the side part, and the probability of occurrence of defects of the front part is larger than that of occurrence of defects of the side part, so that a color image and a normal vector diagram are adopted as detection data of the front part of the target object.
The normal vector image is obtained through photometric stereo imaging: under a plurality of illumination conditions, the front part of the target object is shot through a plurality of shooting devices, capturing the appearance of the target object under a specific illumination condition; that is, one illumination condition corresponds to one shooting device and each illumination condition generates one image, so that a plurality of images of the front part are obtained. The illumination conditions include illumination direction and illumination intensity, and each shot image corresponds to one specific illumination condition.
For each image, the brightness or color changes of pixel values under different illumination conditions can be compared, and the photometric information of each pixel point under different illumination conditions is calculated according to the compared pixel value changes. Then, using the calculated photometric information, the normal vector of each pixel point is calculated through techniques such as stereoscopic vision and geometric reasoning, and the calculated normal vectors of all pixel points are integrated to generate the normal vector diagram corresponding to the front part of the target object; the normal vector diagram has the same size as the original color image, and each pixel point corresponds to one normal vector.
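For ease of understanding, a minimal sketch of classical photometric stereo normal estimation is given below; it assumes known light directions and an approximately Lambertian surface, which are common assumptions of this classical technique rather than details disclosed by the embodiment.

```python
import numpy as np

def estimate_normal_map(images, light_dirs):
    """images: list of K grayscale images (H, W); light_dirs: (K, 3) unit light direction vectors."""
    I = np.stack([img.reshape(-1) for img in images], axis=0)   # (K, H*W) photometric observations
    L = np.asarray(light_dirs, dtype=np.float64)                # (K, 3)
    # least-squares solve L @ g = I for every pixel; g = albedo * surface normal
    g, *_ = np.linalg.lstsq(L, I, rcond=None)                   # (3, H*W)
    norms = np.linalg.norm(g, axis=0, keepdims=True) + 1e-8
    normals = (g / norms).T.reshape(images[0].shape + (3,))     # per-pixel unit normal vectors
    return normals                                              # normal vector diagram, same size as the images
```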
The image captured by the imaging device is a color image, so that the color image with the highest definition can be directly selected from the plurality of images, and the selected color image and normal vector image are used as detection data corresponding to the front part of the target object.
In another embodiment of the present application, when the detection data includes only a color image, the imaging device images a side portion of the target object to obtain a color image of the side portion, and the color image is used as the detection data corresponding to the side portion of the target object.
Since the probability of occurrence of a defect in the side portion of the target object is small, only a color image can be used as the detection data of the side portion of the target object, wherein the color image of the side portion can be obtained by photographing the side portion of the target object by the photographing device.
In the embodiment of the present application, the defect detection result obtained in step S240 includes a defect detection result of the front portion and a defect detection result of the side portion.
It should be noted that, for other detailed descriptions of S220 to S240 shown in fig. 9, please refer to S220 to S240 shown in fig. 2, and further description is omitted herein.
In the embodiment of the application, aiming at the front part of the target object, a color image and a normal vector image are selected as detection data, so that the front defect is detected more comprehensively and accurately, and the concave-convex point defect with depth sense is detected; and a color image is selected as detection data for the side part of the target object, so that the detection cost is reduced.
In an embodiment of the present application, another defect detection method is provided, and the defect detection method may be applied to the implementation environment shown in fig. 1, where the method may be performed by a terminal or a server, or may be performed by the terminal and the server together, and in an embodiment of the present application, the defect detection method is described by using the method performed by the server as an example, as shown in fig. 10, and S1010 to S1050 are added to the defect detection method based on the training steps shown in fig. 2, that is, the target detection model. S1010 to S1050 are described in detail below.
S1010, acquiring sample detection data comprising a sample color image and sample modal data of a sample object and a marked sample defect result carried by the sample object.
S1020, inputting sample detection data into a multi-mode branch of the model to be trained, and inputting a sample color image into a single-mode branch of the model to be trained to obtain corresponding sample detection characteristics.
S1030, carrying out feature processing on the sample color images in different scales through a backbone network of the model to be trained to obtain multi-scale sample color map features, and carrying out feature fusion on the sample detection features and the multi-scale sample color map features to obtain sample fusion features.
S1040, performing defect detection on the sample fusion characteristics through a detection head of the model to be trained to obtain a sample defect detection result.
S1050, constructing model loss according to the difference of the sample defect detection result and the marked sample defect result, and adjusting model parameters of the model to be trained according to the model loss to obtain a target detection model.
In the embodiment of the application, the training process of the target detection model is similar to its application process. Training data need to be acquired, including sample color images of sample objects, that is, single-mode sample images, and sample modal data comprising sample data of other modes; the sample detection data are then generated from the sample color images and the sample modal data. In addition, the sample object carries a marked sample defect result.
The model to be trained comprises at least two modal branches, wherein the single modal branch is used for extracting the characteristics of a single modal sample image, the multi-modal branch is used for extracting the characteristics of sample detection data, and further corresponding sample detection characteristics are obtained, the sample detection characteristics output by the single modal branch comprise sample RGB characteristics, and the sample detection characteristics output by the multi-modal branch comprise sample RGB characteristics and sample modal characteristics; the multi-modal branch may be the graph-type modal branch, or may be a plurality of branches for extracting different modal data.
In one example, the single-mode sample image and sample detection data are sampled at 50% probability, respectively, i.e., the model to be trained is trained with both single-mode and multi-mode data, so that the model is compatible with the identification of single-mode and multi-mode data.
Feature processing of different scales is performed on the sample color image through the backbone network of the model to be trained to obtain multi-scale sample color map features, and feature fusion is performed on the sample detection features and the multi-scale sample color map features to obtain sample fusion features; that is, the sample color map features and the sample detection features are fused under the different scale features of the backbone network. The sample fusion features are then input to the detection head of the model to be trained to obtain a sample defect detection result, and the error between the sample defect detection result and the marked sample defect result is calculated to obtain the model loss, for example using a cross entropy loss function; the model parameters of the model to be trained are then adjusted through the model loss to obtain the target detection model.
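For ease of understanding, a minimal sketch of one training iteration is given below; the model interface, the optimizer handling and the omission of the box regression loss are assumptions for illustration, while the 50% modal sampling and the cross entropy loss follow the description above.

```python
import random
import torch.nn.functional as F

def train_step(model, optimizer, rgb, normal_map, labels):
    # sample single-mode and multi-mode inputs with equal (50%) probability so the
    # trained model stays compatible with both kinds of detection data
    use_multimodal = random.random() < 0.5
    logits = model(rgb, normal_map if use_multimodal else None)  # assumed model signature
    loss = F.cross_entropy(logits, labels)       # classification part of the model loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                             # iterative update of the model parameters
    return loss.item()
```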
It should be noted that, for other detailed descriptions of S210 to S240 shown in fig. 10, please refer to S210 to S240 shown in fig. 2, and further description is omitted herein.
In order to facilitate understanding, the embodiment of the present application further provides a defect detection method which is illustrated by taking a lithium battery as the target object. Defect detection is performed on the lithium battery through a target detection model; the target detection model is a model compatible with single-mode and multi-mode input and includes a multi-modal backbone network (Multi-modal Backbone) f_θ and a detection head (Detection Head), where the detection head carries out category prediction and box regression.
The detection data are multi-modal data, exemplified here by an RGB color image and a Normal Map normal vector diagram. As shown in fig. 11, the RGB color image of the lithium battery and the corresponding Normal Map normal vector diagram are input into the target detection model; the multi-modal backbone network performs multi-modal selection and fusion processing and outputs the fused features of the RGB image and the Normal Map image. The detection head module processes the fused features, that is, performs category prediction and box regression, and outputs the defect category and the box, where the box is used for representing the size and position of the defect.
In the embodiment of the present application, as shown in fig. 12 (a), the multi-modal backbone network includes 3 key modules: a backbone network Backbone, a multi-modal multi-scale feature adaptive fusion module Adaptor, and a modality selector Modality selector. The Adaptor can implement multi-scale feature adaptive fusion of the RGB image and the Normal Map image of the lithium battery, and the Modality selector makes a single model compatible with single-mode and multi-mode inputs.
The backbone network Backbone is used for extracting the features of the RGB color image x and fusing them with the features of the Normal Map image. Common backbone networks such as ResNet and Swin-Transformer may be used. The backbone network is divided into n backbone network layers, each layer outputting feature maps of a different scale; a backbone network layer may be a convolutional (CNN) or Transformer network layer. These n backbone network layers are connected in sequence, and the output of the backbone network consists of the feature maps output by the n layers.
Modality selector: compatible with both single-mode and multi-mode inputs, the mode selector includes 2 branches as shown in fig. 12 (c).
Single-mode RGB branch: the RGB image x passes through a 1x1 convolution layer and then through a second 1x1 convolution layer, outputting RGB image features.
Multi-modal branch: the RGB image x and the Normal Map image each pass through a 1x1 convolution layer, the two resulting features are spliced by Concat, and the spliced feature passes through a further 1x1 convolution layer.
Here each 1x1 convolution layer is a convolution layer with a convolution kernel size of 1x1, and Concat denotes feature splicing.
Final modality selector: if only the RGB image x is input, the gating parameters are set so that only the single-mode branch is calculated and RGB image features are output; if the RGB image x and the Normal Map image are input, the gating parameters are set so that only the multi-modal branch is calculated and multi-modal features (that is, the target graph features) are output. The multi-modal features can fully utilize the information of different data types and improve the performance and generalization capability of the model; in this way, the modality selector is compatible with single-mode and multi-mode inputs.
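For ease of understanding, a minimal sketch of the modality selector is given below; the channel sizes are illustrative assumptions, while the 1x1 convolutions, the Concat feature splice and the branch gating follow the description above.

```python
import torch
import torch.nn as nn

# Sketch of a modality selector compatible with single-mode and multi-mode inputs.
class ModalitySelector(nn.Module):
    def __init__(self, rgb_ch=3, nm_ch=3, mid_ch=32, out_ch=64):
        super().__init__()
        self.rgb_conv = nn.Conv2d(rgb_ch, mid_ch, kernel_size=1)
        self.nm_conv = nn.Conv2d(nm_ch, mid_ch, kernel_size=1)
        self.single_conv = nn.Conv2d(mid_ch, out_ch, kernel_size=1)      # second conv of the single-mode branch
        self.multi_conv = nn.Conv2d(2 * mid_ch, out_ch, kernel_size=1)   # second conv of the multi-modal branch

    def forward(self, rgb, normal_map=None):
        rgb_feat = self.rgb_conv(rgb)
        if normal_map is None:                        # single-mode input: only the RGB branch is calculated
            return self.single_conv(rgb_feat)
        nm_feat = self.nm_conv(normal_map)
        spliced = torch.cat([rgb_feat, nm_feat], dim=1)   # Concat feature splice
        return self.multi_conv(spliced)               # target graph features (multi-modal)
```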
Multi-modal multi-scale feature adaptive fusion module Adaptor: the multi-modal features are extracted, and RGB and multi-modal feature fusion is carried out under the different scale features of the backbone network, so that deeper and more effective multi-modal feature fusion is achieved and the fusion effect is improved. As shown in fig. 12 (b), a feature fusion module includes a convolution layer with a convolution kernel size of 1x1, an activation function, and a further convolution layer with a convolution kernel size of 1x1. Corresponding to the n layers of the backbone network, the Adaptor likewise includes n feature fusion modules connected in sequence.
The Adaptor performs multi-modal feature fusion at every layer: at the first layer, the feature output by the first fusion module is added to the first-scale color map feature, and the fusion structure at the i-th layer is generalized in the same way. Different layers carry features of different scales, so the Adaptor can realize multi-scale feature adaptive fusion.
The flow in which the multi-modal backbone network Multi-modal Backbone f_θ processes data is shown in (a) of fig. 12 and includes: the RGB image and the Normal Map image are input into the modality selector; the modality selector carries out a convolution operation on the RGB image and the Normal Map image through convolution layers with a convolution kernel size of 1x1 so as to extract the image features of the RGB image and the image features of the Normal Map image, then splices the image features of the RGB image and the image features of the Normal Map image, and further carries out a convolution operation on the spliced features through a convolution layer with a convolution kernel size of 1x1 to obtain the target graph features corresponding to the RGB image and the Normal Map image.
The target graph features are input into the first feature fusion module A1 of the Adaptor; the feature fusion module A1 carries out a convolution operation on the target graph features through a convolution layer with a convolution kernel size of 1x1, then performs nonlinear processing on the convolved features through an activation function, and finally carries out a convolution operation on the nonlinearly processed features through a convolution layer with a convolution kernel size of 1x1, so that the target detection feature a1 is obtained.
At the same time, the RGB image is input into the backbone network. The first backbone network layer B1 of the backbone network performs feature extraction on the RGB image to obtain the first-scale color map feature b1, and feature addition processing is performed on the first-scale color map feature b1 and the target detection feature a1 to obtain the first fusion feature c1. The backbone network layer B2 then performs feature extraction on the first fusion feature c1 to obtain the second-scale graph feature b2; the target detection feature a1 is input into the second feature fusion module A2, which performs further convolution processing on a1 to obtain the detection feature a2 whose feature dimensions match those of the graph feature b2. Feature addition processing is performed on the detection feature a2 and the graph feature b2 to obtain the second fusion feature, which is input into the next backbone network layer to obtain the third-scale graph feature b3, and so on, until the fusion feature an+bn is obtained.
The fusion features are then input into the detection head module to obtain the defect category and the box.
In the embodiment of the application, a large amount of box frame annotation data is used for carrying out full-supervision training on the multi-mode target detection model, and model parameter iterative updating is carried out by calculating the error loss (such as cross entropy loss function) of the true value annotation and the model prediction result.
According to the defect detection method provided by the application, the RGB color image and the Normal Map normal vector diagram with stereoscopic depth are collected at the same time, so that the three-dimensional character of defects can be highlighted, the recognition rate is improved and algorithm missed detections are reduced. Modality selection is then performed through the provided modality selector Modality selector, so that a single model is compatible with single-mode and multi-mode inputs. Multi-modal data fusion is then carried out through the provided multi-modal multi-scale feature adaptive fusion module Adaptor, realizing multi-scale feature adaptive fusion of the RGB image and the Normal Map image of the lithium battery. Finally, the defect category and the box are detected using a target detection algorithm.
An embodiment of the apparatus of the present application is described herein, which may be used to perform the defect detection method of the above-described embodiment of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the above-mentioned embodiments of the defect detection method of the present application.
An embodiment of the present application provides a defect detection apparatus which, as shown in fig. 13, includes the following modules.
An acquisition module 1310, configured to acquire detection data of the target object.
And a feature processing module 1320, configured to, if the detection data includes a color image and target modal data, perform feature extraction on the color image and the target modal data to obtain a target detection feature, where the target detection feature is fused with a feature of the color image and a feature of the target modal data.
The feature fusion module 1330 is configured to extract features of different scales from the color image through a backbone network of the pre-trained target detection model to obtain multi-scale color map features, and perform feature fusion on the target detection features and the multi-scale color map features to obtain fusion features.
And a defect detection module 1340, configured to detect a defect of the target object according to the fusion feature, so as to obtain a defect detection result of the target object.
In an embodiment of the present application, based on the foregoing solution, the feature processing module is further configured to select a graph type mode branch of the target detection model to perform feature extraction on the color image and the target mode data, respectively, if the detection data includes the color image and the target mode data, and the data type of the target mode data includes a vector graph type, so as to obtain a color image feature and a vector image feature, respectively; performing feature fusion on the color map features and the vector map features to obtain target map features; and carrying out convolution processing on the target graph features to obtain target detection features, wherein the feature dimensions of the target detection features are matched with the feature dimensions of the color graph features of corresponding scales.
In one embodiment of the present application, based on the foregoing solution, the object detection model further includes at least one feature fusion module; the feature processing module is further used for inputting the target graph features to a first convolution layer of the feature fusion module to carry out convolution processing to obtain first features; performing nonlinear transformation processing on the first feature according to an activation function to obtain a second feature; and inputting the second feature to a second convolution layer of the feature fusion module to carry out convolution processing to obtain the target detection feature.
In one embodiment of the present application, based on the foregoing scheme, the graph-type modal branch includes a first convolution layer and a second convolution layer; the feature processing module is further used for inputting the color image and the target modal data to a first convolution layer of the graph type modal branch to perform feature extraction, so as to obtain color graph features and vector graph features; performing feature stitching processing on the color map features and the vector map features to obtain stitching features; and inputting the spliced features to a second convolution layer of the graph type modal branch to perform feature extraction to obtain the target graph features.
In one embodiment of the present application, based on the foregoing scheme, the graph-type modal branch further includes a self-attention layer; the feature processing module is further used for inputting the color map features and the vector map features to the self-attention layer so as to calculate the similarity of the color map features and the vector map features and generate attention weights according to the similarity; performing feature alignment processing on the color map features and the vector map features according to the attention weight; and performing feature stitching on the color map features and the vector map features after feature alignment processing to obtain the stitching features.
In one embodiment of the present application, based on the foregoing solution, the feature processing module is further configured to input, if the detection data includes a color image, the color image to a first convolution layer of a single-mode branch of the target detection model for feature processing; inputting the characteristics output by the first convolution layer of the single-mode branch into a second convolution layer of the single-mode branch for characteristic processing to obtain convolution graph characteristics; and carrying out convolution processing on the convolution graph characteristics to obtain the target detection characteristics.
In one embodiment of the present application, based on the foregoing scheme, the backbone network includes a plurality of backbone network layers connected in sequence; the feature fusion module is further used for inputting the color image to a first backbone network layer for first-scale feature extraction to obtain first-scale color image features, and carrying out feature fusion processing on the first-scale color image features and the target detection features to obtain first fusion features; inputting the first fusion feature to a next backbone network layer for second-scale feature extraction to obtain a second-scale color map feature, and carrying out feature fusion processing according to the second-scale color map feature and the target detection feature to obtain a second fusion feature; and inputting the second fusion feature to a next backbone network layer, and carrying out feature fusion processing again until a final backbone network layer outputs a color map feature with a target scale, and obtaining the fusion feature according to the color map feature with the target scale and the target detection feature.
In one embodiment of the present application, based on the foregoing solution, the target detection model further includes a plurality of feature fusion modules connected in sequence, where the number of feature fusion modules is the same as the number of backbone network layers; the feature fusion module is further used for carrying out convolution processing on the target detection feature through a next feature fusion module connected with the first feature fusion module to obtain a depth detection feature corresponding to the color map feature of the second scale; the target detection feature is the feature output by the first feature fusion module; and carrying out feature addition processing on the color map features of the second scale and the depth detection features to obtain the second fusion features.
In one embodiment of the present application, based on the foregoing scheme, the feature fusion module is further configured to generate a first weight of the second-scale color map feature according to the second-scale color map feature; generating a second weight corresponding to the depth detection feature according to the first weight; and carrying out feature weighted summation on the color map features of the second scale and the depth detection features according to the first weight and the second weight to obtain the second fusion features.
In one embodiment of the present application, based on the foregoing scheme, the defect detection module is further configured to input the fusion feature to a detection head module of the target detection model for defect detection; and obtaining the defect type and the defect frame of the target object output by the detection head module, and obtaining the defect detection result according to the defect type and the defect frame.
In an embodiment of the present application, based on the foregoing solution, the obtaining module is further configured to image, under a plurality of illumination conditions, a front portion of the target object by using a plurality of photographing devices to obtain a plurality of images, where each image corresponds to one illumination condition; acquiring luminosity information corresponding to each image, calculating a normal vector of each pixel point according to the luminosity information corresponding to each image so as to generate a normal vector diagram corresponding to the front part of the target object, and taking the normal vector diagram as the target modal data; and selecting a color image corresponding to the front part from a plurality of images, and generating detection data corresponding to the front part of the target object based on the color image and the target modal data.
In one embodiment of the present application, based on the foregoing scheme, the apparatus further includes a training module, where the training module is configured to obtain sample detection data including a sample color image and sample modal data of a sample object, and a marked sample defect result carried by the sample object; inputting the sample detection data into a multi-modal branch of a model to be trained, and inputting the sample color image into a single-modal branch of the model to be trained to obtain corresponding sample detection characteristics; carrying out feature processing on the sample color images in different scales through a backbone network of the model to be trained to obtain multi-scale sample color map features, and carrying out feature fusion on the sample detection features and the multi-scale sample color map features to obtain sample fusion features; performing defect detection on the sample fusion characteristics through a detection head of the model to be trained to obtain a sample defect detection result; and constructing model loss according to the difference of the sample defect detection result and the marked sample defect result, and adjusting model parameters of the model to be trained according to the model loss to obtain the target detection model.
It should be noted that, the apparatus provided in the foregoing embodiments and the method provided in the foregoing embodiments belong to the same concept, and the specific manner in which each module and unit perform the operation has been described in detail in the method embodiments, which is not repeated herein.
The device provided in the above embodiment may be provided in the terminal or in the server.
The embodiment of the application also provides an electronic device, which comprises one or more processors, and a storage device, wherein the storage device is used for storing one or more computer programs, and when the one or more computer programs are executed by the one or more processors, the electronic device is enabled to realize the defect detection method.
Fig. 14 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
It should be noted that, the computer system 1400 of the electronic device shown in fig. 14 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 14, the computer system 1400 includes a processor (Central Processing Unit, CPU) 1401 which can perform various appropriate actions and processes, such as performing the methods in the above-described embodiments, according to a program stored in a Read-Only Memory (ROM) 1402 or a program loaded from a storage portion 1408 into a random access Memory (Random Access Memory, RAM) 1403. In the RAM 1403, various programs and data required for system operation are also stored. The CPU 1401, ROM 1402, and RAM 1403 are connected to each other through a bus 1404. An Input/Output (I/O) interface 1405 is also connected to bus 1404.
In some embodiments, the following components are connected to the I/O interface 1405: an input section 1406 including a keyboard, a mouse, and the like; an output portion 1407 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage portion 1408 including a hard disk or the like; and a communication section 1409 including a network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1409 performs communication processing via a network such as the internet. The drive 1410 is also connected to the I/O interface 1405 as needed. A removable medium 1411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on the drive 1410 so that a computer program read therefrom is installed into the storage portion 1408 as needed.
In particular, according to embodiments of the present application, the process described above with reference to the flowchart may be implemented as a computer program. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1409 and/or installed from the removable medium 1411. When the computer program is executed by the central processing unit (CPU) 1401, the various functions defined in the system of the present application are performed.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory), a flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with a computer-readable computer program embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. A computer program embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. Where each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer programs.
The units or modules involved in the embodiments of the present application may be implemented in software, or may be implemented in hardware, and the described units or modules may also be disposed in a processor. Where the names of the units or modules do not in some way constitute a limitation of the units or modules themselves.
Another aspect of the present application also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a defect detection method as described above. The computer-readable storage medium may be included in the electronic device described in the above embodiment or may exist alone without being incorporated in the electronic device.
Another aspect of the present application also provides a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the electronic device reads the computer program from the computer-readable storage medium, and the processor executes the computer program to cause the electronic device to execute the defect detection method provided in the above-described respective embodiments.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
The foregoing is merely illustrative of the preferred embodiments of the present application and is not intended to limit the embodiments of the present application, and those skilled in the art can easily make corresponding variations or modifications according to the main concept and spirit of the present application, so that the protection scope of the present application shall be defined by the claims.

Claims (16)

1. A defect detection method, comprising:
Acquiring detection data of a target object;
if the detection data comprises a color image and target modal data, extracting features of the color image and the target modal data to obtain target detection features, wherein the target detection features are fused with the features of the color image and the features of the target modal data;
extracting features of different scales from the color image through a main network of a pre-trained target detection model to obtain multi-scale color image features, and carrying out feature fusion on the target detection features and the multi-scale color image features to obtain fusion features;
and performing defect detection on the target object according to the fusion characteristics to obtain a defect detection result of the target object.
2. The method according to claim 1, wherein if the detection data includes color image and target modal data, performing feature extraction on the color image and target modal data to obtain target detection features, including:
If the detection data comprises a color image and target modal data, and the data type of the target modal data comprises a vector diagram type, selecting a diagram type modal branch of the target detection model to respectively perform feature extraction on the color image and the target modal data so as to respectively obtain color diagram features and vector diagram features;
performing feature fusion on the color map features and the vector map features to obtain target map features;
And carrying out convolution processing on the target graph characteristics to obtain the target detection characteristics.
3. The method of claim 2, wherein the object detection model further comprises at least one feature fusion module; the convolving the target graph feature to obtain the target detection feature comprises the following steps:
Inputting the target graph characteristics to a first convolution layer of the characteristic fusion module for convolution processing to obtain first characteristics;
performing nonlinear transformation processing on the first feature according to an activation function to obtain a second feature;
and inputting the second feature to a second convolution layer of the feature fusion module to carry out convolution processing to obtain the target detection feature.
4. The method of claim 2, wherein the graph-type modal branch comprises a first convolution layer and a second convolution layer, and selecting the graph-type modal branch of the target detection model to perform feature extraction on the color image and the target modal data respectively, so as to obtain the color map features and the vector map features respectively, comprises:
inputting the color image and the target modal data to the first convolution layer of the graph-type modal branch for feature extraction to obtain the color map features and the vector map features;
and wherein performing feature fusion on the color map features and the vector map features to obtain the target map features comprises:
performing feature stitching processing on the color map features and the vector map features to obtain stitched features;
and inputting the stitched features to the second convolution layer of the graph-type modal branch for feature extraction to obtain the target map features.
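For illustration only and not forming part of the claims: one possible reading of claims 2 and 4, sketched below under the assumptions that the first convolution layer is applied to the color image and to the target modal data separately and that the second convolution layer processes their concatenation; the 3-channel inputs and the channel widths are assumptions.

```python
import torch
import torch.nn as nn

class GraphTypeModalBranch(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        # First convolution layer: extracts color map features and vector map features.
        self.conv1 = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        # Second convolution layer: fuses the stitched (concatenated) features.
        self.conv2 = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, color_image, normal_vector_map):
        color_feat = self.conv1(color_image)           # color map features
        vector_feat = self.conv1(normal_vector_map)    # vector map features
        stitched = torch.cat([color_feat, vector_feat], dim=1)  # feature stitching
        return self.conv2(stitched)                    # target map features

target_map_feat = GraphTypeModalBranch()(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
```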
5. The method of claim 4, wherein the graph-type modal branch further comprises a self-attention layer, and performing feature stitching processing on the color map features and the vector map features to obtain the stitched features comprises:
inputting the color map features and the vector map features to the self-attention layer to calculate a similarity between the color map features and the vector map features, and generating attention weights according to the similarity;
performing feature alignment processing on the color map features and the vector map features according to the attention weights;
and performing feature stitching on the color map features and the vector map features after the feature alignment processing to obtain the stitched features.
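For illustration only and not forming part of the claims: a sketch of how the self-attention alignment of claim 5 might look, where the similarity between the two feature maps yields attention weights used to align the vector map features to the color map features before stitching; flattening the maps into token form and the softmax normalization are implementation assumptions.

```python
import torch

def attention_align_and_stitch(color_feat, vector_feat):
    """color_feat, vector_feat: (B, C, H, W). Returns stitched features of shape (B, 2C, H, W)."""
    b, c, h, w = color_feat.shape
    q = color_feat.flatten(2).transpose(1, 2)    # (B, HW, C) queries from the color map features
    k = vector_feat.flatten(2).transpose(1, 2)   # (B, HW, C) keys from the vector map features
    sim = q @ k.transpose(1, 2) / c ** 0.5       # similarity of the two feature maps
    attn = sim.softmax(dim=-1)                   # attention weights generated from the similarity
    aligned = (attn @ k).transpose(1, 2).reshape(b, c, h, w)  # vector features aligned to the color features
    return torch.cat([color_feat, aligned], dim=1)            # feature stitching after alignment

stitched = attention_align_and_stitch(torch.randn(1, 32, 40, 40), torch.randn(1, 32, 40, 40))
```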
6. The method according to claim 1, further comprising:
if the detection data comprises a color image, inputting the color image to a first convolution layer of a single-modal branch of the target detection model for feature processing;
inputting the features output by the first convolution layer of the single-modal branch to a second convolution layer of the single-modal branch for feature processing to obtain convolution map features;
and performing convolution processing on the convolution map features to obtain the target detection features.
7. The method according to any one of claims 1 to 6, wherein the backbone network comprises a plurality of backbone network layers connected in sequence, and performing feature extraction of different scales on the color image through the backbone network of the pre-trained target detection model to obtain the multi-scale color map features, and performing feature fusion on the target detection features and the multi-scale color map features to obtain the fusion features, comprises:
inputting the color image to a first backbone network layer for first-scale feature extraction to obtain first-scale color map features, and performing feature fusion processing on the first-scale color map features and the target detection features to obtain first fusion features;
inputting the first fusion features to a next backbone network layer for second-scale feature extraction to obtain second-scale color map features, and performing feature fusion processing on the second-scale color map features and the target detection features to obtain second fusion features;
and inputting the second fusion features to a next backbone network layer and performing feature fusion processing again, until a final backbone network layer outputs color map features of a target scale, and obtaining the fusion features according to the color map features of the target scale and the target detection features.
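For illustration only and not forming part of the claims: a rough sketch of the stage-by-stage fusion described in claim 7, assuming a PyTorch-style backbone split into sequential layers; the number of stages, the stride-2 convolutions used for downsampling and the channel widths are assumptions, and the fusion module is applied at every stage for simplicity.

```python
import torch
import torch.nn as nn

# Hypothetical three-stage backbone; each layer halves the spatial size (widths assumed).
backbone_layers = nn.ModuleList([
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1),
])
# One fusion module per backbone layer, bringing the detection features to the matching scale.
fusion_modules = nn.ModuleList([
    nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1) for _ in backbone_layers
])

color_image = torch.randn(1, 3, 256, 256)
target_det_feat = torch.randn(1, 64, 256, 256)   # output of the modal branch (shape assumed)

fused = color_image
for layer, fusion in zip(backbone_layers, fusion_modules):
    scale_feat = layer(fused)                  # color map features at the current scale
    target_det_feat = fusion(target_det_feat)  # detection features brought to the matching scale
    fused = scale_feat + target_det_feat       # fusion features fed to the next backbone layer
# `fused` now holds the fusion features at the final (target) scale.
```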
8. The method of claim 7, wherein the target detection model further comprises a plurality of feature fusion modules connected in sequence, the number of the feature fusion modules being the same as the number of the backbone network layers, and performing feature fusion processing on the second-scale color map features and the target detection features to obtain the second fusion features comprises:
performing convolution processing on the target detection features through a next feature fusion module connected to a first feature fusion module to obtain depth detection features corresponding to the second-scale color map features, wherein the target detection features are the features output by the first feature fusion module;
and performing feature addition processing on the second-scale color map features and the depth detection features to obtain the second fusion features.
9. The method of claim 8, wherein performing feature addition processing on the second-scale color map features and the depth detection features to obtain the second fusion features comprises:
generating a first weight of the second-scale color map features according to the second-scale color map features;
generating a second weight corresponding to the depth detection features according to the first weight;
and performing feature weighted summation on the second-scale color map features and the depth detection features according to the first weight and the second weight to obtain the second fusion features.
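For illustration only and not forming part of the claims: claim 9 leaves open how the first and second weights are generated; a common pattern consistent with its wording, sketched below as an assumption, is a sigmoid-gated first weight with the second weight taken as its complement.

```python
import torch
import torch.nn as nn

class WeightedFeatureAdd(nn.Module):
    """Weighted summation of two same-shape feature maps (the gating design is an assumption)."""
    def __init__(self, channels=64):
        super().__init__()
        # Produces the first weight from the second-scale color map features.
        self.gate = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, color_feat, depth_det_feat):
        w1 = self.gate(color_feat)   # first weight, generated from the color map features
        w2 = 1.0 - w1                # second weight, generated from the first weight
        return w1 * color_feat + w2 * depth_det_feat   # feature weighted summation

second_fusion = WeightedFeatureAdd()(torch.randn(1, 64, 64, 64), torch.randn(1, 64, 64, 64))
```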
10. The method according to claim 1, wherein performing defect detection on the target object according to the fusion features to obtain the defect detection result of the target object comprises:
inputting the fusion features to a detection head module of the target detection model for defect detection;
and obtaining the defect type and the defect bounding box of the target object output by the detection head module, and obtaining the defect detection result according to the defect type and the defect bounding box.
11. The method of claim 1, wherein acquiring the detection data of the target object comprises:
imaging a front part of the target object through a plurality of image capture devices under a plurality of illumination conditions respectively, to obtain a plurality of images, wherein each image corresponds to one illumination condition;
acquiring photometric information corresponding to each image, and calculating a normal vector of each pixel point according to the photometric information corresponding to each image to generate a normal vector map corresponding to the front part of the target object, the normal vector map being taken as the target modal data;
and selecting a color image corresponding to the front part from the plurality of images, and generating the detection data corresponding to the front part of the target object based on the color image and the target modal data.
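For illustration only and not forming part of the claims: computing per-pixel normal vectors from images captured under several illumination conditions is essentially photometric stereo. A minimal NumPy sketch follows; it assumes known unit lighting directions and a Lambertian surface, neither of which is stated in the claim.

```python
import numpy as np

def normal_vector_map(images, light_dirs):
    """images: (K, H, W) grayscale intensities under K illumination conditions;
    light_dirs: (K, 3) unit lighting directions. Returns (H, W, 3) unit normal vectors."""
    k, h, w = images.shape
    intensities = images.reshape(k, -1)        # photometric information per pixel
    # Least-squares solve L @ (albedo * n) = I for all pixels at once.
    g, *_ = np.linalg.lstsq(light_dirs, intensities, rcond=None)   # (3, H*W)
    norms = np.linalg.norm(g, axis=0, keepdims=True)
    normals = g / np.maximum(norms, 1e-8)      # unit normal vector for each pixel point
    return normals.T.reshape(h, w, 3)

# Synthetic example: four illumination conditions on a 16x16 patch.
dirs = np.array([[0, 0, 1], [1, 0, 1], [0, 1, 1], [-1, 0, 1]], dtype=float)
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
nmap = normal_vector_map(np.random.rand(4, 16, 16), dirs)   # candidate target modal data
```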
12. The method of claim 1, wherein the training of the target detection model comprises:
acquiring sample detection data of a sample object comprising a sample color image and sample modal data, and an annotated sample defect result carried by the sample object;
inputting the sample detection data to a multi-modal branch of a model to be trained, and inputting the sample color image to a single-modal branch of the model to be trained, to obtain corresponding sample detection features;
performing feature processing of different scales on the sample color image through a backbone network of the model to be trained to obtain multi-scale sample color map features, and performing feature fusion on the sample detection features and the multi-scale sample color map features to obtain sample fusion features;
performing defect detection on the sample fusion features through a detection head of the model to be trained to obtain a sample defect detection result;
and constructing a model loss according to the difference between the sample defect detection result and the annotated sample defect result, and adjusting model parameters of the model to be trained according to the model loss to obtain the target detection model.
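For illustration only and not forming part of the claims: a schematic training step matching claim 12, assuming PyTorch; the loss composition (classification plus box regression) and the batch keys are assumptions, since the claim only requires a loss built from the difference between the predicted and annotated defect results.

```python
import torch

def train_step(model, optimizer, batch):
    """batch: dict with 'color', 'modal', 'defect_class', 'defect_boxes' (keys assumed)."""
    optimizer.zero_grad()
    # The multi-modal branch consumes both inputs; the single-modal branch consumes the color image only.
    pred_cls, pred_boxes = model(batch["color"], batch["modal"])
    # Model loss from the difference between predictions and the annotated sample defect result.
    cls_loss = torch.nn.functional.cross_entropy(pred_cls, batch["defect_class"])
    box_loss = torch.nn.functional.smooth_l1_loss(pred_boxes, batch["defect_boxes"])
    loss = cls_loss + box_loss
    loss.backward()
    optimizer.step()   # adjust model parameters according to the model loss
    return loss.item()
```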
13. A defect detection apparatus, comprising:
an acquisition module, configured to acquire detection data of a target object;
a feature processing module, configured to, if the detection data comprises a color image and target modal data, perform feature extraction on the color image and the target modal data to obtain target detection features, wherein the target detection features fuse the features of the color image and the features of the target modal data;
a feature fusion module, configured to perform feature extraction of different scales on the color image through a backbone network of a pre-trained target detection model to obtain multi-scale color map features, and perform feature fusion on the target detection features and the multi-scale color map features to obtain fusion features;
and a defect detection module, configured to perform defect detection on the target object according to the fusion features to obtain a defect detection result of the target object.
14. An electronic device, comprising:
one or more processors; and
storage means for storing one or more programs that, when executed by the one or more processors, cause the electronic device to perform the method of any one of claims 1 to 12.
15. A computer readable storage medium, having stored thereon a computer program which, when executed by a processor of an electronic device, causes the electronic device to perform the method of any of claims 1 to 12.
16. A computer program product, characterized in that it comprises a computer program stored in a computer readable storage medium, from which a processor of an electronic device reads and executes the computer program, causing the electronic device to perform the method of any one of claims 1 to 12.
CN202410335832.2A 2024-03-22 2024-03-22 Defect detection method, device, equipment and medium Active CN117934478B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410335832.2A CN117934478B (en) 2024-03-22 2024-03-22 Defect detection method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN117934478A true CN117934478A (en) 2024-04-26
CN117934478B CN117934478B (en) 2024-06-25

Family

ID=90759576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410335832.2A Active CN117934478B (en) 2024-03-22 2024-03-22 Defect detection method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117934478B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4280173A1 (en) * 2022-05-17 2023-11-22 Anhui NIO Autonomous Driving Technology Co., Ltd. Multi-task object detection method, electronic device, medium, and vehicle
CN115439458A (en) * 2022-09-21 2022-12-06 征图新视(江苏)科技股份有限公司 Industrial image defect target detection algorithm based on depth map attention
CN115578460A (en) * 2022-11-10 2023-01-06 湖南大学 Robot grabbing method and system based on multi-modal feature extraction and dense prediction
CN116977260A (en) * 2023-03-15 2023-10-31 腾讯科技(深圳)有限公司 Target defect detection method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赖锦祥 (Lai Jinxiang) et al.: "Multi-view solder joint defect detection based on an improved residual network" (基于改进残差网络的多视图焊点缺陷检测), Transactions of the China Welding Institution (焊接学报), No. 03, 31 March 2022 (2022-03-31), pages 56-62 *

Also Published As

Publication number Publication date
CN117934478B (en) 2024-06-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant