CN117274253A - Part detection method and device based on multi-modal Transformer and readable medium - Google Patents

Part detection method and device based on multi-modal Transformer and readable medium

Info

Publication number
CN117274253A
CN117274253A (application CN202311546437.0A)
Authority
CN
China
Prior art keywords
spare
layer
swin
module
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311546437.0A
Other languages
Chinese (zh)
Other versions
CN117274253B (en)
Inventor
童斌斌
温廷羲
李静雯
杨其建
曾焕强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaqiao University
Original Assignee
Huaqiao University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaqiao University filed Critical Huaqiao University
Priority to CN202311546437.0A priority Critical patent/CN117274253B/en
Publication of CN117274253A publication Critical patent/CN117274253A/en
Application granted granted Critical
Publication of CN117274253B publication Critical patent/CN117274253B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0004Industrial image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30108Industrial image inspection
    • G06T2207/30164Workpiece; Machine component

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a part detection method, device and readable medium based on a multi-modal Transformer, relating to the field of image processing. The method comprises the following steps: acquiring the weight of a part and superimposing images of the part taken at different angles to obtain an input image; constructing and training a part detection model, and inputting the input image and weight into the trained part detection model to obtain a feature vector; establishing a part feature vector database containing the feature vectors of parts of known models; and inputting the input image and weight of a part to be detected into the trained part detection model to obtain its feature vector, comparing this feature vector with the feature vectors in the part feature vector database, and thereby detecting the model of the part to be detected. This addresses the poor accuracy of traditional image detection methods that determine part models from a single type of data.

Description

Part detection method and device based on multi-modal Transformer and readable medium
Technical Field
The invention relates to the field of image processing, in particular to a part detection method and device based on a multi-modal Transformer, and a readable medium.
Background
With the continuous development of factory intelligence and computer technology in China, how to realize intelligent detection and classification of spare and accessory parts in factory workshops is receiving more and more attention. Because manufacturing is refined and highly standardized, extremely high requirements are placed on the accuracy of part model detection. At present, different parts are mainly detected manually in factories. Manual detection places high demands on staff: they must memorize a large number of different parts and make judgments in a very short time, and the error rate for very small, fine parts is high. Image detection technology can not only improve detection accuracy but also save a large amount of human resources and time; it provides an efficient and accurate solution for intelligent detection of part models and promotes the application of deep learning technology in practice.
Conventional image detection methods typically accept only a single type of data, such as a single picture. As a result, their accuracy in detecting part models is poor, and similar parts cannot be distinguished correctly. Although conventional image detection methods can detect parts within a certain range, their general applicability is still far from sufficient. Therefore, a part model detection method is needed that can both collect appearance information from multiple angles and incorporate weight information, providing more useful cues for judging the part model, so as to improve detection accuracy while allowing the technique to be applied in a wider range of fields.
Disclosure of Invention
To address the technical problems mentioned above, an objective of the embodiments of the present application is to provide a part detection method, device and readable medium based on a multi-modal Transformer, so as to solve the technical problems mentioned in the background section.
In a first aspect, the present invention provides a spare and accessory part detection method based on a multi-modal Transformer, including the following steps:
acquiring the weight of the spare and accessory parts and a plurality of images of the spare and accessory parts shot at different angles, and superposing the images to obtain an input image;
constructing and training a spare and accessory part detection model to obtain a trained spare and accessory part detection model, wherein the spare and accessory part detection model comprises a multi-modal feature fusion module, an improved Swin Transformer module, a global average pooling layer and a full connection layer which are sequentially connected; inputting an input image and weight into the trained spare and accessory part detection model, performing feature fusion on the input image and weight through the multi-modal feature fusion module to obtain a multi-modal feature map, inputting the multi-modal feature map into the improved Swin Transformer module for feature extraction to obtain fusion features, and inputting the fusion features into the global average pooling layer and the full connection layer to obtain a feature vector;
Inputting the input image and weight of the spare and accessory parts with known model into a trained spare and accessory part detection model to obtain corresponding feature vectors, and establishing a spare and accessory part feature vector database;
inputting the input image and weight of the spare and accessory parts to be detected into a trained spare and accessory part detection model to obtain feature vectors of the spare and accessory parts to be detected, comparing the feature vectors of the spare and accessory parts to be detected with feature vectors in a spare and accessory part feature vector database, and detecting to obtain the model of the spare and accessory parts to be detected.
Preferably, the multi-mode feature fusion module comprises a first convolution layer, a second convolution layer, an average pooling layer and a normalization layer, wherein the convolution kernel size of the first convolution layer is 3×3, and the convolution kernel size of the second convolution layer is 1×1.
Preferably, the multi-mode feature fusion module is used for carrying out feature fusion on the input image and the weight to obtain a multi-mode feature map, and the method specifically comprises the following steps:
the plurality of images of the part photographed at different angles comprise a front image, a side image and a top image of the part; the front image, the side image and the top image are superimposed to obtain the input image; the input image sequentially passes through the first convolution layer, the second convolution layer, the average pooling layer and the normalization layer to obtain image features; the weight is normalized to obtain a normalized weight value; and the normalized weight value is added to the image features to obtain the multi-modal feature map.
Preferably, the improved Swin Transformer module comprises a first unit, a third convolution layer, a first channel attention network, a fourth convolution layer, a second unit, a fifth convolution layer, a second channel attention network, a sixth convolution layer, a third unit, a seventh convolution layer, a third channel attention network, an eighth convolution layer and a fourth unit, wherein the first unit comprises a linear embedding layer and a first twin Swin Transformer module, the second unit comprises a first patch merging layer and a second twin Swin Transformer module, the third unit comprises a second patch merging layer and a third twin Swin Transformer module, the fourth unit comprises a third patch merging layer and a fourth twin Swin Transformer module, the convolution kernel sizes of the third convolution layer, the fifth convolution layer and the seventh convolution layer are all 3×3, and the convolution kernel sizes of the fourth convolution layer, the sixth convolution layer and the eighth convolution layer are all 1×1.
Preferably, inputting the multi-modal feature map into the improved Swin Transformer module for feature extraction to obtain the fusion features specifically comprises:
the multi-modal feature map is input into the linear embedding layer; the output of the linear embedding layer passes through the first twin Swin Transformer module and the first patch merging layer on the main path, and through the third convolution layer and the first channel attention network on the skip path; the output of the first channel attention network is added to the output of the first patch merging layer and input into the fourth convolution layer; the output of the fourth convolution layer passes through the second twin Swin Transformer module and the second patch merging layer on the main path, while the output of the first patch merging layer passes through the fifth convolution layer and the second channel attention network on the skip path; the output of the second channel attention network is added to the output of the second patch merging layer and input into the sixth convolution layer; the output of the sixth convolution layer passes through the third twin Swin Transformer module and the third patch merging layer on the main path, while the output of the second patch merging layer passes through the seventh convolution layer and the third channel attention network on the skip path; the output of the third channel attention network is added to the output of the third patch merging layer and input into the eighth convolution layer; and the output of the eighth convolution layer is input into the fourth twin Swin Transformer module to obtain the fusion features.
Preferably, the first twin Swin Transformer module, the second twin Swin Transformer module, the third twin Swin Transformer module and the fourth twin Swin Transformer module all adopt the same twin Swin Transformer module structure, and the twin Swin Transformer module comprises a first Swin Transformer module, a second Swin Transformer module and a ninth convolution layer, the convolution kernel size of the ninth convolution layer being 1×1; the features input into the twin Swin Transformer module are divided into a first feature subset and a second feature subset by equidistant sampling, the first feature subset and the second feature subset are respectively input into the first Swin Transformer module and the second Swin Transformer module to output a first result subset and a second result subset, the first result subset and the second result subset are added to obtain a result subset, and the result subset is input into the ninth convolution layer to obtain the output of the twin Swin Transformer module.
Preferably, a triplet loss function is adopted in the training process of the spare part detection model, and the formula is as follows:
L(a, p, n) = max(||f(a) − f(p)||² − ||f(a) − f(n)||² + α, 0)
wherein ||·||² represents the squared Euclidean norm, max(x, 0) takes the larger of x and 0, f(·) represents the mapping of an input image and part weight to a feature vector, α is a preset margin controlling the minimum threshold of the distance difference between positive and negative samples, a represents the anchor sample, p represents the positive sample, and n represents the negative sample.
In a second aspect, the present invention provides a spare and accessory part detection device based on a multi-modal Transformer, including:
the data acquisition module is configured to acquire the weight of the spare and accessory parts and a plurality of images of the spare and accessory parts shot at different angles, and the images are overlapped to obtain an input image;
the model construction module is configured to construct and train a spare and accessory part detection model to obtain a trained spare and accessory part detection model, wherein the spare and accessory part detection model comprises a multi-modal feature fusion module, an improved Swin Transformer module, a global average pooling layer and a full connection layer which are sequentially connected; an input image and weight are input into the trained spare and accessory part detection model, the input image and weight undergo feature fusion through the multi-modal feature fusion module to obtain a multi-modal feature map, the multi-modal feature map is input into the improved Swin Transformer module for feature extraction to obtain fusion features, and the fusion features are input into the global average pooling layer and the full connection layer to obtain a feature vector;
the database building module is configured to input the input image and weight of the spare and accessory parts with known model into the trained spare and accessory part detection model to obtain corresponding feature vectors, and build a spare and accessory part feature vector database;
The comparison module is configured to input the input image and the weight of the spare and accessory parts to be detected into the trained spare and accessory part detection model to obtain the feature vector of the spare and accessory parts to be detected, and compare the feature vector of the spare and accessory parts to be detected with the feature vector in the spare and accessory part feature vector database to detect the model of the spare and accessory parts to be detected.
In a third aspect, the present invention provides an electronic device comprising one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any of the implementations of the first aspect.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first aspect.
Compared with the prior art, the invention has the following beneficial effects:
(1) The part detection method based on the multi-modal Transformer provided by the invention can detect parts using multi-modal data comprising the front image, side image, top image and weight of a part. The multi-modal feature fusion module can effectively extract features from the multi-modal data and normalize them so that the model is better suited to subsequent processing. By utilizing multi-modal data, the accuracy of part model detection is improved, and the part model can be determined more reliably without being limited to single-modality data, thereby achieving more accurate detection of complex objects.
(2) The improved Swin Transformer module adopted in the part detection model increases the inference speed of the part detection model while ensuring full interaction of the data in the multi-modal feature map. Meanwhile, to enhance the model's handling of detail features, a skip structure is introduced between adjacent units, and a channel attention network design supplements detail information that the part detection model might otherwise ignore during feature extraction. This series of improvements is intended to raise the performance and accuracy of the part detection model, making it more efficient at handling a variety of complex tasks.
(3) The part detection method based on the multi-modal Transformer uses the trained part detection model to establish a part feature vector database, which stores the feature vectors of all parts that may need to be detected. When the trained part detection model is used for detection, the closest feature match can be found by calculating the Euclidean distance between the feature vector of the part to be detected and the feature vectors in the database, thereby realizing part detection. This method ensures that different parts can be detected efficiently and accurately and provides strong support for related applications.
(4) The part detection method based on the multi-modal Transformer provided by the invention adopts a triplet loss function in the training process of the part detection model, which effectively realizes the training of the model. The method has important applications in part model detection: it makes full use of multi-modal data and an improved Transformer network structure, improves detection accuracy and reliability, and provides strong support for industrial applications.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an exemplary device architecture diagram to which an embodiment of the present application may be applied;
FIG. 2 is a flow chart of a part detection method based on a multi-modal Transformer according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the part detection model of a part detection method based on a multi-modal Transformer according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the improved Swin Transformer module of a part detection method based on a multi-modal Transformer according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the twin Swin Transformer module of a part detection method based on a multi-modal Transformer according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a part detection device based on a multi-modal Transformer according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a computer device suitable for implementing the embodiments of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 illustrates an exemplary device architecture 100 to which the multi-modal Transformer based part detection method or the multi-modal Transformer based part detection device of embodiments of the present application may be applied.
As shown in fig. 1, the apparatus architecture 100 may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the first terminal device 101, the second terminal device 102 and the third terminal device 103 to receive or send messages and the like. Various applications, such as data processing applications and file processing applications, may be installed on the first terminal device 101, the second terminal device 102 and the third terminal device 103.
The first terminal device 101, the second terminal device 102 and the third terminal device 103 may be hardware or software. When they are hardware, they may be various electronic devices, including but not limited to smart phones, tablet computers, laptop computers, desktop computers and the like. When they are software, they may be installed in the electronic devices listed above and may be implemented as multiple pieces of software or software modules (for example, software or software modules for providing distributed services) or as a single piece of software or software module. No particular limitation is imposed here.
The server 105 may be a server that provides various services, such as a background data processing server that processes files or data uploaded by the first terminal device 101, the second terminal device 102 and the third terminal device 103. The background data processing server can process the acquired files or data to generate a processing result.
It should be noted that the part detection method based on the multi-modal Transformer according to the embodiments of the present application may be executed by the server 105, or may be executed by the first terminal device 101, the second terminal device 102 or the third terminal device 103. Accordingly, the part detection device based on the multi-modal Transformer may be disposed in the server 105, or may be disposed in the first terminal device 101, the second terminal device 102 or the third terminal device 103.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In the case where the processed data does not need to be acquired from a remote location, the above-described apparatus architecture may not include a network, but only a server or terminal device.
Fig. 2 shows a part detection method based on a multi-modal Transformer according to an embodiment of the present application, including the following steps:
S1, acquiring the weight of the spare and accessory parts and a plurality of images of the spare and accessory parts shot at different angles, and superposing the images to obtain an input image.
Specifically, referring to fig. 3, the industrial camera is used to capture images of the parts from multiple angles (front, side, top) and simultaneously capture the weight of the parts. The unit of weight is micrograms.
S2, constructing and training a spare and accessory part detection model to obtain a trained spare and accessory part detection model, wherein the spare and accessory part detection model comprises a multi-modal feature fusion module, an improved Swin Transformer module, a global average pooling layer and a full connection layer which are sequentially connected; the input image and weight are input into the trained spare and accessory part detection model, the input image and weight undergo feature fusion through the multi-modal feature fusion module to obtain a multi-modal feature map, the multi-modal feature map is input into the improved Swin Transformer module for feature extraction to obtain fusion features, and the fusion features are input into the global average pooling layer and the full connection layer to obtain a feature vector.
In a specific embodiment, the multi-modal feature fusion module includes a first convolution layer, a second convolution layer, an averaging pooling layer, and a normalization layer, where the convolution kernel of the first convolution layer has a size of 3×3, and the convolution kernel of the second convolution layer has a size of 1×1.
In a specific embodiment, the method for obtaining the multi-modal feature map includes the steps of:
the plurality of images of the part photographed at different angles comprise a front image, a side image and a top image of the part; the front image, the side image and the top image are superimposed to obtain the input image; the input image sequentially passes through the first convolution layer, the second convolution layer, the average pooling layer and the normalization layer to obtain image features; the weight is normalized to obtain a normalized weight value; and the normalized weight value is added to the image features to obtain the multi-modal feature map.
Specifically, the front image, side image and top image of the part are superimposed into one 9-channel input image. The input image undergoes feature extraction and feature size compression through a first convolution layer with a convolution kernel size of 3×3 and a second convolution layer with a convolution kernel size of 1×1. An average pooling layer then reduces the size of the feature map output by the second convolution layer, and finally a normalization layer performs a layer normalization (Layer Normalization) operation, yielding image features that contain the part's appearance from multiple angles. Further, the weight of the part is normalized to obtain a normalized weight value; the specific normalization operation is shown in the following formula:
normalized weight value = (x − x_min) / (x_max − x_min)
wherein x represents the weight of the current spare and accessory part; x_min represents the weight of the lightest part among all types of parts to be detected; and x_max represents the weight of the heaviest part among all types of parts to be detected.
The normalized weight value is then added to the extracted image features to generate the multi-modal feature map: the normalized weight value is added to every pixel value in the image features. The multi-modal feature map thus incorporates the part's front, side and top images together with its weight information. Compared with traditional deep learning methods, the spare and accessory part detection model receives richer part information through the multi-modal feature fusion module, which improves the accuracy of part model detection. This design helps the part detection model understand and utilize the various aspects and features of a part more fully, improving detection performance.
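A minimal PyTorch sketch of the fusion step just described: the three views are stacked into a 9-channel input, passed through a 3×3 convolution, a 1×1 convolution, average pooling and layer normalization, and the min-max normalized weight is added to every element of the resulting features. The channel count, strides, pooling size and the minimum/maximum weights are illustrative assumptions, not values fixed by the patent.

import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Sketch of the multi-modal feature fusion module (hyper-parameters are assumptions)."""
    def __init__(self, out_channels=96, w_min=1.0, w_max=1000.0):
        super().__init__()
        self.conv3x3 = nn.Conv2d(9, out_channels, kernel_size=3, stride=2, padding=1)
        self.conv1x1 = nn.Conv2d(out_channels, out_channels, kernel_size=1)
        self.pool = nn.AvgPool2d(kernel_size=2)
        self.norm = nn.LayerNorm(out_channels)   # layer normalization over channels
        self.w_min, self.w_max = w_min, w_max    # lightest / heaviest part weights

    def forward(self, front, side, top, weight):
        # Stack the three 3-channel views into one 9-channel input image.
        x = torch.cat([front, side, top], dim=1)          # (B, 9, H, W)
        x = self.pool(self.conv1x1(self.conv3x3(x)))      # feature extraction + size reduction
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        # Min-max normalize the weight and add it to every pixel of the image features.
        w = (weight - self.w_min) / (self.w_max - self.w_min)
        return x + w.view(-1, 1, 1, 1)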
In a specific embodiment, the improved Swin Transformer module comprises a first unit, a third convolution layer, a first channel attention network, a fourth convolution layer, a second unit, a fifth convolution layer, a second channel attention network, a sixth convolution layer, a third unit, a seventh convolution layer, a third channel attention network, an eighth convolution layer and a fourth unit; the first unit comprises a linear embedding layer and a first twin Swin Transformer module, the second unit comprises a first patch merging layer and a second twin Swin Transformer module, the third unit comprises a second patch merging layer and a third twin Swin Transformer module, and the fourth unit comprises a third patch merging layer and a fourth twin Swin Transformer module; the convolution kernel sizes of the third, fifth and seventh convolution layers are all 3×3, and the convolution kernel sizes of the fourth, sixth and eighth convolution layers are all 1×1.
In a specific embodiment, inputting the multi-modal feature map into the improved Swin Transformer module for feature extraction to obtain the fusion features specifically includes:
the multi-modal feature map is input into the linear embedding layer; the output of the linear embedding layer passes through the first twin Swin Transformer module and the first patch merging layer on the main path, and through the third convolution layer and the first channel attention network on the skip path; the output of the first channel attention network is added to the output of the first patch merging layer and input into the fourth convolution layer; the output of the fourth convolution layer passes through the second twin Swin Transformer module and the second patch merging layer on the main path, while the output of the first patch merging layer passes through the fifth convolution layer and the second channel attention network on the skip path; the output of the second channel attention network is added to the output of the second patch merging layer and input into the sixth convolution layer; the output of the sixth convolution layer passes through the third twin Swin Transformer module and the third patch merging layer on the main path, while the output of the second patch merging layer passes through the seventh convolution layer and the third channel attention network on the skip path; the output of the third channel attention network is added to the output of the third patch merging layer and input into the eighth convolution layer; and the output of the eighth convolution layer is input into the fourth twin Swin Transformer module to obtain the fusion features.
Specifically, referring to FIG. 4, the multi-modal feature map is input into the improved Swin Transformer module to further extract features. The embodiment of the application improves the network structure of the traditional Swin Transformer module and adds skip connections. The traditional Swin Transformer adopts a 12-layer network structure (2, 2, 6, 2); because of its large scale, such a module is not easy to deploy at the edge. Therefore, the embodiment of the application adjusts the network structure and reduces the number of layers to 8 (2, 2, 2, 2), corresponding to the linear embedding layer and the first twin Swin Transformer module in the first unit, the first patch merging layer and the second twin Swin Transformer module in the second unit, the second patch merging layer and the third twin Swin Transformer module in the third unit, and the third patch merging layer and the fourth twin Swin Transformer module in the fourth unit. To enhance the ability of the spare and accessory part detection model to process detailed information, skip connections are introduced between the different units of the improved Swin Transformer module.
Specifically, between the first unit and the second unit, the output of the linear embedding layer is skip-connected and undergoes feature map scale compression and dimension expansion through a third convolution layer with a convolution kernel size of 3×3. The output of the third convolution layer is then input to the first channel attention network (e.g., SENet), which assigns each channel a different weight to reflect the importance of different channels for the task; this lets the network focus on the most important feature channels and improves model performance. The output of the first channel attention network is fused with the output of the first patch merging layer in the second unit: since the two feature maps have the same scale, they are simply stacked together and the dimension is then halved using a fourth convolution layer with a convolution kernel size of 1×1. Finally, the output of the fourth convolution layer is input into the second twin Swin Transformer module for further processing, completing the skip connection.
Between the second unit and the third unit, the output of the first patch merging layer in the second unit is skip-connected and undergoes feature map scale compression and dimension expansion through a fifth convolution layer with a convolution kernel size of 3×3. A second channel attention network (e.g., SENet) is then used for channel-wise processing of the feature map. The output of the second channel attention network is fused with the output of the second patch merging layer in the third unit, and the dimension is halved using a sixth convolution layer with a convolution kernel size of 1×1. Finally, the output of the sixth convolution layer is input into the third twin Swin Transformer module for further processing, completing the skip connection.
Between the third unit and the fourth unit, the output of the second patch merging layer in the third unit is skip-connected and undergoes feature map scale compression and dimension expansion through a seventh convolution layer with a convolution kernel size of 3×3. A third channel attention network (e.g., SENet) is then used for channel-wise processing of the feature map. The output of the third channel attention network is fused with the output of the third patch merging layer in the fourth unit, and the dimension is halved using an eighth convolution layer with a convolution kernel size of 1×1. Finally, the output of the eighth convolution layer is input into the fourth twin Swin Transformer module for further processing, completing the skip connection.
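To make the skip-connection pattern concrete, here is a minimal PyTorch sketch of one such stage. It assumes SENet-style channel attention and treats the twin Swin Transformer module and the patch merging layer as black-box submodules operating on (B, C, H, W) tensors; the stride of the 3×3 convolution and the concatenate-then-halve fusion follow the description above, but the exact channel dimensions are assumptions.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SENet-style channel attention: per-channel weights from global pooling."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                       # x: (B, C, H, W)
        s = x.mean(dim=(2, 3))                  # squeeze
        w = self.fc(s).view(x.size(0), -1, 1, 1)
        return x * w                            # excite

class SkipStage(nn.Module):
    """One improved-Swin stage: main path (twin Swin + patch merging) plus a
    skip path (3x3 conv + channel attention), stacked and halved by a 1x1 conv."""
    def __init__(self, in_ch, out_ch, twin_swin, patch_merge):
        super().__init__()
        self.twin_swin, self.patch_merge = twin_swin, patch_merge
        self.skip_conv = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)  # compress scale, expand dims
        self.attn = ChannelAttention(out_ch)
        self.fuse = nn.Conv2d(2 * out_ch, out_ch, 1)                       # 1x1 conv halves dimensions

    def forward(self, x):
        main = self.patch_merge(self.twin_swin(x))          # main path
        skip = self.attn(self.skip_conv(x))                 # skip path
        return self.fuse(torch.cat([main, skip], dim=1))    # stack, then halve channels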
In a specific embodiment, the first twin Swin Transformer module, the second twin Swin Transformer module, the third twin Swin Transformer module and the fourth twin Swin Transformer module all adopt the same twin Swin Transformer structure, which comprises a first Swin Transformer module, a second Swin Transformer module and a ninth convolution layer, the convolution kernel size of the ninth convolution layer being 1×1; the features input into the twin Swin Transformer module are divided into a first feature subset and a second feature subset by equidistant sampling, the first feature subset and the second feature subset are respectively input into the first Swin Transformer module and the second Swin Transformer module to output a first result subset and a second result subset, the first result subset and the second result subset are added to obtain a result subset, and the result subset is input into the ninth convolution layer to obtain the output of the twin Swin Transformer module.
Specifically, referring to fig. 5, for the first twin Swin Transformer module, the second twin Swin Transformer module, the third twin Swin Transformer module and the fourth twin Swin Transformer module, the embodiment of the application designs a parallel first Swin Transformer module and second Swin Transformer module based on the Swin Transformer. The input features are divided into two equally sized feature subsets by equidistant sampling, and the two feature subsets are then input into the first Swin Transformer module and the second Swin Transformer module, respectively. The first result subset and the second result subset output by the two modules are combined into one result subset, and finally the result subset is fused using a ninth convolution layer with a convolution kernel size of 1×1 to enhance the interaction between features, thereby improving the accuracy of the model.
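A minimal sketch of the twin Swin Transformer module, assuming the features form a (B, L, C) token sequence and that the two result subsets are re-interleaved before the 1×1 convolution; if the "added" wording is read literally, the two subsets would instead be summed element-wise. swin_block_a and swin_block_b stand for ordinary Swin Transformer blocks and are passed in rather than implemented here.

import torch
import torch.nn as nn

class TwinSwinBlock(nn.Module):
    """Twin Swin Transformer module: split the tokens into two equal subsets by
    equidistant (interleaved) sampling, run two parallel Swin blocks, recombine
    the results, and fuse them with a 1x1 convolution."""
    def __init__(self, swin_block_a, swin_block_b, channels):
        super().__init__()
        self.block_a, self.block_b = swin_block_a, swin_block_b
        self.fuse = nn.Conv1d(channels, channels, kernel_size=1)  # 1x1 conv over tokens

    def forward(self, tokens):                                    # tokens: (B, L, C), L even
        sub_a, sub_b = tokens[:, 0::2, :], tokens[:, 1::2, :]     # equidistant sampling
        out_a, out_b = self.block_a(sub_a), self.block_b(sub_b)
        b, l, c = tokens.shape
        merged = torch.stack([out_a, out_b], dim=2).reshape(b, l, c)  # re-interleave results
        return self.fuse(merged.transpose(1, 2)).transpose(1, 2)      # enhance feature interaction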
In a specific embodiment, a triplet loss function is adopted in the training process of the spare and accessory part detection model, and the formula is as follows:
L(a, p, n) = max(||f(a) − f(p)||² − ||f(a) − f(n)||² + α, 0)
wherein ||·||² represents the squared Euclidean norm, max(x, 0) takes the larger of x and 0, f(·) represents the mapping of an input image and part weight to a feature vector, α is a preset margin controlling the minimum threshold of the distance difference between positive and negative samples, a represents the anchor sample, p represents the positive sample, and n represents the negative sample.
Specifically, the embodiments of the present application use the concept of FaceNet, with the triplet loss function. The triplet loss function forces the network to learn to map samples of the same class to close points and samples of different classes to far away points. The object is to make the distance of the same class sample in the space as small as possible and the distance of the different class sample as large as possible by learning a proper feature space.
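A short PyTorch sketch of the triplet loss described above; the margin value 0.2 and the batch-mean reduction are assumptions rather than values given in the patent.

import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss as in FaceNet: push the anchor-positive distance below the
    anchor-negative distance by at least the margin alpha. f_a, f_p, f_n are the
    feature vectors of the anchor, positive and negative samples."""
    d_pos = (f_a - f_p).pow(2).sum(dim=1)        # squared Euclidean distance anchor-positive
    d_neg = (f_a - f_n).pow(2).sum(dim=1)        # squared Euclidean distance anchor-negative
    return F.relu(d_pos - d_neg + alpha).mean()  # max(., 0), averaged over the batch

# PyTorch also ships a similar built-in (using un-squared distances):
# loss_fn = torch.nn.TripletMarginLoss(margin=0.2, p=2)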
S3, inputting the input image and weight of the spare and accessory parts with known models into the trained spare and accessory part detection model to obtain corresponding feature vectors, and establishing a spare and accessory part feature vector database.
Specifically, with the aid of the trained part detection model, the embodiment of the application constructs a part feature vector database, namely, an input image and weight of a part of a known model are input into the trained part detection model, a corresponding 128-dimensional feature vector is extracted, and the feature vector of the part of the known model and the corresponding model thereof are stored in the part feature vector database.
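A sketch of the database-building step, assuming the trained model is callable as model(image, weight) and returns a 128-dimensional feature vector; the dictionary layout and the structure of known_parts are illustrative assumptions.

import numpy as np

def build_feature_database(model, known_parts):
    """For every part of known model number, run its input image and weight through the
    trained detection model and store the resulting 128-dimensional feature vector."""
    database = {}
    for model_number, (image, weight) in known_parts.items():
        vec = model(image, weight)     # 128-dim feature vector (detach first if it is a torch tensor)
        database[model_number] = np.asarray(vec, dtype=np.float32).ravel()
    return database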
S4, inputting the input image and the weight of the spare and accessory parts to be detected into a trained spare and accessory part detection model to obtain feature vectors of the spare and accessory parts to be detected, comparing the feature vectors of the spare and accessory parts to be detected with feature vectors in a spare and accessory part feature vector database, and detecting to obtain the model of the spare and accessory parts to be detected.
Specifically, during actual detection, the Euclidean distance between the feature vector of the part to be detected and each feature vector in the part feature vector database is calculated, and the model corresponding to the database feature vector with the smallest Euclidean distance is taken as the model of the part to be detected, so that the part model can be detected accurately.
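A matching sketch that pairs with the database sketch above: extract the query feature vector, compute its Euclidean distance to every stored vector, and return the model with the smallest distance. The call signature of model follows the same assumption made when building the database.

import numpy as np

def detect_model_number(model, database, image, weight):
    """Return the database model whose stored feature vector is nearest (Euclidean) to the query."""
    query = np.asarray(model(image, weight), dtype=np.float32).ravel()
    best, best_dist = None, float("inf")
    for model_number, vec in database.items():
        dist = float(np.linalg.norm(query - vec))      # Euclidean distance
        if dist < best_dist:
            best, best_dist = model_number, dist
    return best, best_dist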
The labels S1-S4 above are merely step identifiers and do not by themselves dictate the execution order of the steps.
With further reference to fig. 6, as an implementation of the method shown in the figures above, the present application provides an embodiment of a part detection device based on a multi-modal Transformer. The device embodiment corresponds to the method embodiment shown in fig. 2, and the device may be applied to various electronic devices.
The embodiment of the application provides a spare and accessory part detection device based on a multi-modal Transformer, which comprises:
The data acquisition module 1 is configured to acquire the weight of the spare and accessory parts and a plurality of images of the spare and accessory parts shot at different angles, and superimpose the images to obtain an input image;
the model construction module 2 is configured to construct and train a spare and accessory part detection model to obtain a trained spare and accessory part detection model, wherein the spare and accessory part detection model comprises a multi-modal feature fusion module, an improved Swin Transformer module, a global average pooling layer and a full connection layer which are sequentially connected; an input image and weight are input into the trained spare and accessory part detection model, the input image and weight undergo feature fusion through the multi-modal feature fusion module to obtain a multi-modal feature map, the multi-modal feature map is input into the improved Swin Transformer module for feature extraction to obtain fusion features, and the fusion features are input into the global average pooling layer and the full connection layer to obtain a feature vector;
the database building module 3 is configured to input an input image and weight of a part with a known model into the trained part detection model to obtain a corresponding feature vector, and build a part feature vector database;
the comparison module 4 is configured to input the input image and the weight of the spare and accessory parts to be detected into the trained spare and accessory part detection model to obtain the feature vector of the spare and accessory parts to be detected, and compare the feature vector of the spare and accessory parts to be detected with the feature vector in the spare and accessory part feature vector database to detect the model of the spare and accessory parts to be detected.
Referring now to fig. 7, there is illustrated a schematic diagram of a computer apparatus 700 suitable for use in implementing an electronic device (e.g., a server or terminal device as illustrated in fig. 1) of an embodiment of the present application. The electronic device shown in fig. 7 is only an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present application.
As shown in fig. 7, the computer apparatus 700 includes a Central Processing Unit (CPU) 701 and a Graphics Processor (GPU) 702, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 703 or a program loaded from a storage section 709 into a Random Access Memory (RAM) 704. In the RAM 704, various programs and data required for the operation of the computer device 700 are also stored. The CPU 701, the GPU702, the ROM 703, and the RAM 704 are connected to each other through a bus 705. An input/output (I/O) interface 706 is also connected to the bus 705.
The following components are connected to the I/O interface 706: an input section 707 including a keyboard, a mouse and the like; an output section 708 including components such as a liquid crystal display (LCD) and a speaker; a storage section 709 including a hard disk and the like; and a communication section 710 including a network interface card such as a LAN card or a modem. The communication section 710 performs communication processing via a network such as the Internet. A drive 711 may also be connected to the I/O interface 706 as needed. A removable medium 712, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 711 as needed, so that a computer program read therefrom is installed into the storage section 709 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network via the communication portion 710, and/or installed from the removable media 712. The above-described functions defined in the method of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 701 and a Graphics Processor (GPU) 702.
It should be noted that the computer readable medium described in the present application may be a computer readable signal medium or a computer readable medium, or any combination of the two. The computer readable medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor apparatus, device, or means, or a combination of any of the foregoing. More specific examples of the computer-readable medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution apparatus, device, or apparatus. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may be any computer readable medium that is not a computer readable medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution apparatus, device, or apparatus. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based devices which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments described in the present application may be implemented by software, or may be implemented by hardware. The described modules may also be provided in a processor.
As another aspect, the present application also provides a computer readable medium that may be contained in the electronic device described in the above embodiments, or may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire the weight of a spare and accessory part and a plurality of images of the part shot at different angles, and superimpose the images to obtain an input image; construct and train a spare and accessory part detection model to obtain a trained spare and accessory part detection model, the spare and accessory part detection model comprising a multi-modal feature fusion module, an improved Swin Transformer module, a global average pooling layer and a full connection layer which are sequentially connected, where the input image and weight are input into the trained spare and accessory part detection model, the input image and weight undergo feature fusion through the multi-modal feature fusion module to obtain a multi-modal feature map, the multi-modal feature map is input into the improved Swin Transformer module for feature extraction to obtain fusion features, and the fusion features are input into the global average pooling layer and the full connection layer to obtain a feature vector; input the input image and weight of spare and accessory parts of known models into the trained spare and accessory part detection model to obtain corresponding feature vectors, and establish a spare and accessory part feature vector database; and input the input image and weight of a spare and accessory part to be detected into the trained spare and accessory part detection model to obtain its feature vector, compare this feature vector with the feature vectors in the spare and accessory part feature vector database, and detect the model of the part to be detected.
The foregoing description is only of the preferred embodiments of the present application and is presented as a description of the principles of the technology being utilized. It will be appreciated by persons skilled in the art that the scope of the invention referred to in this application is not limited to the specific combinations of features described above, but it is intended to cover other embodiments in which any combination of features described above or equivalents thereof is possible without departing from the spirit of the invention. Such as the above-described features and technical features having similar functions (but not limited to) disclosed in the present application are replaced with each other.

Claims (10)

1. A part detection method based on a multi-modal Transformer, characterized by comprising the following steps:
acquiring the weight of the spare and accessory parts and a plurality of images of the spare and accessory parts shot at different angles, and superposing the images to obtain an input image;
constructing and training a spare and accessory part detection model to obtain a trained spare and accessory part detection model, wherein the spare and accessory part detection model comprises a multi-modal feature fusion module, an improved Swin Transformer module, a global average pooling layer and a full connection layer which are sequentially connected; the input image and the weight are input into the trained spare and accessory part detection model, the input image and the weight undergo feature fusion through the multi-modal feature fusion module to obtain a multi-modal feature map, the multi-modal feature map is input into the improved Swin Transformer module for feature extraction to obtain fusion features, and the fusion features are input into the global average pooling layer and the full connection layer to obtain a feature vector;
Inputting the input image and the weight of the spare and accessory parts with known models into a trained spare and accessory part detection model to obtain corresponding feature vectors, and establishing a spare and accessory part feature vector database;
inputting the input image and the weight of the spare and accessory parts to be detected into a trained spare and accessory part detection model to obtain the feature vector of the spare and accessory parts to be detected, comparing the feature vector of the spare and accessory parts to be detected with the feature vector in the spare and accessory part feature vector database, and detecting to obtain the model of the spare and accessory parts to be detected.
2. The spare part detection method based on a multimodal Transformer according to claim 1, wherein the multimodal feature fusion module comprises a first convolution layer, a second convolution layer, an average pooling layer and a normalization layer, the convolution kernel size of the first convolution layer is 3×3, and the convolution kernel size of the second convolution layer is 1×1.
3. The spare part detection method based on a multimodal Transformer according to claim 2, wherein fusing the input image and the weight by the multimodal feature fusion module to obtain the multimodal feature map specifically comprises:
the images of the spare part taken from different angles comprise a front image, a side image and a top image of the spare part; the front image, the side image and the top image are stacked to obtain the input image; the input image sequentially passes through the first convolution layer, the second convolution layer, the average pooling layer and the normalization layer to obtain image features; the weight is normalized to obtain a normalized weight value; and the normalized weight value is added to the image features to obtain the multimodal feature map.
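For illustration, a minimal PyTorch sketch of such a multimodal feature fusion module follows; the channel counts, pooling window, choice of batch normalization and the weight scaling constant are assumptions not fixed by claims 2-3.

```python
import torch
import torch.nn as nn

class MultimodalFeatureFusion(nn.Module):
    """Sketch of claims 2-3: 3x3 conv -> 1x1 conv -> average pooling -> normalization,
    then the normalized scalar weight is added to the image features."""
    def __init__(self, in_channels=9, out_channels=48):
        super().__init__()
        self.conv3x3 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)  # first convolution layer
        self.conv1x1 = nn.Conv2d(out_channels, out_channels, kernel_size=1)            # second convolution layer
        self.pool = nn.AvgPool2d(kernel_size=2)                                        # average pooling layer
        self.norm = nn.BatchNorm2d(out_channels)                                       # normalization layer

    def forward(self, image, weight, weight_max=100.0):
        # image: (B, 9, H, W) -- front, side and top RGB views stacked along channels
        # weight: (B,) tensor of spare part weights
        x = self.norm(self.pool(self.conv1x1(self.conv3x3(image))))
        w = (weight / weight_max).view(-1, 1, 1, 1)   # normalized weight value
        return x + w                                   # multimodal feature map
```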
4. The spare part detection method based on a multimodal Transformer according to claim 1, wherein the improved Swin Transformer module comprises a first unit, a third convolution layer, a first channel attention network, a fourth convolution layer, a second unit, a fifth convolution layer, a second channel attention network, a sixth convolution layer, a third unit, a seventh convolution layer, a third channel attention network, an eighth convolution layer and a fourth unit; the first unit comprises a linear embedding layer and a first twin Swin Transformer module, the second unit comprises a first patch merging layer and a second twin Swin Transformer module, the third unit comprises a second patch merging layer and a third twin Swin Transformer module, and the fourth unit comprises a third patch merging layer and a fourth twin Swin Transformer module; the convolution kernel sizes of the third, fifth and seventh convolution layers are all 3×3, and the convolution kernel sizes of the fourth, sixth and eighth convolution layers are all 1×1.
5. The spare part detection method based on a multimodal Transformer according to claim 4, wherein inputting the multimodal feature map into the improved Swin Transformer module for feature extraction to obtain the fused features specifically comprises:
the multimodal feature map is input into the linear embedding layer; the output of the linear embedding layer passes through the first twin Swin Transformer module and the first patch merging layer on one path, and through the third convolution layer and the first channel attention network on the other; the output of the first channel attention network is added to the output of the first patch merging layer, and the sum is input into the fourth convolution layer; the output of the fourth convolution layer sequentially passes through the second twin Swin Transformer module and the second patch merging layer, while the output of the first patch merging layer sequentially passes through the fifth convolution layer and the second channel attention network; the output of the second channel attention network is added to the output of the second patch merging layer, and the sum is input into the sixth convolution layer; the output of the sixth convolution layer sequentially passes through the third twin Swin Transformer module and the third patch merging layer, while the output of the second patch merging layer sequentially passes through the seventh convolution layer and the third channel attention network; the output of the third channel attention network is added to the output of the third patch merging layer, and the sum is input into the eighth convolution layer; and the output of the eighth convolution layer passes through the fourth twin Swin Transformer module to obtain the fused features.
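For illustration, one stage of this structure can be sketched as follows; the SE-style channel attention, the stride-2 3×3 convolution (chosen to match the downsampling of patch merging) and the (B, C, H, W) feature layout are assumptions, and the twin Swin Transformer path is treated as a black box supplied by the caller.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention; the patent's exact attention design may differ."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)   # reweight the channels of x

class StageWithConvBranch(nn.Module):
    """One stage of the improved backbone: a transformer path (twin Swin module +
    patch merging, supplied as `trans_path`) in parallel with a convolutional path
    (3x3 conv + channel attention); the two are added and passed through a 1x1 conv."""
    def __init__(self, trans_path, in_channels, out_channels):
        super().__init__()
        self.trans_path = trans_path                                   # e.g. twin Swin module + patch merging
        self.conv3x3 = nn.Conv2d(in_channels, out_channels, 3, stride=2, padding=1)
        self.attn = ChannelAttention(out_channels)
        self.conv1x1 = nn.Conv2d(out_channels, out_channels, 1)

    def forward(self, x):
        t = self.trans_path(x)            # transformer branch: (B, out_channels, H/2, W/2)
        c = self.attn(self.conv3x3(x))    # convolutional branch, downsampled to match
        return self.conv1x1(t + c)        # fused stage output
```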
6. The spare part detection method based on a multimodal Transformer according to claim 4, wherein the first twin Swin Transformer module, the second twin Swin Transformer module, the third twin Swin Transformer module and the fourth twin Swin Transformer module all adopt the same twin Swin Transformer structure, which comprises a first Swin Transformer module, a second Swin Transformer module and a ninth convolution layer, and the convolution kernel size of the ninth convolution layer is 1×1; the features input into the twin Swin Transformer module are divided into a first feature subset and a second feature subset by equidistant sampling, the first feature subset and the second feature subset are input into the first Swin Transformer module and the second Swin Transformer module respectively to obtain a first result subset and a second result subset, the first result subset and the second result subset are added to obtain a result subset, and the result subset is input into the ninth convolution layer to obtain the output of the twin Swin Transformer module.
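For illustration, a minimal sketch of such a twin Swin Transformer module follows; it assumes two Swin Transformer blocks (constructed elsewhere, e.g. from an existing Swin implementation) that operate on (B, C, H, W) feature maps, and it interprets equidistant sampling as interleaved sampling along the width, neither of which is fixed by the claim.

```python
import torch
import torch.nn as nn

class TwinSwinModule(nn.Module):
    """Sketch of claim 6: split the input into two interleaved feature subsets,
    run each through its own Swin Transformer block, add the two result subsets
    and fuse them with a 1x1 convolution (the ninth convolution layer)."""
    def __init__(self, swin_block_a, swin_block_b, channels):
        super().__init__()
        self.block_a = swin_block_a
        self.block_b = swin_block_b
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        # x: (B, C, H, W) with an even width so both subsets have the same shape
        xa, xb = x[..., 0::2], x[..., 1::2]           # first / second feature subset (equidistant sampling)
        ya, yb = self.block_a(xa), self.block_b(xb)   # first / second result subset
        return self.conv1x1(ya + yb)                  # output of the twin Swin Transformer module
```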
7. The spare part detection method based on a multimodal Transformer according to claim 1, wherein a triplet loss function is adopted in the training process of the spare part detection model, and is given by:
L = max(‖f(a) − f(p)‖₂² − ‖f(a) − f(n)‖₂² + α, 0)
wherein ‖·‖₂² denotes the square of the Euclidean norm, max(x, 0) denotes the larger of x and 0, f(·) denotes the mapping of an input image and spare part weight to a feature vector, α is a preset margin controlling the minimum threshold of the distance difference between positive and negative samples, a denotes the anchor sample, p denotes the positive sample, and n denotes the negative sample.
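For illustration, the loss can be written in PyTorch as follows; averaging over the batch is an assumption of this sketch. PyTorch's built-in nn.TripletMarginLoss is similar but uses non-squared distances by default.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """anchor, positive, negative: (B, D) feature vectors f(a), f(p), f(n); alpha is the preset margin."""
    pos_dist = (anchor - positive).pow(2).sum(dim=1)   # ||f(a) - f(p)||_2^2
    neg_dist = (anchor - negative).pow(2).sum(dim=1)   # ||f(a) - f(n)||_2^2
    return F.relu(pos_dist - neg_dist + alpha).mean()  # max(., 0), averaged over the batch
```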
8. A spare part detection apparatus based on a multimodal Transformer, characterized by comprising:
a data acquisition module configured to acquire the weight of a spare part and a plurality of images of the spare part taken from different angles, and to stack the images to obtain an input image;
a model construction module configured to construct a spare part detection model and train it to obtain a trained spare part detection model, wherein the spare part detection model comprises a multimodal feature fusion module, an improved Swin Transformer module, a global average pooling layer and a fully connected layer which are connected in sequence, the input image and the weight are input into the trained spare part detection model, the input image and the weight are fused by the multimodal feature fusion module to obtain a multimodal feature map, the multimodal feature map is input into the improved Swin Transformer module for feature extraction to obtain fused features, and the fused features are input into the global average pooling layer and the fully connected layer to obtain a feature vector;
a database building module configured to input the input images and the weights of spare parts of known models into the trained spare part detection model to obtain the corresponding feature vectors, and to build a spare part feature vector database;
a comparison module configured to input the input image and the weight of a spare part to be detected into the trained spare part detection model to obtain the feature vector of the spare part to be detected, to compare the feature vector of the spare part to be detected with the feature vectors in the spare part feature vector database, and to detect the model of the spare part to be detected.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-7.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-7.
CN202311546437.0A 2023-11-20 2023-11-20 Part detection method and device based on multimode transducer and readable medium Active CN117274253B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311546437.0A CN117274253B (en) 2023-11-20 2023-11-20 Part detection method and device based on multimode transducer and readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311546437.0A CN117274253B (en) 2023-11-20 2023-11-20 Part detection method and device based on multimode transducer and readable medium

Publications (2)

Publication Number Publication Date
CN117274253A true CN117274253A (en) 2023-12-22
CN117274253B CN117274253B (en) 2024-02-27

Family

ID=89210835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311546437.0A Active CN117274253B (en) 2023-11-20 2023-11-20 Part detection method and device based on multimode transducer and readable medium

Country Status (1)

Country Link
CN (1) CN117274253B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210398865A1 (en) * 2019-05-21 2021-12-23 Lite-On Opto Technology (Changzhou) Co., Ltd. Light emitting device
CN114820558A (en) * 2022-05-13 2022-07-29 宁波大学 Automobile part detection method and device, electronic equipment and computer readable medium
CN115861281A (en) * 2022-12-26 2023-03-28 浙江工业大学 Anchor-frame-free surface defect detection method based on multi-scale features
CN116051532A (en) * 2023-02-13 2023-05-02 东华大学 Deep learning-based industrial part defect detection method and system and electronic equipment
CN116310967A (en) * 2023-02-28 2023-06-23 淮阴工学院 Chemical plant safety helmet wearing detection method based on improved YOLOv5
CN116524313A (en) * 2023-04-28 2023-08-01 中科超睿(青岛)技术有限公司 Defect detection method and related device based on deep learning and multi-mode image
CN116630394A (en) * 2023-07-25 2023-08-22 山东中科先进技术有限公司 Multi-mode target object attitude estimation method and system based on three-dimensional modeling constraint

Also Published As

Publication number Publication date
CN117274253B (en) 2024-02-27

Similar Documents

Publication Publication Date Title
CN111860573B (en) Model training method, image category detection method and device and electronic equipment
CN109214343B (en) Method and device for generating face key point detection model
CN109508681B (en) Method and device for generating human body key point detection model
CN108520220B (en) Model generation method and device
CN110866471A (en) Face image quality evaluation method and device, computer readable medium and communication terminal
US11270099B2 (en) Method and apparatus for generating facial feature
CN112699991A (en) Method, electronic device, and computer-readable medium for accelerating information processing for neural network training
CN109871800B (en) Human body posture estimation method and device and storage medium
US20210042504A1 (en) Method and apparatus for outputting data
US11490168B2 (en) Method and apparatus for selecting video clip, server and medium
CN111666416B (en) Method and device for generating semantic matching model
CN108427941B (en) Method for generating face detection model, face detection method and device
CN109272543B (en) Method and apparatus for generating a model
CN111414915B (en) Character recognition method and related equipment
CN109583389B (en) Drawing recognition method and device
CN111260774A (en) Method and device for generating 3D joint point regression model
CN112712069A (en) Question judging method and device, electronic equipment and storage medium
CN109345460B (en) Method and apparatus for rectifying image
EP3907697A1 (en) Method and apparatus for acquiring information
CN111950570A (en) Target image extraction method, neural network training method and device
CN112734910A (en) Real-time human face three-dimensional image reconstruction method and device based on RGB single image and electronic equipment
CN115346239A (en) Human body posture estimation method and device, electronic equipment and storage medium
CN111340015A (en) Positioning method and device
CN107992873A (en) Object detection method and device, storage medium, electronic equipment
US20220270228A1 (en) Method and apparatus for obtaining information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant