CN117036891A - Cross-modal feature fusion-based image recognition method and system - Google Patents
- Publication number: CN117036891A
- Application number: CN202311063209.8A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V10/811 — Fusion of classification results where the classifiers operate on different input data, e.g. multi-modal recognition
- G06V10/806 — Fusion of extracted features
- G06V10/80 — Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
- G06V10/40 — Extraction of image or video features
- G06V10/766 — Recognition using regression, e.g. by projecting features on hyperplanes
- G06V10/82 — Recognition using neural networks
- G06N3/045 — Combinations of networks
- G06N3/048 — Activation functions
- G06N3/0499 — Feedforward networks
- G06N3/08 — Learning methods
- Y02T10/40 — Engine management systems
Abstract
The invention discloses an image recognition method and system based on cross-modal feature fusion. The method comprises the following steps: acquiring an RGB image and a depth image of a shooting object; recognizing the RGB image and the depth image with a cross-modal feature fusion model, identifying the image units of a plurality of targets to be recognized in the shooting object, and acquiring the type and state information of each target to be recognized according to its image unit. The cross-modal feature fusion model extracts features at multiple levels from the RGB image and the depth image, fuses the complementary semantic information between the RGB and depth features using self-attention, cross (interleaved) attention and multi-head attention mechanisms, and fuses the multi-scale features step by step. By introducing a depth image captured by a depth camera as a second modality and performing recognition with the improved model, the method meets the target segmentation requirements for power distribution cabinet components in a dynamic environment.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to an image recognition method and system based on cross-modal feature fusion.
Background
The power distribution cabinet is a vital device in the power system: it distributes, controls and protects electric energy, ensures the safe operation of the power system, provides a stable and reliable power supply, and protects circuits and equipment from power faults. Robot technology plays an important role in operating the power distribution cabinet: through precise movement and control, robots perform tasks such as fault detection and component operation, greatly improving working efficiency and safety. Computer vision gives the robot enhanced sensing and recognition capability; with it, the robot can accurately recognize equipment, connectors and the like in the power distribution cabinet and acquire related data and image information, providing important guidance and support for robot operation. However, the working environment of the power distribution cabinet is usually closed and dynamic, and using a robot to replace a worker for daily operations requires the model to cope with the challenges of a dynamic environment, such as shadow occlusion, insufficient light, and low resolution. Under these conditions it is difficult for an algorithm that uses only visible light to achieve high precision, so a depth camera is used to provide both a visible-light image and a depth image; fusing the complementarity of the different modalities improves the perceptibility, reliability and robustness of the target positioning and segmentation algorithm.
With the development of convolutional neural networks, CNN-based two-stream networks have emerged for target detection and segmentation. In past work, modal fusion mechanisms, however designed, were implemented on convolutional neural networks, for example the RGBD image semantic segmentation method based on cross-modal learning and domain adaptation (CN114419323A). A CNN has strong representation and learning capability within a single modality: through convolution operations it effectively captures local features in the input data, handles spatially structured data such as images well, and offers local perception and multi-level feature representation. Compared with a CNN, a Transformer model is based on global attention and the correlation between keys and queries, so it can model long-range dependencies and capture global information. Combining a convolutional neural network with a Transformer allows local and global information to be considered simultaneously, extracts strong feature representations, and addresses the long-range dependency problem; for example, the abdominal CT image multi-organ segmentation method, device and terminal equipment (CN116030259A) combines a CNN with a Transformer and applies them to the field of target detection.
However, among the above solutions, the RGBD image semantic segmentation method based on cross-modal learning and domain adaptation (CN114419323A) uses a convolutional neural network model with good segmentation performance, but the locality of the convolution operation makes it difficult for the model to learn long-range dependencies in the image beyond the receptive field, which limits its ability to handle details such as texture, shape and size changes. Convolutional neural networks may therefore face challenges in image tasks with long-range dependencies and may be unable to capture the global features and context information of the image.
The abdominal CT image multi-organ segmentation method, device and terminal equipment (CN116030259A) uses a vision-Transformer-based model that can model the global image information through a self-attention mechanism, and the overall model improves target segmentation precision through its multi-scale global semantic feature extraction capability. However, the single-modality Transformer model is applied in a single, fixed scene and has limitations when segmenting targets in a real scene: the real-world environment is usually open and dynamic, with shadow occlusion, over- and under-exposure, and low resolution, under which a single-modality segmentation algorithm struggles to reach high segmentation accuracy.
To meet the requirement of segmenting power distribution cabinet components in a dynamic environment, a depth image captured by a depth camera is introduced as a second modality: a CNN extracts the features of each modality at different scales, and a Transformer module performs complementary fusion between the modalities. This improves the perceptibility, reliability and robustness of the target positioning and segmentation algorithm, while the model has a small structure and low resource consumption and is easy to deploy on edge devices.
Disclosure of Invention
Embodiments of the present invention aim to provide an image recognition method and system based on cross-modal feature fusion that meet the requirement of segmenting power distribution cabinet components in a dynamic environment. A depth image captured by a depth camera is introduced as a second modality, a CNN extracts the features of each modality at different scales, and a Transformer module then performs complementary fusion between the modalities; the resulting model has a small structure and low resource consumption and is easy to deploy on edge devices.
In order to solve the technical problem, a first aspect of the embodiments of the present invention provides an image recognition method based on cross-modal feature fusion, including the following steps:
Acquiring an RGB image and a depth image of a shooting object;
recognizing the RGB image and the depth image based on the cross-modal feature fusion model, identifying the image units of a plurality of targets to be recognized in the shooting object, and acquiring the type and state information of each target to be recognized according to its image unit;
the cross-modal feature fusion model performs feature extraction on the RGB image and the depth image to obtain features at multiple levels, fuses the complementary semantic information between the RGB and depth features using self-attention, cross (interleaved) attention and multi-head attention mechanisms, and fuses the multi-scale features step by step.
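The attention-based fusion described above can be illustrated with a minimal numpy sketch; this is not the patent's implementation, and the token count, embedding size and random features are invented for demonstration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
n_tokens, d = 16, 32                          # illustrative sequence length and width
X_rgb = rng.standard_normal((n_tokens, d))    # flattened RGB feature tokens
X_d   = rng.standard_normal((n_tokens, d))    # flattened depth feature tokens

# Self-attention: queries, keys and values all come from the same modality.
Z_sa_rgb = attention(X_rgb, X_rgb, X_rgb)
# Cross (interleaved) attention: depth queries attend to RGB keys/values,
# letting complementary semantic information flow between the modalities.
Z_ca_rgb = attention(X_d, X_rgb, X_rgb)

print(Z_sa_rgb.shape, Z_ca_rgb.shape)
```

A multi-head version would split the width `d` into several subspaces, run this attention once per head, and concatenate the results.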
Further, before the identifying the RGB image and the depth image based on the cross-modal feature fusion model, the method further includes:
acquiring historical image data of the shooting object under various shooting conditions, wherein the historical image data comprises: a plurality of historical RGB images of the shooting object and corresponding historical depth images;
and based on the historical image data of a preset proportion, performing recognition training on the target to be recognized on the cross-modal feature fusion model.
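The "preset proportion" step above can be sketched as a simple train/validation split over paired RGB/depth samples; the file names and the 80/20 ratio are illustrative assumptions, not values from the patent:

```python
import random

def split_dataset(pairs, train_ratio=0.8, seed=42):
    """Split (rgb, depth) sample pairs into training and validation sets."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)        # deterministic shuffle for reproducibility
    k = int(len(pairs) * train_ratio)
    return pairs[:k], pairs[k:]

data = [(f"rgb_{i}.png", f"depth_{i}.png") for i in range(10)]
train, val = split_dataset(data)
print(len(train), len(val))
```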
Further, the cross-modal feature fusion model includes: a Backbone part, a Neck part, and a Head part;
the Backbone part receives the RGB image and the depth image respectively, extracts features at multiple scales from each through convolution modules, obtains multi-scale feature maps after fusion through a plurality of corresponding feature fusion modules, and sends the feature maps to the Neck part through respective channel attention modules;
the Neck part extracts and fuses the multi-scale features output by the channel attention modules, and sends the fused features to the Head part;
and the Head part determines the segmentation area of the object to be identified according to the characteristics.
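The channel attention module is not specified further here; a squeeze-and-excitation style sketch (the random weights and reduction ratio are placeholders, not the patent's parameters) shows the typical way channels are reweighted before the features reach the Neck:

```python
import numpy as np

def channel_attention(F, reduction=4):
    """Squeeze-and-excitation style channel attention over a C x H x W feature map."""
    C = F.shape[0]
    rng = np.random.default_rng(1)
    W1 = rng.standard_normal((C // reduction, C)) * 0.1   # squeeze weights (illustrative)
    W2 = rng.standard_normal((C, C // reduction)) * 0.1   # excite weights (illustrative)
    s = F.mean(axis=(1, 2))                 # global average pool -> (C,)
    h = np.maximum(W1 @ s, 0)               # ReLU
    w = 1.0 / (1.0 + np.exp(-(W2 @ h)))     # sigmoid channel weights in (0, 1)
    return F * w[:, None, None]             # reweight each channel

F = np.ones((8, 4, 4))
out = channel_attention(F)
print(out.shape)
```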
Further, the Backbone part includes a first branch that receives the RGB image and a second branch that receives the depth image;
the first branch and the second branch are respectively provided with a plurality of corresponding picture feature extraction units, and the picture feature extraction units comprise: a Conv module, a C3 module and/or an SPPF module;
the first branch and the second branch are provided with a TransSACA module corresponding to the picture feature extraction unit, the TransSACA module respectively receives the features extracted from the picture feature extraction unit corresponding to the first branch and the second branch, and the features are respectively sent to the corresponding branch after feature fusion.
Further, the TransSACA module adopts a multi-modal feature fusion mechanism: its first input is the RGB image convolution feature map and its second input is the depth (D) image convolution feature map; the two feature maps are each flattened and reshaped into matrix sequences, and position embeddings are added to obtain the input sequences of the Transformer module;
based on the input sequences of the Transformer module, the self-attention mechanism computes attention weights from Q_RGB and K_RGB and multiplies them by V_RGB (and symmetrically for the depth branch) to obtain the outputs Z_saRGB and Z_saD, while the cross-attention mechanism computes attention weights from Q_D and K_RGB and multiplies them by V_RGB (and symmetrically with the roles of the modalities exchanged) to obtain the outputs Z_caRGB and Z_caD;
the sequences are then processed by a multi-layer perceptron consisting of a two-layer fully connected feed-forward network with a GELU activation in between, producing outputs X_OUT_RGB and X_OUT_D whose dimensions match the input sequences; the outputs are reshaped into C × H × W feature maps F_OUT_RGB and F_OUT_D and fed back into each individual modality branch by element-wise summation with the existing feature maps.
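Assuming the description above follows the usual flatten → attend → feed-forward → reshape pipeline, the RGB side of the TransSACA flow can be sketched in numpy; the dimensions, weights and position embedding are illustrative placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def gelu(x):  # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

C, H, W = 8, 4, 4
rng = np.random.default_rng(0)
F_rgb = rng.standard_normal((C, H, W))            # RGB convolution feature map
F_d   = rng.standard_normal((C, H, W))            # depth convolution feature map
pos   = rng.standard_normal((H * W, C)) * 0.02    # position embedding (learned in practice)

# 1. Flatten C x H x W into an (H*W) x C token sequence and add the position embedding.
X_rgb = F_rgb.reshape(C, H * W).T + pos
X_d   = F_d.reshape(C, H * W).T + pos

# 2. Self-attention inside the RGB branch (Q, K, V all from RGB).
Z_sa = attention(X_rgb, X_rgb, X_rgb)
# 3. Cross-attention: depth queries attend to RGB keys/values.
Z_ca = attention(X_d, X_rgb, X_rgb)

# 4. Two-layer fully connected feed-forward network with GELU in between.
W1 = rng.standard_normal((C, 2 * C)) * 0.1
W2 = rng.standard_normal((2 * C, C)) * 0.1
X_out = gelu((Z_sa + Z_ca) @ W1) @ W2             # same shape as the input sequence

# 5. Reshape back to C x H x W and fuse into the branch by element summation.
F_out = X_out.T.reshape(C, H, W)
F_fused = F_rgb + F_out
print(F_fused.shape)
```

The depth branch would mirror these steps with the modality roles exchanged.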
Further, the loss function of the Head part is a composite loss;
wherein the composite loss is the sum of the bounding box regression loss, the confidence loss, the classification loss, and the mask regression loss.
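The exact form of each term is not given here (detection heads of this kind often use an IoU-based box loss and binary cross-entropy for the rest); a toy sketch with an L1 box term and BCE placeholders shows how the four losses could be summed:

```python
import numpy as np

def bce(p, y, eps=1e-7):
    # Binary cross-entropy, averaged over elements.
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def box_l1(pred, target):
    # Placeholder box regression term (a real model might use CIoU instead).
    return float(np.abs(pred - target).mean())

# Illustrative predictions/targets for a single anchor.
box_loss  = box_l1(np.array([0.1, 0.1, 0.5, 0.5]), np.array([0.0, 0.0, 0.5, 0.5]))
conf_loss = bce(np.array([0.9]), np.array([1.0]))
cls_loss  = bce(np.array([0.8, 0.1]), np.array([1.0, 0.0]))
mask_loss = bce(np.full(16, 0.7), np.ones(16))

total = box_loss + conf_loss + cls_loss + mask_loss
print(round(total, 4))
```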
Further, after identifying the image units of the plurality of targets to be recognized in the shooting object, the method further includes:
segmenting the image units according to the recognition result to obtain image data of the plurality of targets to be recognized;
resizing the image data of the plurality of targets to be recognized to a preset size;
and acquiring the type and state information of each target to be recognized based on its image data at the preset size.
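The segmentation-then-resize step can be sketched with a crop plus a nearest-neighbour resize; the box coordinates and the 8×8 preset size are illustrative:

```python
import numpy as np

def crop_and_resize(img, box, size):
    """Crop an (H, W, C) image to box = (y0, y1, x0, x1), then resize to size x size."""
    y0, y1, x0, x1 = box
    crop = img[y0:y1, x0:x1]
    h, w = crop.shape[:2]
    ys = (np.arange(size) * h // size).clip(0, h - 1)   # nearest source rows
    xs = (np.arange(size) * w // size).clip(0, w - 1)   # nearest source cols
    return crop[ys][:, xs]

img = np.arange(20 * 30 * 3).reshape(20, 30, 3)         # stand-in for a camera frame
unit = crop_and_resize(img, (2, 12, 5, 25), 8)          # one recognized component region
print(unit.shape)
```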
Accordingly, a second aspect of the embodiments of the present invention provides an image recognition system based on cross-modal feature fusion, including:
an image acquisition module for acquiring an RGB image and a depth image of a photographic subject;
the image recognition module is used for recognizing the RGB image and the depth image based on the cross-modal feature fusion model, identifying the image units of a plurality of targets to be recognized in the shooting object, and acquiring the type and state information of each target to be recognized according to its image unit;
the cross-modal feature fusion model performs feature extraction on the RGB image and the depth image to obtain features at multiple levels, fuses the complementary semantic information between the RGB and depth features using self-attention, cross (interleaved) attention and multi-head attention mechanisms, and fuses the multi-scale features step by step.
Further, the image recognition system based on cross-modal feature fusion further comprises: a model training module, the model training module comprising:
a history data acquiring unit configured to acquire history image data of the photographic subject under various photographic conditions, the history image data including: a plurality of historical RGB images of the shooting object and corresponding historical depth images;
the model identification training unit is used for carrying out identification training on the target to be identified on the cross-modal feature fusion model based on the historical image data of the preset proportion.
Further, the cross-modal feature fusion model includes: a Backbone part, a Neck part, and a Head part;
the Backbone part receives the RGB image and the depth image respectively, extracts features at multiple scales from each through convolution modules, obtains multi-scale feature maps after fusion through a plurality of corresponding feature fusion modules, and sends the feature maps to the Neck part through respective channel attention modules;
the Neck part extracts and fuses the multi-scale features output by the channel attention modules, and sends the fused features to the Head part;
And the Head part determines the segmentation area of the object to be identified according to the characteristics.
Further, the Backbone part includes a first branch that receives the RGB image and a second branch that receives the depth image;
the first branch and the second branch are respectively provided with a plurality of corresponding picture feature extraction units, and the picture feature extraction units comprise: a Conv module, a C3 module and/or an SPPF module;
the first branch and the second branch are provided with a TransSACA module corresponding to the picture feature extraction unit, the TransSACA module respectively receives the features extracted from the picture feature extraction unit corresponding to the first branch and the second branch, and the features are respectively sent to the corresponding branch after feature fusion.
Further, the TransSACA module adopts a multi-modal feature fusion mechanism: its first input is the RGB image convolution feature map and its second input is the depth (D) image convolution feature map; the two feature maps are each flattened and reshaped into matrix sequences, and position embeddings are added to obtain the input sequences of the Transformer module;
based on the input sequences of the Transformer module, the self-attention mechanism computes attention weights from Q_RGB and K_RGB and multiplies them by V_RGB (and symmetrically for the depth branch) to obtain the outputs Z_saRGB and Z_saD, while the cross-attention mechanism computes attention weights from Q_D and K_RGB and multiplies them by V_RGB (and symmetrically with the roles of the modalities exchanged) to obtain the outputs Z_caRGB and Z_caD;
the sequences are then processed by a multi-layer perceptron consisting of a two-layer fully connected feed-forward network with a GELU activation in between, producing outputs X_OUT_RGB and X_OUT_D whose dimensions match the input sequences; the outputs are reshaped into C × H × W feature maps F_OUT_RGB and F_OUT_D and fed back into each individual modality branch by element-wise summation with the existing feature maps.
Further, the loss function of the Head part is a composite loss;
wherein the composite loss is the sum of the bounding box regression loss, the confidence loss, the classification loss, and the mask regression loss.
Further, the image recognition module includes:
the image segmentation unit is used for segmenting the image unit according to the identification result to obtain image data of a plurality of targets to be identified;
the image adjusting unit is used for adjusting the sizes of the image data of the plurality of targets to be identified to a preset size;
an information acquisition unit for acquiring the type and state information of each target to be recognized based on its image data at the preset size.
Accordingly, a third aspect of the embodiments of the present invention provides an electronic device, including: at least one processor; and a memory coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the above image recognition method based on cross-modal feature fusion.
Accordingly, a fourth aspect of embodiments of the present invention provides a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the above-described cross-modality feature fusion-based image recognition method.
The technical scheme provided by the embodiment of the invention has the following beneficial technical effects:
to meet the requirement of segmenting power distribution cabinet components in a dynamic environment, a depth image captured by a depth camera is introduced as a second modality: a CNN extracts the features of each modality at different scales, and a Transformer module performs complementary fusion between the modalities. This improves the perceptibility, reliability and robustness of the target positioning and segmentation algorithm, while the model has a small structure and low resource consumption and is easy to deploy on edge devices.
Drawings
FIG. 1 is a flowchart of an image recognition method based on cross-modal feature fusion provided by an embodiment of the invention;
fig. 2 is a flowchart for identifying components of a power distribution cabinet according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a cross-modal feature fusion model provided by an embodiment of the present invention;
FIG. 4 is a flow chart of a TransSACA module provided by an embodiment of the present invention;
FIG. 5a is a segmentation result at viewing angle 1 using RGB-only input (prior art);
FIG. 5b is a segmentation result at viewing angle 2 using RGB-only input (prior art);
FIG. 5c is a segmentation result at viewing angle 3 using RGB-only input (prior art);
FIG. 5d is a segmentation result at viewing angle 1 using the CBAM method (prior art);
FIG. 5e is a segmentation result at viewing angle 2 using the CBAM method (prior art);
FIG. 5f is a segmentation result at viewing angle 3 using the CBAM method (prior art);
FIG. 5g is a segmentation result at viewing angle 1 using the CFT method (prior art);
FIG. 5h is a segmentation result at viewing angle 2 using the CFT method (prior art);
FIG. 5i is a segmentation result at viewing angle 3 using the CFT method (prior art);
FIG. 5j is a segmentation result at viewing angle 1 using the cross-modal feature fusion method of the present invention;
FIG. 5k is a segmentation result at viewing angle 2 using the cross-modal feature fusion method of the present invention;
FIG. 5l is a segmentation result at viewing angle 3 using the cross-modal feature fusion method of the present invention;
FIG. 6 is a block diagram of an image recognition system based on cross-modality feature fusion provided by an embodiment of the present invention;
FIG. 7 is a block diagram of an image recognition module provided by an embodiment of the present invention;
FIG. 8 is a block diagram of a model training module provided by an embodiment of the present invention.
Reference numerals:
1. image acquisition module; 2. image recognition module; 21. image segmentation unit; 22. image adjustment unit; 23. information acquisition unit; 3. model training module; 31. historical data acquisition unit; 32. model recognition training unit.
Detailed Description
The objects, technical solutions and advantages of the present invention will become more apparent by the following detailed description of the present invention with reference to the accompanying drawings. It should be understood that the description is only illustrative and is not intended to limit the scope of the invention. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the present invention.
Referring to fig. 1 and 2, a first aspect of the embodiment of the present invention provides an image recognition method based on cross-modal feature fusion, which includes the following steps:
Step S100, an RGB image and a depth image of a photographing object (i.e., a power distribution cabinet) are acquired.
In an alternative form of the embodiment of the present invention, the identifiable power distribution cabinet components include: touch screen, temperature controller, emergency stop (scram) switch, red self-locking switch, yellow self-locking switch, rocker switch, high-voltage terminal board, air switch handle, air switch base, changeover switch handle, changeover switch base, load switch 1 handle, load switch 1 base, load switch 2 handle, load switch 2 base, door lock handle, lock cylinder, voltmeter, indicator light, rotary switch handle, rotary switch base, green self-reset switch, white self-reset switch.
Step S300, identifying RGB images and depth images based on the cross-modal feature fusion model, identifying image units of a plurality of targets to be identified (namely power distribution cabinet components) in the shooting object, and acquiring the types and state information of the targets to be identified according to the image units of the targets to be identified.
The cross-modal feature fusion model is used for carrying out feature extraction on the RGB image and the depth image, obtaining features of multiple levels of the RGB image and the depth image, and utilizing a self-attention mechanism, a staggered attention mechanism and a multi-head attention mechanism to fuse complementary semantic information between the features of the RGB image and the depth image, so as to fuse the features of multiple scales step by step.
Further, before identifying the RGB image and the depth image based on the cross-modal feature fusion model in step S300, the method further includes:
Step S210, acquiring historical image data of the shooting object under various shooting conditions, where the historical image data includes: several historical RGB images of the shooting object and corresponding historical depth images.
Step S220, based on historical image data of a preset proportion, recognition training of the target to be recognized is conducted on the cross-modal feature fusion model.
A large number of historical photographs of the power distribution cabinet components are acquired in advance and preprocessed into input pictures of uniform size to construct a data set; the historical photographs are randomly shuffled and split 8:1:1 into a training set, a validation set, and a test set, and the model is trained. The dual-modality branches extract image features of the RGB and D modalities at different scales. A multi-scale Transformer segmentation model is built to fuse images at different scales, and the output feature map is fed back to the original branch so as to enhance the features of that branch. During model training and parameter tuning, the loss function is optimized: the SIoU loss function replaces the original CIoU loss function, which improves the training convergence speed and the segmentation accuracy. The model is trained according to a preset scheme to obtain the weight file of the converged model.
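The 8:1:1 random split described above can be sketched as follows; the file-name pattern, seed, and dataset size are illustrative assumptions, not details from this disclosure:

```python
import random

def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Randomly shuffle RGB-D sample pairs and split them 8:1:1
    into training, validation, and test subsets."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)       # reproducible shuffle
    n = len(shuffled)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# hypothetical RGB/depth file pairs
pairs = [(f"rgb_{i:04d}.png", f"depth_{i:04d}.png") for i in range(100)]
train, val, test = split_dataset(pairs)
```

Each RGB image stays paired with its depth image because the pair, not the individual file, is the unit being shuffled.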
Specifically, referring to fig. 3, the cross-modal feature fusion model includes: a Backbone portion, a Neck portion, and a Head portion. The Backbone portion receives the RGB image and the depth image respectively, extracts features of the RGB image and the depth image at multiple scales through convolution modules, obtains feature maps at multiple scales after feature fusion through a plurality of corresponding feature fusion modules, and sends the feature maps to the Neck portion through channel attention modules respectively; the Neck portion extracts the features output by the channel attention modules, performs fusion processing across scales, and sends the fused features to the Head portion; the Head portion determines the segmented region of the object to be identified according to the features.
Further, the Backbone portion includes a first branch that receives the RGB image and a second branch that receives the depth image; the first branch and the second branch are each provided with a plurality of corresponding picture feature extraction units, and each picture feature extraction unit includes: a Conv module, a C3 module and/or an SPPF module; the first branch and the second branch are provided with a TransSACA module corresponding to each picture feature extraction unit; the TransSACA module receives the features extracted by the corresponding picture feature extraction units of the first branch and the second branch respectively, and after fusion the features are sent back to the corresponding branches respectively.
The Conv module consists of a convolution layer, a normalization layer, and an activation function: local spatial information is extracted by the convolution operation, the feature value distribution is normalized by the BN layer, and nonlinear transformation capability is introduced by the activation function, thereby realizing conversion and extraction of the input features. The C3 module improves feature extraction by increasing the convolution depth and the receptive field. The SPPF module performs pooling operations of different sizes on the input feature maps to obtain a group of feature maps of different sizes, concatenates them, and reduces the dimension through a fully connected layer to obtain a feature vector of fixed size.
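The Conv module's pipeline (convolution, then batch normalization, then activation) can be sketched on a single-channel map; SiLU is used here as an illustrative activation and the valid-padding loop convolution is a simplification, not the patented implementation:

```python
import numpy as np

def conv_bn_silu(x, kernel, eps=1e-5):
    """Sketch of a Conv module: 2-D convolution (valid padding),
    batch normalization over the output map, then SiLU activation."""
    kh, kw = kernel.shape
    h, w = x.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):               # sliding-window convolution
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    out = (out - out.mean()) / np.sqrt(out.var() + eps)   # BN over one map
    return out * (1.0 / (1.0 + np.exp(-out)))             # SiLU: x * sigmoid(x)

x = np.random.default_rng(0).normal(size=(8, 8))
y = conv_bn_silu(x, np.ones((3, 3)) / 9.0)      # 3x3 mean filter as example kernel
```

A real Conv module learns the kernel and BN statistics per channel; this sketch only shows the order of the three operations.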
The Neck portion adopts an FPN feature pyramid: feature maps of different scales are fused through upsampling and downsampling operations to generate a multi-scale feature pyramid. Top-down, fusion of features at different levels is realized through upsampling and fusion with coarser-granularity feature maps; bottom-up, the feature maps at different levels are then fused through convolution layers.
Further, referring to fig. 4, the TransSACA module adopts a multi-modal feature fusion mechanism. The first input is the RGB image convolution feature map and the second input is the D image convolution feature map; each is flattened and reshaped into a matrix sequence, and the input sequence of the Transformer module is obtained after adding the position embedding. Based on the input sequence of the Transformer module, the self-attention mechanism uses the dot product of Q_RGB and K_RGB to calculate the attention weight, which is then multiplied by V_RGB to obtain the output Z_saRGB (and likewise Z_saD); the cross-attention mechanism uses the dot product of Q_D and K_RGB to calculate the attention weight, which is then multiplied by V_RGB to obtain the output Z_caRGB (and likewise Z_caD). Processing based on a multi-layer perceptron model then follows, comprising a two-layer fully connected feed-forward network with a GELU activation function in between, computing the outputs X_OUT^RGB and X_OUT^D, whose dimensions are the same as those of the input sequence; the outputs are reshaped into C×H×W feature mappings F_OUT^RGB and F_OUT^D and fed back into each individual modality branch using element-wise summation with the existing feature mappings.
Specifically, the TransSACA module is a multi-modal feature fusion mechanism that uses the self-attention and cross-attention of a Transformer to combine the global contexts of the RGB modality and the D modality, exploiting their complementarity. Each branch of the module receives as input a sequence of discrete tokens, each token comprising a feature vector representation; the feature vector is supplemented by a position code to incorporate a position-induced bias. As shown in the figure, F_IN^RGB ∈ R^(C×H×W) is the RGB image convolution feature map and F_IN^D ∈ R^(C×H×W) is the D image convolution feature map, where C denotes the number of channels, H the picture height, and W the picture width; they are obtained by convolutional feature extraction from the RGB map and the D map, respectively:
F_IN^RGB = Φ_RGB(I_RGB), F_IN^D = Φ_D(I_D);
where I_RGB and I_D are the input RGB map and D map respectively, and Φ_RGB and Φ_D are the convolution modules applied to generate the feature mappings of the input images of the different modalities. Each feature map is then flattened and reshaped into a matrix sequence; after adding the position embedding, the Transformer input sequences X_IN^RGB ∈ R^(HW×C) and X_IN^D ∈ R^(HW×C) are obtained. A set of queries, keys, and values (Q, K, and V) for RGB and D is computed using linear projections, for example Q_RGB = X_IN^RGB · W_RGB^Q, K_RGB = X_IN^RGB · W_RGB^K, V_RGB = X_IN^RGB · W_RGB^V (and similarly for D);
where W_RGB^Q, W_D^Q ∈ R^(C×D_Q), W_RGB^K, W_D^K ∈ R^(C×D_K), W_RGB^V, W_D^V ∈ R^(C×D_V) are weight matrices, and in this module D_Q = D_K = D_V = C. Each attention head uses the dot product of Q_(·) and K_(·) to calculate the attention weight, which is then multiplied by V_(·) to obtain the output Z_(·). For example, self-attention uses the dot product of Q_RGB and K_RGB to calculate the attention weight, which is then multiplied by V_RGB to obtain the output Z_saRGB; Z_saD is obtained similarly. Cross-attention uses the dot product of Q_D and K_RGB to calculate the attention weight, which is then multiplied by V_RGB to obtain the output Z_caRGB; Z_caD is obtained similarly.
Here the number of the elements is the number,is a scaling factor to prevent excessive results of dot product generation from softmaThe x-function produces a smaller gradient for controlling the magnitude and stability of the attention weights, and the multi-headed attention of self-attention and cross-attention can further improve the performance of the model by focusing differently on the features at different locations. The expression of multi-headed attention is as follows:
MultiHead(Q, K, V) = Concat(Z_1, …, Z_h) · W^O, where Z_i = Attention(Q · W_i^Q, K · W_i^K, V · W_i^V);
where h denotes the number of heads, Z_i denotes the attention output of the i-th head, and W^O, W_i^Q, W_i^K, W_i^V ∈ R^(C×C) are projection matrices.
More specifically, the self-attention module is used to build correlations inside a sequence, calculating the correlation of each element with the other elements of the sequence; here Q, K, and V come from the same input modality, which analyzes long-range dependencies and explores context information to further improve the modality-specific characteristics. Taking the input global feature X_IN^RGB as an example, the output self-attention global feature Z_saRGB can be expressed as follows: Z_saRGB = softmax(Q_RGB · K_RGB^T / √D_K) · V_RGB.
The cross-attention module is then used to process associations between different modalities or different inputs to reduce ambiguity: Q comes from the other input modality while K and V come from the same input modality, so that effective information exchange and fusion between the different modalities is established, and information transfer and complementation between the different modalities are promoted. In this module, a query obtained from the other input feature (e.g. Q_D) and a key in the self input feature (e.g. K_RGB) are used to calculate the correlation, expressed as follows: Z_caRGB = softmax(Q_D · K_RGB^T / √D_K) · V_RGB (and Z_caD analogously, with Q_RGB, K_D, and V_D).
Here, Z_saRGB and Z_saD are the outputs of the self-attention module; Q_RGB, K_RGB, and V_RGB are the related intermediate representations of the RGB image features, and Q_D, K_D, and V_D are the related intermediate representations of the D image features.
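The self-attention and cross-attention computations above can be sketched in numpy; the dimensions, random weights, and single-head form are illustrative assumptions rather than the patented configuration:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (L_q, L_k) logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
HW, C = 64, 16                                # flattened H*W tokens, C channels
X_rgb = rng.normal(size=(HW, C))              # X_IN^RGB token sequence
X_d = rng.normal(size=(HW, C))                # X_IN^D token sequence
W_q, W_k, W_v = (rng.normal(size=(C, C)) for _ in range(3))
Q_rgb, K_rgb, V_rgb = X_rgb @ W_q, X_rgb @ W_k, X_rgb @ W_v
Q_d = X_d @ W_q

Z_sa_rgb = attention(Q_rgb, K_rgb, V_rgb)     # self-attention on RGB tokens
Z_ca_rgb = attention(Q_d, K_rgb, V_rgb)       # cross-attention: depth queries, RGB keys/values
```

Both outputs keep the (HW, C) token-sequence shape, which is what allows them to be reshaped back to C×H×W and summed into the modality branches.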
Finally, processing is carried out using an MLP comprising a two-layer fully connected feed-forward network with a GELU activation function in between, which computes the outputs X_OUT^RGB and X_OUT^D; their dimensions are the same as those of the input feature map, so they are added directly as supplemental information to the original modality branches.
Here, X_OUT^RGB and X_OUT^D have the same dimensions as the input sequence; the outputs are then reshaped into C×H×W feature maps F_OUT^RGB and F_OUT^D and fed back into each individual modality branch using element-wise summation with the existing feature mappings.
Because processing a high-resolution feature map is computationally very expensive, to reduce the computation of the Transformer on high-resolution feature maps, the high-resolution feature map is downsampled by average pooling to a fixed resolution of H = W = 8 and then passed as input to the module; the output is upsampled back to the original resolution using bilinear interpolation before the element-wise summation with the existing feature map.
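This downsample-then-restore step can be sketched as follows, assuming the map's sides are divisible by 8; the pooling and interpolation here are plain numpy stand-ins for the framework operations:

```python
import numpy as np

def avg_pool_to(x, out_h=8, out_w=8):
    """Average-pool a (H, W) map to a fixed (out_h, out_w) resolution
    (H and W are assumed divisible by the target size here)."""
    h, w = x.shape
    return x.reshape(out_h, h // out_h, out_w, w // out_w).mean(axis=(1, 3))

def bilinear_upsample(x, out_h, out_w):
    """Upsample a (h, w) map to (out_h, out_w) by bilinear interpolation."""
    h, w = x.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.clip(ys.astype(int), 0, h - 2)
    x0 = np.clip(xs.astype(int), 0, w - 2)
    dy = (ys - y0)[:, None]                    # fractional row offsets
    dx = (xs - x0)[None, :]                    # fractional column offsets
    a = x[y0][:, x0]; b = x[y0][:, x0 + 1]
    c = x[y0 + 1][:, x0]; d = x[y0 + 1][:, x0 + 1]
    return (a * (1 - dy) * (1 - dx) + b * (1 - dy) * dx
            + c * dy * (1 - dx) + d * dy * dx)

feat = np.arange(32 * 32, dtype=float).reshape(32, 32)
small = avg_pool_to(feat)                      # 8x8 input to the Transformer module
restored = bilinear_upsample(small, 32, 32)    # back to original resolution
```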
Further, the loss function of the Head portion is a bounding box regression loss function. Wherein the bounding box regression loss function comprises: the sum of the bounding box regression loss, confidence loss, classification loss, and mask regression loss.
Specifically, the Head portion introduces the new bounding box regression loss function SIoU, replacing the original CIoU loss function, which improves the convergence speed of model training and the inference accuracy. Compared with CIoU, SIoU takes the angle problem into account and is composed of an angle loss, a distance loss, a shape loss, and an overlap loss. The overall loss function includes the sum of the bounding box regression loss, the confidence loss, the classification loss, and the mask regression loss.
In the distance loss function Δ, (b_cx^gt, b_cy^gt) are the center coordinates of the ground-truth box and (b_cx, b_cy) are the center coordinates of the predicted box; C_w and C_h are the width and height of the minimum bounding rectangle of the ground-truth box and the predicted box, and ρ_x and ρ_y represent the distances between the center coordinates of the two boxes along each axis. In the angle loss function, c_w and c_h are the horizontal and vertical offsets between the center points of the ground-truth box and the predicted box, and σ denotes the distance between the two center points. In the shape loss function Ω, (w, h) and (w^gt, h^gt) are the width and height of the predicted box and the ground-truth box respectively, and θ controls the degree of attention paid to the shape loss. K indexes the output feature maps; S² and N denote the number of image grid cells in the prediction process and the number of predicted boxes in each cell, respectively; the coefficient M_kij^obj indicates whether the j-th predicted box of the i-th cell in the K-th output feature map is a positive sample; BCE_sig^obj denotes a binary cross-entropy loss function, and w_obj and w_cls denote the positive-sample weights. x_p and x_gt denote the prediction vector and the ground-truth vector respectively; α_box, α_obj, α_cls, and α_seg denote the weights of the position error, confidence error, classification error, and segmentation error, respectively. The segmentation loss function L_seg uses binary cross-entropy, where P is the h×w×k matrix of prototype masks and C is the n×k matrix of mask coefficients for the n instances retained after NMS and thresholding; σ denotes the sigmoid function, and the predicted mask M_p is combined with the ground-truth mask M_gt and passed into the binary cross-entropy calculation.
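A sketch of the SIoU bounding-box loss following the published SIoU formulation (angle, distance, shape, and overlap terms); the (x1, y1, x2, y2) box format and θ = 4 are assumptions, not values given in this disclosure:

```python
import math

def siou_loss(pred, gt, theta=4.0):
    """Sketch of SIoU: 1 - IoU + (distance cost + shape cost) / 2,
    with the distance cost modulated by the angle cost."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1
    # overlap cost (IoU)
    ix = max(0.0, min(px2, gx2) - max(px1, gx1))
    iy = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = ix * iy
    iou = inter / (pw * ph + gw * gh - inter)
    # minimum enclosing rectangle of the two boxes
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    # angle cost from the box centers
    dx = abs((gx1 + gx2) / 2 - (px1 + px2) / 2)
    dy = abs((gy1 + gy2) / 2 - (py1 + py2) / 2)
    sigma = math.hypot(dx, dy) + 1e-9
    angle = 1 - 2 * math.sin(math.asin(dy / sigma) - math.pi / 4) ** 2
    # distance cost, modulated by the angle cost
    gamma = 2 - angle
    delta = ((1 - math.exp(-gamma * (dx / cw) ** 2))
             + (1 - math.exp(-gamma * (dy / ch) ** 2)))
    # shape cost
    omega = ((1 - math.exp(-abs(pw - gw) / max(pw, gw))) ** theta
             + (1 - math.exp(-abs(ph - gh) / max(ph, gh))) ** theta)
    return 1 - iou + (delta + omega) / 2

loss_same = siou_loss((0, 0, 10, 10), (0, 0, 10, 10))   # identical boxes
loss_off = siou_loss((2, 2, 12, 12), (0, 0, 10, 10))    # shifted box
```

Identical boxes give a loss of 0, and any offset raises the loss, matching the role of a bounding-box regression target.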
Further, after identifying the image units of the plurality of objects to be identified in the photographed object, the method further includes:
step S310, dividing the image unit according to the identification result to obtain a plurality of image data of the object to be identified.
Step S320, the sizes of the image data of the plurality of objects to be identified are adjusted to a preset size.
Step S330, obtaining the type and state information of the object to be identified based on the image data of the object to be identified with the preset size.
In one embodiment of the invention, mAP0.5 and mAP0.5:0.95 (IoU step 0.05) are used to evaluate the segmentation performance of the model. Computing mAP requires Precision and Recall.
Here, True Positive (TP) means that the IoU between the predicted mask and the ground truth is greater than the prescribed threshold, False Positive (FP) means that the IoU between the predicted mask and the ground truth does not satisfy the prescribed threshold, and False Negative (FN) means that there is no intersection between the predicted mask and the ground truth. The mAP is calculated as follows:
where AP represents the area under the Precision-Recall curve of each class, and mAP0.5 represents the average of the APs over all classes when the IoU threshold is set to 0.5. mAP0.5:0.95 averages over the IoU thresholds 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, and 0.95, which is obviously much more stringent than mAP0.5; both values range from 0 to 1.
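The AP and mAP computation described here can be sketched as follows; the all-points interpolation of the precision-recall curve is one common convention, assumed rather than specified by this disclosure:

```python
def voc_ap(recalls, precisions):
    """Area under a precision-recall curve (all-points interpolation)."""
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # make the precision envelope monotonically non-increasing
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1]
               for i in range(len(mrec) - 1))

def mean_ap(ap_per_threshold):
    """mAP: mean AP over classes, then over IoU thresholds
    (0.5:0.05:0.95 gives ten thresholds for mAP0.5:0.95)."""
    per_thr = [sum(aps) / len(aps) for aps in ap_per_threshold]
    return sum(per_thr) / len(per_thr)

# toy PR curve: recall 0.5 at precision 1.0, recall 1.0 at precision 0.5
ap50 = voc_ap([0.5, 1.0], [1.0, 0.5])
```

For mAP0.5:0.95, `mean_ap` would be fed one list of per-class APs for each of the ten IoU thresholds.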
TABLE 1
As can be seen from Table 1, on the power distribution cabinet components the proposed method scores higher than the other algorithms on both mAP0.5 and mAP0.5:0.95.
As can be seen from figs. 5a to 5l, testing the visible light images and the corresponding depth images of 3 different shooting angles of the power distribution cabinet components shows that all the networks segment well in the visible light scene. However, in the low light scenario, the RGB mode fails to segment the rotary switch knob (ee_rot_swch_hdl), the load switch handle (ee_load_swchl_hdl), and the door handle (ee_gt_lk_hdl). The CBAM mode can segment the door handle but misses the rotary switch knob and the load switch handle; the CFT mode also misses them. The image recognition method based on cross-modal feature fusion of the present invention avoids these problems.
The method solves the recognition and operation difficulties that a power distribution cabinet robot faces in low-illumination or night scenes. The newly designed Transformer modules are densely inserted into a dual-stream network framework; the self-attention mechanism cooperates with the staggered attention mechanism to capture the relations between different positions in a sequence, model long-range dependencies, and integrate global context information, so that the spatial relations of the components can be accurately understood. By considering the global information around the components, the robot can accurately locate and segment the targets. Finally, the SIoU loss function replaces the original CIoU loss function; SIoU adds the angle problem on the basis of CIoU, comprising angle loss, distance loss, shape loss, and overlap loss, which improves the training convergence speed and segmentation precision of the model. The model has low resource consumption and is easy to deploy on edge devices.
Accordingly, referring to fig. 6, a second aspect of the embodiment of the present invention provides an image recognition system based on cross-modal feature fusion, including:
an image acquisition module 1 for acquiring an RGB image and a depth image of a photographic subject;
The image recognition module 2 is used for recognizing RGB images and depth images based on the cross-modal feature fusion model, recognizing image units of a plurality of targets to be recognized in the shooting object, and acquiring the types and state information of the targets to be recognized according to the image units of the targets to be recognized;
the cross-modal feature fusion model is used for carrying out feature extraction on the RGB image and the depth image, obtaining features of multiple levels of the RGB image and the depth image, and utilizing a self-attention mechanism, a staggered attention mechanism and a multi-head attention mechanism to fuse complementary semantic information between the features of the RGB image and the depth image, so as to fuse the features of multiple scales step by step.
Further, referring to fig. 7, the image recognition module 2 includes:
an image segmentation unit 21 for segmenting the image units according to the recognition result to obtain image data of a plurality of objects to be recognized;
an image adjustment unit 22 for adjusting the sizes of the image data of the plurality of objects to be identified to a preset size;
an information acquisition unit 23 for acquiring the kind and state information of the object to be recognized based on the image data of the object to be recognized of a preset size.
Further, referring to fig. 8, the image recognition system based on cross-modal feature fusion further includes: model training module 3, model training module 3 includes:
a historical data acquisition unit 31 for acquiring historical image data of the shooting object under various shooting conditions, the historical image data including: a plurality of historical RGB images of the shooting object and corresponding historical depth images;
the model recognition training unit 32 is configured to perform recognition training of the target to be recognized on the cross-modal feature fusion model based on the historical image data of a preset proportion.
Further, the cross-modal feature fusion model includes: a Backbone portion, a Neck portion, and a Head portion;
the Backbone portion receives the RGB image and the depth image respectively; the convolution modules extract features of the RGB image and the depth image at multiple scales, the feature fusion modules perform feature fusion to obtain feature maps at multiple scales, and the feature maps are sent to the Neck portion through the channel attention modules respectively;
the Neck portion extracts the features output by the channel attention modules, performs fusion processing across scales, and sends the fused features to the Head portion;
the Head part determines a segmented region of the object to be identified according to the features.
Further, the Backbone portion includes a first branch that receives the RGB image and a second branch that receives the depth image;
the first branch and the second branch are each provided with a plurality of corresponding picture feature extraction units, and each picture feature extraction unit includes: a Conv module, a C3 module and/or an SPPF module;
the first branch and the second branch are provided with a TransSACA module corresponding to each picture feature extraction unit; the TransSACA module receives the features extracted by the corresponding picture feature extraction units of the first branch and the second branch respectively, and after fusion the features are sent back to the corresponding branches respectively.
Further, the TransSACA module adopts a multi-modal feature fusion mechanism: the first input is the RGB image convolution feature map and the second input is the D image convolution feature map; each is flattened and reshaped into a matrix sequence, and the input sequence of the Transformer module is obtained after adding the position embedding;
based on the input sequence of the Transformer module, the self-attention mechanism uses the dot product of Q_RGB and K_RGB to calculate the attention weight, which is then multiplied by V_RGB to obtain the output Z_saRGB (and likewise Z_saD); the cross-attention mechanism uses the dot product of Q_D and K_RGB to calculate the attention weight, which is then multiplied by V_RGB to obtain the output Z_caRGB (and likewise Z_caD);
processing based on a multi-layer perceptron model then follows, comprising a two-layer fully connected feed-forward network with a GELU activation function in between, computing the outputs X_OUT^RGB and X_OUT^D, whose dimensions are the same as those of the input sequence; the outputs are reshaped into C×H×W feature mappings F_OUT^RGB and F_OUT^D and fed back into each individual modality branch using element-wise summation with the existing feature mappings.
Further, the loss function of the Head part is a bounding box regression loss function;
wherein the bounding box regression loss function comprises: the sum of the bounding box regression loss, confidence loss, classification loss, and mask regression loss.
Accordingly, a third aspect of the embodiment of the present invention provides an electronic device, including: at least one processor; and a memory coupled to the at least one processor; the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the above image recognition method based on cross-modal feature fusion.
Accordingly, a fourth aspect of embodiments of the present invention provides a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the above-described cross-modality feature fusion-based image recognition method.
The embodiment of the invention aims to protect an image recognition method and system based on cross-modal feature fusion, wherein the method comprises the following steps: acquiring an RGB image and a depth image of a shooting object; identifying RGB images and depth images based on the cross-modal feature fusion model, identifying image units of a plurality of targets to be identified in a shooting object, and acquiring the types and state information of the targets to be identified according to the image units of the targets to be identified; the cross-modal feature fusion model is used for carrying out feature extraction on the RGB image and the depth image, obtaining features of multiple levels of the RGB image and the depth image, and utilizing a self-attention mechanism, a staggered attention mechanism and a multi-head attention mechanism to fuse complementary semantic information between the features of the RGB image and the depth image, so as to fuse the features of multiple scales step by step. The technical scheme has the following effects:
In order to meet the requirement of segmenting the targets of power distribution cabinet components in a dynamic environment, a depth image captured by a depth camera is introduced as another modality; features of each modality at different scales are extracted based on a CNN, and complementary fusion between the different modalities is carried out through the Transformer module, so that the perception capability, reliability, and robustness of the target positioning and segmentation algorithm are improved. The model has a small structure and low resource consumption, and is easy to deploy on edge devices.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.
Claims (10)
1. An image recognition method based on cross-modal feature fusion is characterized by comprising the following steps:
acquiring an RGB image and a depth image of a shooting object;
identifying the RGB image and the depth image based on the cross-modal feature fusion model, identifying a plurality of image units of targets to be identified in the shooting object, and acquiring the type and state information of the targets to be identified according to the image units of the targets to be identified;
the cross-modal feature fusion model performs feature extraction on the RGB image and the depth image, obtains features of multiple levels of the RGB image and the depth image, fuses complementary semantic information between the RGB image and the depth image features by using a self-attention mechanism, a staggered attention mechanism and a multi-head attention mechanism, and fuses the features of multiple scales step by step.
2. The method for identifying images based on cross-modal feature fusion according to claim 1, further comprising, before identifying the RGB image and the depth image based on the cross-modal feature fusion model:
acquiring historical image data of the shooting object under various shooting conditions, wherein the historical image data comprises: a plurality of historical RGB images of the shooting object and corresponding historical depth images;
And based on the historical image data of a preset proportion, performing recognition training on the target to be recognized on the cross-modal feature fusion model.
3. The method for identifying images based on cross-modal feature fusion as claimed in claim 2, wherein,
the cross-modal feature fusion model comprises: a Backbone portion, a Neck portion, and a Head portion;
the back bone part receives the RGB image and the depth image respectively, extracts the characteristics of a plurality of scales of the RGB image and the depth image through a convolution module, obtains characteristic diagrams of a plurality of scales after characteristic fusion through a plurality of corresponding characteristic fusion modules, and sends the characteristic diagrams to the Neck part through a channel attention module respectively;
the Neck part extracts and performs the fusion processing on the scale of the characteristics output by the channel attention module, and sends the characteristics after the fusion processing to the Head part;
and the Head part determines the segmentation area of the object to be identified according to the characteristics.
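The claim names a channel attention module between the Backbone and the Neck without disclosing its internal form. The sketch below is a minimal squeeze-and-excitation-style channel attention over a (C, H, W) feature map — an illustrative stand-in with fixed random weights, not the patented design:

```python
import numpy as np

def channel_attention(feat: np.ndarray, reduction: int = 4) -> np.ndarray:
    """Reweight the channels of a (C, H, W) feature map.

    Squeeze-and-excitation-style gate: global average pool, a small
    two-layer bottleneck, then a per-channel sigmoid scale. The weights
    here are fixed random values purely for illustration.
    """
    c, _, _ = feat.shape
    rng = np.random.default_rng(0)
    w1 = rng.standard_normal((c // reduction, c)) * 0.1
    w2 = rng.standard_normal((c, c // reduction)) * 0.1
    squeeze = feat.mean(axis=(1, 2))              # (C,) global average pool
    hidden = np.maximum(w1 @ squeeze, 0.0)        # ReLU bottleneck
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid gate in (0, 1)
    return feat * scale[:, None, None]            # per-channel reweighting

gated = channel_attention(np.ones((8, 4, 4)))     # same shape as the input
```

Because the gate is strictly between 0 and 1, the module can only attenuate channels, never amplify them — a common design choice that keeps the fused features bounded.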
4. The method for identifying images based on cross-modal feature fusion as claimed in claim 3,
the Backbone part includes a first branch that receives the RGB image and a second branch that receives the depth image;
The first branch and the second branch are respectively provided with a plurality of corresponding picture feature extraction units, and the picture feature extraction units comprise: a Conv module, a C3 module and/or an SPPF module;
the first branch and the second branch are provided with TransSACA modules corresponding to the picture feature extraction units; each TransSACA module receives the features extracted by the corresponding picture feature extraction units of the first branch and the second branch respectively, and after feature fusion sends the fused features back to the corresponding branches.
5. The method for identifying images based on cross-modal feature fusion as claimed in claim 4, wherein,
the TransSACA module adopts a multi-modal feature fusion mechanism: its first input is an RGB image convolution feature map and its second input is a depth image (D image) convolution feature map; the RGB image convolution feature map and the D image convolution feature map are respectively flattened and reshaped into matrix sequences, and position embeddings are added to obtain the input sequences of the Transformer module;
based on the input sequences of the Transformer module, the self-attention mechanism uses the similarity of Q_RGB and K_RGB (and, correspondingly, of Q_D and K_D) to calculate attention weights, which are multiplied by V_RGB (respectively V_D) to obtain the outputs Z_saRGB and Z_saD; the cross-attention mechanism uses the similarity of Q_D and K_RGB to calculate attention weights, which are multiplied by V_RGB (and, correspondingly, Q_RGB with K_D and V_D) to obtain the outputs Z_caRGB and Z_caD;
the results are then processed by a multi-layer perceptron model comprising a two-layer fully connected feed-forward network with a GELU activation function between the layers, calculating the outputs X_OUT_RGB and X_OUT_D; X_OUT_RGB and X_OUT_D have the same dimensions as the input sequences, and are reshaped into feature mappings F_OUT_RGB and F_OUT_D of size C×H×W, which are fed back into each individual modality branch by element-wise summation with the existing feature mappings.
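The self- and cross-attention steps of this claim reduce to scaled dot-product attention with Q, K and V drawn from the same or the opposite modality. A toy NumPy sketch follows; the sequence length and embedding size are illustrative values, not figures from the patent:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    # scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
x_rgb = rng.standard_normal((4, 8))   # flattened RGB tokens (seq_len, dim)
x_d = rng.standard_normal((4, 8))     # flattened depth tokens

# self-attention: Q, K and V all come from the same modality
z_sa_rgb = attention(x_rgb, x_rgb, x_rgb)
z_sa_d = attention(x_d, x_d, x_d)

# cross-attention: the query comes from the opposite modality
z_ca_rgb = attention(x_d, x_rgb, x_rgb)   # Q_D with K_RGB, V_RGB
z_ca_d = attention(x_rgb, x_d, x_d)       # Q_RGB with K_D, V_D
```

The outputs keep the input sequence shape, which is what allows the later element-wise summation back into each modality branch.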
6. The method for identifying images based on cross-modal feature fusion as claimed in claim 3,
the loss function of the Head part is a bounding box regression loss function;
wherein the bounding box regression loss function comprises: the sum of the bounding box regression loss, confidence loss, classification loss, and mask regression loss.
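The claim states only that the Head loss is the sum of four terms; it does not fix the form of each term. The sketch below uses common stand-ins (1 − IoU for the box term, binary cross-entropy for the confidence, classification and mask terms) purely for illustration:

```python
import numpy as np

def bce(p: np.ndarray, y: np.ndarray) -> float:
    """Binary cross-entropy; an illustrative choice, not the patent's formula."""
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean())

def iou_loss(pred, gt) -> float:
    """1 - IoU for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return 1.0 - inter / (area(pred) + area(gt) - inter)

def head_loss(pred_box, gt_box, obj_p, obj_y, cls_p, cls_y, mask_p, mask_y):
    # total = box regression + confidence + classification + mask regression
    return (iou_loss(pred_box, gt_box) + bce(obj_p, obj_y)
            + bce(cls_p, cls_y) + bce(mask_p, mask_y))

loss = head_loss((0, 0, 2, 2), (1, 1, 3, 3),
                 np.array([0.9]), np.array([1.0]),   # confidence
                 np.array([0.9]), np.array([1.0]),   # classification
                 np.array([0.1]), np.array([0.0]))   # mask
```

In practice the four terms are usually weighted; the claim only specifies a sum, so no weights are shown here.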
7. The cross-modal feature fusion-based image recognition method as claimed in any one of claims 1 to 6, further comprising, after identifying the image units of the plurality of targets to be identified in the shooting object:
dividing the image unit according to the identification result to obtain image data of a plurality of targets to be identified;
adjusting the sizes of the image data of the plurality of targets to be identified to a preset size;
and acquiring the type and state information of the target to be identified based on the image data of the target to be identified with a preset size.
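The resize step in this claim can be sketched with a plain nearest-neighbour resize; the 64×64 target is an illustrative stand-in for the unspecified preset size:

```python
import numpy as np

def resize_nearest(img: np.ndarray, size=(64, 64)) -> np.ndarray:
    """Nearest-neighbour resize of an (H, W, C) crop to a preset size."""
    h, w = img.shape[:2]
    rows = np.arange(size[0]) * h // size[0]   # source row per target row
    cols = np.arange(size[1]) * w // size[1]   # source column per target column
    return img[rows][:, cols]

# crops of differing sizes become one uniform batch for classification
crops = [np.ones((30, 20, 3)), np.ones((50, 70, 3))]
batch = np.stack([resize_nearest(c) for c in crops])
```

Stacking only works once every crop has the same preset size, which is why the resize precedes the type-and-state step.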
8. An image recognition system based on cross-modal feature fusion, comprising:
an image acquisition module for acquiring an RGB image and a depth image of a shooting object;
the image recognition module is used for recognizing the RGB image and the depth image based on the cross-modal feature fusion model, recognizing image units of a plurality of targets to be recognized in the shooting object, and acquiring the type and state information of the targets to be recognized according to the image units of the targets to be recognized;
wherein the cross-modal feature fusion model performs feature extraction on the RGB image and the depth image to obtain features of the RGB image and the depth image at a plurality of levels, fuses complementary semantic information between the RGB image features and the depth image features using a self-attention mechanism, a cross-attention mechanism and a multi-head attention mechanism, and fuses the features of a plurality of scales step by step.
9. An electronic device, comprising: at least one processor; and a memory coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the cross-modal feature fusion-based image recognition method of any one of claims 1-7.
10. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the cross-modal feature fusion based image recognition method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311063209.8A CN117036891B (en) | 2023-08-22 | 2023-08-22 | Cross-modal feature fusion-based image recognition method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117036891A true CN117036891A (en) | 2023-11-10 |
CN117036891B CN117036891B (en) | 2024-03-29 |
Family
ID=88624313
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311063209.8A Active CN117036891B (en) | 2023-08-22 | 2023-08-22 | Cross-modal feature fusion-based image recognition method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117036891B (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021115159A1 (en) * | 2019-12-09 | 2021-06-17 | 中兴通讯股份有限公司 | Character recognition network model training method, character recognition method, apparatuses, terminal, and computer storage medium therefor |
CN113989340A (en) * | 2021-10-29 | 2022-01-28 | 天津大学 | Point cloud registration method based on distribution |
CN114419323A (en) * | 2022-03-31 | 2022-04-29 | 华东交通大学 | Cross-modal learning and domain self-adaptive RGBD image semantic segmentation method |
CN114494215A (en) * | 2022-01-29 | 2022-05-13 | 脉得智能科技(无锡)有限公司 | Transformer-based thyroid nodule detection method |
CN114693952A (en) * | 2022-03-24 | 2022-07-01 | 安徽理工大学 | RGB-D significance target detection method based on multi-modal difference fusion network |
CN114973411A (en) * | 2022-05-31 | 2022-08-30 | 华中师范大学 | Self-adaptive evaluation method, system, equipment and storage medium for attitude motion |
CN115713679A (en) * | 2022-10-13 | 2023-02-24 | 北京大学 | Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map |
CN115908789A (en) * | 2022-12-09 | 2023-04-04 | 大连民族大学 | Cross-modal feature fusion and asymptotic decoding saliency target detection method and device |
WO2023056889A1 (en) * | 2021-10-09 | 2023-04-13 | 百果园技术(新加坡)有限公司 | Model training and scene recognition method and apparatus, device, and medium |
CN116052108A (en) * | 2023-02-21 | 2023-05-02 | 浙江工商大学 | Transformer-based traffic scene small sample target detection method and device |
CN116206133A (en) * | 2023-04-25 | 2023-06-02 | 山东科技大学 | RGB-D significance target detection method |
CN116310396A (en) * | 2023-02-28 | 2023-06-23 | 安徽理工大学 | RGB-D significance target detection method based on depth quality weighting |
CN116385761A (en) * | 2023-01-31 | 2023-07-04 | 同济大学 | 3D target detection method integrating RGB and infrared information |
CN116452805A (en) * | 2023-04-15 | 2023-07-18 | 安徽理工大学 | Transformer-based RGB-D semantic segmentation method of cross-modal fusion network |
Non-Patent Citations (4)
Title |
---|
FANG QINGYUN et al.: "Cross-Modality Fusion Transformer for Multispectral Object Detection", 《ARXIV:2111.00273V4》, pages 1 - 10 *
QINGJUN RU et al.: "Cross-Modal Transformer for RGB-D semantic segmentation of production workshop objects", 《PATTERN RECOGNITION》, pages 1 - 12 *
ZONGWEI WU et al.: "Transformer Fusion for Indoor RGB-D Semantic Segmentation", 《COMPUTER VISION AND IMAGE UNDERSTANDING》, pages 1 - 15 *
ZHANG FEI: "Research on person localization and tracking methods based on multimodal data", 《CNKI Dissertations》, vol. 2023, no. 02 *
Also Published As
Publication number | Publication date |
---|---|
CN117036891B (en) | 2024-03-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20200272806A1 (en) | Real-Time Tracking of Facial Features in Unconstrained Video | |
CN114666564B (en) | Method for synthesizing virtual viewpoint image based on implicit neural scene representation | |
CN110599395A (en) | Target image generation method, device, server and storage medium | |
CN112036260B (en) | Expression recognition method and system for multi-scale sub-block aggregation in natural environment | |
CN112001859A (en) | Method and system for repairing face image | |
JP2021163503A (en) | Three-dimensional pose estimation by two-dimensional camera | |
CN113095106A (en) | Human body posture estimation method and device | |
CN114783024A (en) | Face recognition system of gauze mask is worn in public place based on YOLOv5 | |
CN111062263A (en) | Method, device, computer device and storage medium for hand pose estimation | |
CN112200157A (en) | Human body 3D posture recognition method and system for reducing image background interference | |
CN112580434B (en) | Face false detection optimization method and system based on depth camera and face detection equipment | |
CN111160291A (en) | Human eye detection method based on depth information and CNN | |
Zheng et al. | Feater: An efficient network for human reconstruction via feature map-based transformer | |
CN116958420A (en) | High-precision modeling method for three-dimensional face of digital human teacher | |
CN114494594B (en) | Deep learning-based astronaut operation equipment state identification method | |
CN106845555A (en) | Image matching method and image matching apparatus based on Bayer format | |
CN113065506B (en) | Human body posture recognition method and system | |
JP2021176078A (en) | Deep layer learning and feature detection through vector field estimation | |
JP2021163502A (en) | Three-dimensional pose estimation by multiple two-dimensional cameras | |
CN116681687B (en) | Wire detection method and device based on computer vision and computer equipment | |
CN117036891B (en) | Cross-modal feature fusion-based image recognition method and system | |
KR20210018114A (en) | Cross-domain metric learning system and method | |
CN114863487A (en) | One-stage multi-person human body detection and posture estimation method based on quadratic regression | |
CN107403145A (en) | Image characteristic points positioning method and device | |
Agusta et al. | Field seeding algorithm for people counting using kinect depth image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||