CN117036891A - Cross-modal feature fusion-based image recognition method and system - Google Patents

Cross-modal feature fusion-based image recognition method and system

Info

Publication number
CN117036891A
CN117036891A (application number CN202311063209.8A)
Authority
CN
China
Prior art keywords
image
rgb
cross
module
feature fusion
Prior art date
Legal status
Granted
Application number
CN202311063209.8A
Other languages
Chinese (zh)
Other versions
CN117036891B (en)
Inventor
吴波
战秋成
郑随兵
Current Assignee
Realman Intelligent Technology Beijing Co ltd
Original Assignee
Realman Intelligent Technology Beijing Co ltd
Priority date
Filing date
Publication date
Application filed by Realman Intelligent Technology Beijing Co ltd filed Critical Realman Intelligent Technology Beijing Co ltd
Priority to CN202311063209.8A
Publication of CN117036891A
Application granted
Publication of CN117036891B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/048: Activation functions
    • G06N 3/0499: Feedforward networks
    • G06N 3/08: Learning methods
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/766: using regression, e.g. by projecting features on hyperplanes
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion of extracted features
    • G06V 10/809: Fusion of classification results, e.g. where the classifiers operate on the same input data
    • G06V 10/811: the classifiers operating on different input data, e.g. multi-modal recognition
    • G06V 10/82: Arrangements for image or video recognition or understanding using neural networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image recognition method and system based on cross-modal feature fusion. The method comprises the following steps: acquiring an RGB image and a depth image of a photographed object; recognizing the RGB image and the depth image based on a cross-modal feature fusion model, identifying the image units of a plurality of targets to be recognized in the photographed object, and acquiring the type and state information of each target to be recognized according to its image unit. The cross-modal feature fusion model performs feature extraction on the RGB image and the depth image to obtain features of the RGB image and the depth image at multiple levels, and uses a self-attention mechanism, a cross-attention mechanism and a multi-head attention mechanism to fuse the complementary semantic information between the RGB and depth image features, fusing the features of multiple scales step by step. By introducing a depth image captured by a depth camera as a second modality and performing recognition with the improved model, the target segmentation requirements for power distribution cabinet components in a dynamic environment are met.

Description

Cross-modal feature fusion-based image recognition method and system
Technical Field
The invention relates to the technical field of image processing, in particular to an image recognition method and system based on cross-modal feature fusion.
Background
The power distribution cabinet is a vital device in the power system. It distributes, controls and protects electric energy, ensures safe operation of the power system, provides a stable and reliable power supply, and protects circuits and devices from power faults. Robot technology plays an important role in power distribution cabinet operation and maintenance: through accurate movement and control, robots carry out tasks such as fault detection and component operation, greatly improving working efficiency and safety. Computer vision gives the robot enhanced sensing and recognition capability when operating on the power distribution cabinet; using computer vision, the robot can accurately recognize the equipment, connectors and so on inside the cabinet and acquire the related data and image information, providing important guidance and support for the robot's operations. However, the working environment of the power distribution cabinet is usually closed and dynamic, and when a robot replaces a worker for daily operations, the model must cope with the challenges brought by the dynamic environment, such as shadow occlusion, insufficient light and low resolution. Under these conditions it is difficult for an algorithm that uses only visible light to achieve high precision, so a depth camera is used to provide both a visible light image and a depth image; by fusing the complementarity of the different modalities, the perceptibility, reliability and robustness of the target positioning and segmentation algorithm can be improved.
With the development of convolutional neural networks, dual-stream networks based on CNNs (convolutional neural networks) have emerged for target detection and segmentation. In previous work, no matter how the modal fusion mechanism is designed, it is carried out on a convolutional neural network, for example in the cross-modal learning and domain-adaptive RGBD image semantic segmentation method (CN114419323A). A CNN has strong representation and learning capacity for reasoning within a single modality: it can effectively capture local features in the input data through convolution operations, handles data with spatial structure such as images well, and has the advantages of local perception and multi-level feature representation. Compared with a CNN, a Transformer model is based on global attention and on the correlation between keys and queries; it can model long-range dependencies and capture global information. Combining a convolutional neural network with a Transformer makes it possible to consider local and global information at the same time, extract strong feature representations and solve the long-range dependency problem, as in the abdominal CT image multi-organ segmentation method, device and terminal equipment (CN116030259A), which combines a CNN and a Transformer and applies them to the field of target detection.
However, among the above solutions, the cross-modal learning and domain-adaptive RGBD image semantic segmentation method (CN114419323A) uses a convolutional neural network model and achieves good segmentation performance, but the locality of the convolution operation makes it difficult for the model to learn long-range dependencies in the image beyond the receptive field, which limits, to a certain extent, its ability to handle details such as texture, shape and size changes in the image. Convolutional neural networks may therefore face challenges in image tasks with long-range dependencies and may be unable to capture global features and context information of the image.
In the abdominal CT image multi-organ segmentation method, device and terminal equipment (CN116030259A), a vision-Transformer-based model adopts a self-attention mechanism to model the global information of the image, and the overall model improves target segmentation precision through its multi-scale global semantic feature extraction capability. However, this single-modality Transformer model is applied in a very narrow, fixed scene, and there are limitations when performing target segmentation in real scenes: the real-world environment is usually open and dynamic, with shadow occlusion, over- and under-exposure, and low resolution, under which a single-modality segmentation algorithm can hardly achieve high segmentation accuracy.
In order to meet the requirement of segmenting power distribution cabinet components in a dynamic environment, a depth image captured by a depth camera is introduced as a second modality: the features of each modality at different scales are extracted based on a CNN, and complementary fusion between the different modalities is carried out through a Transformer module, which improves the perceptibility, reliability and robustness of the target positioning and segmentation algorithm. The model is small in structure and low in resource consumption, and is therefore easy to deploy on edge devices.
Disclosure of Invention
The embodiments of the invention aim to provide an image recognition method and system based on cross-modal feature fusion that meet the requirement of segmenting power distribution cabinet components in a dynamic environment. By introducing a depth image captured by a depth camera as a second modality, extracting the features of each modality at different scales based on a CNN, and then carrying out complementary fusion between the different modalities through a Transformer module, the model is kept small in structure and low in resource consumption and is easy to deploy on edge devices.
In order to solve the technical problem, a first aspect of the embodiments of the present invention provides an image recognition method based on cross-modal feature fusion, including the following steps:
Acquiring an RGB image and a depth image of a shooting object;
identifying the RGB image and the depth image based on the cross-modal feature fusion model, identifying a plurality of image units of targets to be identified in the shooting object, and acquiring the type and state information of the targets to be identified according to the image units of the targets to be identified;
the cross-modal feature fusion model performs feature extraction on the RGB image and the depth image to obtain features of the RGB image and the depth image at multiple levels, fuses the complementary semantic information between the RGB image and depth image features using a self-attention mechanism, a cross-attention mechanism and a multi-head attention mechanism, and fuses the features of multiple scales step by step.
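For orientation only, the two steps above might be driven in code roughly as sketched below; every name here, including the model class and its predict call, is a hypothetical placeholder rather than part of the claimed method.

```python
import numpy as np

def recognize_components(model, rgb_image: np.ndarray, depth_image: np.ndarray):
    """Run a cross-modal feature fusion model on one RGB/depth pair.

    Returns, for every recognized image unit, the kind and state information of the
    target, mirroring the two steps listed above.
    """
    detections = model.predict(rgb_image, depth_image)  # hypothetical inference call
    results = []
    for det in detections:
        results.append({
            "unit": det["mask"],    # segmented image unit of the target to be identified
            "type": det["type"],    # kind of the target, read from its image unit
            "state": det["state"],  # state information, read from its image unit
        })
    return results
```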
Further, before the identifying the RGB image and the depth image based on the cross-modal feature fusion model, the method further includes:
acquiring historical image data of the shooting object under various shooting conditions, wherein the historical image data comprises: a plurality of historical RGB images of the shooting object and corresponding historical depth images;
and based on the historical image data of a preset proportion, performing recognition training on the target to be recognized on the cross-modal feature fusion model.
Further, the cross-modal feature fusion model includes: a Backbone part, a Neck part and a Head part;
the Backbone part receives the RGB image and the depth image respectively, extracts features of the RGB image and the depth image at multiple scales through convolution modules, obtains feature maps of multiple scales after feature fusion through a plurality of corresponding feature fusion modules, and sends the feature maps to the Neck part through channel attention modules respectively;
the Neck part performs scale-wise extraction and fusion on the features output by the channel attention modules, and sends the fused features to the Head part;
and the Head part determines the segmentation area of the object to be identified according to the characteristics.
Further, the Backbone part includes a first branch that receives the RGB image and a second branch that receives the depth image;
the first branch and the second branch are respectively provided with a plurality of corresponding picture feature extraction units, and the picture feature extraction units comprise: a Conv module, a C3 module and/or an SPPF module;
the first branch and the second branch are provided with TransSACA modules corresponding to the picture feature extraction units; each TransSACA module receives the features extracted by the corresponding picture feature extraction units of the first branch and the second branch, fuses these features, and sends the fused features back to the respective branches.
Further, the TransSACA module adopts a multi-modal feature fusion mechanism: its first input is an RGB image convolution feature map and its second input is a D image convolution feature map; the RGB image convolution feature map and the D image convolution feature map are each flattened and rearranged into matrix sequences, and the input sequences of the Transformer module are obtained after position embeddings are added;
based on the input sequences of the Transformer module, the self-attention mechanism uses the similarity of Q_RGB and K_RGB to calculate attention weights, which are multiplied by V_RGB to obtain the output Z_saRGB (Z_saD is obtained analogously); the cross-attention mechanism uses the similarity of Q_D and K_RGB to calculate attention weights, which are multiplied by V_RGB to obtain the output Z_caRGB (Z_caD is obtained analogously);
Processing is then performed by a multi-layer perceptron model comprising two fully connected feed-forward layers with a GELU activation function in between, which calculates the outputs X_OUT^RGB and X_OUT^D; X_OUT^RGB and X_OUT^D have the same dimensions as the input sequences, and the outputs are reshaped into C×H×W feature maps F_OUT^RGB and F_OUT^D and fed back into each individual modality branch by element-wise summation with the existing feature maps.
Further, the loss function of the Head part is a bounding box regression loss function;
wherein the bounding box regression loss function comprises: the sum of the bounding box regression loss, confidence loss, classification loss, and mask regression loss.
Further, after the identifying the image units of the plurality of objects to be identified in the shooting object, the method further includes:
dividing the image unit according to the identification result to obtain image data of a plurality of targets to be identified;
the sizes of the image data of the plurality of targets to be identified are adjusted to be preset sizes;
and acquiring the type and state information of the target to be identified based on the image data of the target to be identified with a preset size.
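As an illustration of the three steps above, a minimal sketch follows; the detection format, the function name and the use of OpenCV are assumptions, and the preset size shown is only an example.

```python
import cv2  # assumption: OpenCV is used for cropping and resizing; the patent does not name a library

PRESET_SIZE = (224, 224)  # example preset size, not specified by the patent

def extract_component_crops(rgb_image, detections, preset_size=PRESET_SIZE):
    """Segment each recognized image unit, resize it to the preset size, and keep its label.

    `detections` is assumed to be a list of dicts such as
    {"box": (x1, y1, x2, y2), "label": "ee_rot_swch_hdl"}.
    """
    crops = []
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        unit = rgb_image[y1:y2, x1:x2]           # segment the image unit of one target
        unit = cv2.resize(unit, preset_size)     # adjust the image data to the preset size
        crops.append({"label": det["label"], "image": unit})  # used to read type and state
    return crops
```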
Accordingly, a second aspect of the embodiments of the present invention provides an image recognition system based on cross-modal feature fusion, including:
an image acquisition module for acquiring an RGB image and a depth image of a photographic subject;
the image recognition module is used for recognizing the RGB image and the depth image based on the cross-modal feature fusion model, recognizing image units of a plurality of targets to be recognized in the shooting object, and acquiring the type and state information of the targets to be recognized according to the image units of the targets to be recognized;
the cross-modal feature fusion model performs feature extraction on the RGB image and the depth image to obtain features of the RGB image and the depth image at multiple levels, fuses the complementary semantic information between the RGB image and depth image features using a self-attention mechanism, a cross-attention mechanism and a multi-head attention mechanism, and fuses the features of multiple scales step by step.
Further, the image recognition system based on cross-modal feature fusion further comprises: a model training module, the model training module comprising:
a history data acquiring unit configured to acquire history image data of the photographic subject under various photographic conditions, the history image data including: a plurality of historical RGB images of the shooting object and corresponding historical depth images;
the model identification training unit is used for carrying out identification training on the target to be identified on the cross-modal feature fusion model based on the historical image data of the preset proportion.
Further, the cross-modal feature fusion model includes: a Backbone part, a Neck part and a Head part;
the Backbone part receives the RGB image and the depth image respectively, extracts features of the RGB image and the depth image at multiple scales through convolution modules, obtains feature maps of multiple scales after feature fusion through a plurality of corresponding feature fusion modules, and sends the feature maps to the Neck part through channel attention modules respectively;
the Neck part performs scale-wise extraction and fusion on the features output by the channel attention modules, and sends the fused features to the Head part;
And the Head part determines the segmentation area of the object to be identified according to the characteristics.
Further, the Backbone part includes a first branch that receives the RGB image and a second branch that receives the depth image;
the first branch and the second branch are respectively provided with a plurality of corresponding picture feature extraction units, and the picture feature extraction units comprise: a Conv module, a C3 module and/or an SPPF module;
the first branch and the second branch are provided with TransSACA modules corresponding to the picture feature extraction units; each TransSACA module receives the features extracted by the corresponding picture feature extraction units of the first branch and the second branch, fuses these features, and sends the fused features back to the respective branches.
Further, the TransSACA module adopts a multi-modal feature fusion mechanism: its first input is an RGB image convolution feature map and its second input is a D image convolution feature map; the RGB image convolution feature map and the D image convolution feature map are each flattened and rearranged into matrix sequences, and the input sequences of the Transformer module are obtained after position embeddings are added;
based on the input sequences of the Transformer module, the self-attention mechanism uses the similarity of Q_RGB and K_RGB to calculate attention weights, which are multiplied by V_RGB to obtain the output Z_saRGB (Z_saD is obtained analogously); the cross-attention mechanism uses the similarity of Q_D and K_RGB to calculate attention weights, which are multiplied by V_RGB to obtain the output Z_caRGB (Z_caD is obtained analogously);
Processing is then performed by a multi-layer perceptron model comprising two fully connected feed-forward layers with a GELU activation function in between, which calculates the outputs X_OUT^RGB and X_OUT^D; X_OUT^RGB and X_OUT^D have the same dimensions as the input sequences, and the outputs are reshaped into C×H×W feature maps F_OUT^RGB and F_OUT^D and fed back into each individual modality branch by element-wise summation with the existing feature maps.
Further, the loss function of the Head part is a bounding box regression loss function;
wherein the bounding box regression loss function comprises: the sum of the bounding box regression loss, confidence loss, classification loss, and mask regression loss.
Further, the image recognition module includes:
the image segmentation unit is used for segmenting the image unit according to the identification result to obtain image data of a plurality of targets to be identified;
the image adjusting unit is used for adjusting the sizes of the image data of the plurality of targets to be identified to a preset size;
an information acquisition unit for acquiring the kind and state information of the object to be identified based on the image data of the object to be identified of a preset size.
Accordingly, a third aspect of the embodiments of the present invention provides an electronic device, including: at least one processor; and a memory coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the image recognition method based on cross-modal feature fusion described above.
Accordingly, a fourth aspect of embodiments of the present invention provides a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the above-described cross-modality feature fusion-based image recognition method.
The technical scheme provided by the embodiment of the invention has the following beneficial technical effects:
In order to meet the requirement of segmenting power distribution cabinet components in a dynamic environment, a depth image captured by a depth camera is introduced as a second modality; the features of each modality at different scales are extracted based on a CNN, and complementary fusion between the different modalities is carried out through a Transformer module, which improves the perceptibility, reliability and robustness of the target positioning and segmentation algorithm. The model is small in structure and low in resource consumption, and is therefore easy to deploy on edge devices.
Drawings
FIG. 1 is a flowchart of an image recognition method based on cross-modal feature fusion provided by an embodiment of the invention;
fig. 2 is a flowchart for identifying components of a power distribution cabinet according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a cross-modal feature fusion model provided by an embodiment of the present invention;
FIG. 4 is a flow chart of a TransSACA module provided by an embodiment of the present invention;
FIG. 5a is a view of the capture and segmentation at angle 1 using the prior-art RGB model;
FIG. 5b is a view of the capture and segmentation at angle 2 using the prior-art RGB model;
FIG. 5c is a view of the capture and segmentation at angle 3 using the prior-art RGB model;
FIG. 5d is a view of the capture and segmentation at angle 1 using the prior-art CBAM model;
FIG. 5e is a view of the capture and segmentation at angle 2 using the prior-art CBAM model;
FIG. 5f is a view of the capture and segmentation at angle 3 using the prior-art CBAM model;
FIG. 5g is a view of the capture and segmentation at angle 1 using the prior-art CFT model;
FIG. 5h is a view of the capture and segmentation at angle 2 using the prior-art CFT model;
FIG. 5i is a view of the capture and segmentation at angle 3 using the prior-art CFT model;
FIG. 5j is a view of the capture and segmentation at angle 1 using the cross-modal feature fusion model of the present invention;
FIG. 5k is a view of the capture and segmentation at angle 2 using the cross-modal feature fusion model of the present invention;
FIG. 5l is a view of the capture and segmentation at angle 3 using the cross-modal feature fusion model of the present invention;
FIG. 6 is a block diagram of an image recognition system based on cross-modality feature fusion provided by an embodiment of the present invention;
FIG. 7 is a block diagram of an image recognition module provided by an embodiment of the present invention;
FIG. 8 is a block diagram of a model training module provided by an embodiment of the present invention.
Reference numerals:
1. image acquisition module; 2. image recognition module; 21. image segmentation unit; 22. image adjustment unit; 23. information acquisition unit; 3. model training module; 31. historical data acquisition unit; 32. model recognition training unit.
Detailed Description
The objects, technical solutions and advantages of the present invention will become more apparent by the following detailed description of the present invention with reference to the accompanying drawings. It should be understood that the description is only illustrative and is not intended to limit the scope of the invention. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the present invention.
Referring to fig. 1 and 2, a first aspect of the embodiment of the present invention provides an image recognition method based on cross-modal feature fusion, which includes the following steps:
Step S100, an RGB image and a depth image of a photographing object (i.e., a power distribution cabinet) are acquired.
In an optional embodiment of the present invention, the identifiable power distribution cabinet components include: touch screen, temperature controller, emergency stop switch, red self-locking switch, yellow self-locking switch, rocker switch, high-voltage connection board, air switch handle, air switch base, changeover switch handle, changeover switch base, load switch 1 handle, load switch 1 base, load switch 2 handle, load switch 2 base, door lock handle, lock cylinder, voltmeter, indicator lamp, rotary switch handle, rotary switch base, green self-reset switch, and white self-reset switch.
Step S300, identifying RGB images and depth images based on the cross-modal feature fusion model, identifying image units of a plurality of targets to be identified (namely power distribution cabinet components) in the shooting object, and acquiring the types and state information of the targets to be identified according to the image units of the targets to be identified.
The cross-modal feature fusion model is used for performing feature extraction on the RGB image and the depth image to obtain features of the RGB image and the depth image at multiple levels, and for fusing the complementary semantic information between the RGB and depth image features with a self-attention mechanism, a cross-attention mechanism and a multi-head attention mechanism, so as to fuse the features of multiple scales step by step.
Further, before identifying the RGB image and the depth image based on the cross-modal feature fusion model in step S300, the method further includes:
step S210, acquiring historical image data of the shooting subject under various shooting conditions, where the historical image data includes: several historical RGB images of the shooting object and corresponding historical depth images.
Step S220, based on historical image data of a preset proportion, recognition training of the target to be recognized is conducted on the cross-modal feature fusion model.
A large number of historical photographs of the power distribution cabinet components are acquired in advance, preprocessed into input pictures of uniform size, and used to construct a data set. The historical photographs are randomly shuffled and distributed to a training set, a validation set and a test set at a ratio of 8:1:1, and the model is trained on them. The dual-modality branches extract image features of the RGB and D modalities at different scales. A multi-scale Transformer segmentation model is built to fuse images at different scales, and the output feature maps are fed back to the original branches so as to enhance the features of the branches. In the process of model training and parameter tuning, the loss function is optimized: the SIoU loss function replaces the original CIoU loss function, which improves the convergence speed of training and the segmentation accuracy. Training the model according to a preset scheme yields the weight file of the converged model.
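A minimal sketch of the 8:1:1 random split described above; the file naming scheme and dataset directory are hypothetical, since the patent does not specify them.

```python
import random
from pathlib import Path

def split_dataset(sample_ids, seed=0):
    """Randomly shuffle sample identifiers and distribute them 8:1:1 to train/val/test."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_train, n_val = int(0.8 * len(ids)), int(0.1 * len(ids))
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }

# Example usage with hypothetical paired file names such as "0001_rgb.png" / "0001_d.png".
sample_ids = sorted(p.stem.replace("_rgb", "") for p in Path("dataset").glob("*_rgb.png"))
splits = split_dataset(sample_ids)
```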
Specifically, referring to fig. 3, the cross-modal feature fusion model includes a Backbone part, a Neck part and a Head part. The Backbone part receives the RGB image and the depth image respectively, extracts features of the RGB image and the depth image at multiple scales through convolution modules, obtains feature maps of multiple scales after feature fusion through a plurality of corresponding feature fusion modules, and sends the feature maps to the Neck part through channel attention modules respectively. The Neck part performs scale-wise extraction and fusion on the features output by the channel attention modules and sends the fused features to the Head part. The Head part determines the segmentation region of the target to be identified according to the features.
Further, the Backbone part includes a first branch that receives the RGB image and a second branch that receives the depth image. The first branch and the second branch are each provided with a plurality of corresponding picture feature extraction units, and each picture feature extraction unit includes a Conv module, a C3 module and/or an SPPF module. The first branch and the second branch are provided with TransSACA modules corresponding to the picture feature extraction units; each TransSACA module receives the features extracted by the corresponding picture feature extraction units of the first branch and the second branch, fuses these features, and sends the fused features back to the respective branches.
The Conv module consists of a convolution layer, a normalization layer and an activation function: local spatial information is extracted by the convolution operation, the distribution of feature values is normalized by the BN layer, and non-linear transformation capability is introduced by the activation function, so that the input features are converted and extracted. The C3 module improves the feature extraction capability by increasing the convolution depth and the receptive field. The SPPF module performs pooling operations of different sizes on the input feature map to obtain a group of feature maps of different sizes, concatenates them, and reduces the dimension through a fully connected layer to obtain a feature vector of fixed size.
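A minimal PyTorch-style sketch of the Conv and SPPF building blocks described above. YOLOv5-style conventions are assumed here (SiLU activation, sequential max-pooling in SPPF and a 1x1 convolution for the final channel reduction, where the text above mentions a fully connected layer); kernel sizes are illustrative only.

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    """Convolution + batch normalization + activation, as described for the Conv module."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()  # assumption: SiLU activation, as in YOLOv5-style backbones

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """Pooling at several effective sizes, concatenation, then channel reduction."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = Conv(c_in, c_hidden, k=1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.cv2 = Conv(c_hidden * 4, c_out, k=1)

    def forward(self, x):
        x = self.cv1(x)
        p1 = self.pool(x)
        p2 = self.pool(p1)   # stacking the pool gives a larger effective pooling size
        p3 = self.pool(p2)
        return self.cv2(torch.cat([x, p1, p2, p3], dim=1))
```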
The Neck part adopts an FPN feature pyramid: feature maps of different scales are fused together through up-sampling and down-sampling operations to generate a multi-scale feature pyramid. Fusion of features at different levels is realized from top to bottom by up-sampling and merging with coarser-grained feature maps, and the feature maps at different levels are then fused from bottom to top through convolution layers.
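A compact sketch of one top-down fusion step of the FPN described above; the channel-alignment convolutions and nearest-neighbour up-sampling are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class TopDownFuse(nn.Module):
    """Fuse a coarse (low-resolution) feature map into the next finer pyramid level."""
    def __init__(self, c_coarse, c_fine, c_out):
        super().__init__()
        self.lateral = nn.Conv2d(c_fine, c_out, kernel_size=1)   # align channels of the finer map
        self.reduce = nn.Conv2d(c_coarse, c_out, kernel_size=1)  # align channels of the coarser map
        self.smooth = nn.Conv2d(c_out, c_out, kernel_size=3, padding=1)

    def forward(self, coarse, fine):
        up = F.interpolate(self.reduce(coarse), size=fine.shape[-2:], mode="nearest")
        return self.smooth(self.lateral(fine) + up)
```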
Further, referring to fig. 4, the TransSACA module adopts a multi-modal feature fusion mechanism: its first input is the RGB image convolution feature map and its second input is the D image convolution feature map; the RGB and D convolution feature maps are each flattened and rearranged into matrix sequences, and the input sequences of the Transformer module are obtained after position embeddings are added. Based on the input sequences of the Transformer module, the self-attention mechanism uses the similarity of Q_RGB and K_RGB to calculate attention weights, which are multiplied by V_RGB to obtain the output Z_saRGB (Z_saD is obtained analogously), and the cross-attention mechanism uses the similarity of Q_D and K_RGB to calculate attention weights, which are multiplied by V_RGB to obtain the output Z_caRGB (Z_caD is obtained analogously). Processing is then performed by a multi-layer perceptron consisting of two fully connected feed-forward layers with a GELU activation function in between, producing outputs X_OUT^RGB and X_OUT^D; these have the same dimensions as the input sequences and are reshaped into C×H×W feature maps F_OUT^RGB and F_OUT^D, which are fed back into each individual modality branch by element-wise summation with the existing feature maps.
Specifically, the TransSACA module is a multi-modal feature fusion mechanism that uses the self-attention and cross-attention of a Transformer to combine the global contexts of the RGB modality and the D modality owing to their complementarity. Each branch of the module receives as input a sequence of discrete tokens, each token comprising a feature vector representation; the feature vectors are complemented by a position encoding to incorporate a position-induced bias. As shown in the figure, F_IN^RGB ∈ R^(C×H×W) is the RGB image convolution feature map and F_IN^D ∈ R^(C×H×W) is the D image convolution feature map, where C denotes the number of channels, H the image height and W the image width; they are obtained by convolutional feature extraction from the RGB map and the D map respectively:
F_IN^RGB = Φ_RGB(I_RGB), F_IN^D = Φ_D(I_D);
where I_RGB and I_D are the input RGB map and D map respectively, and Φ_RGB and Φ_D are the convolution modules applied to generate the feature maps of the input images of the different modalities. Each feature map is then flattened and rearranged in matrix order, and the Transformer input sequences X_IN^RGB ∈ R^(HW×C) and X_IN^D ∈ R^(HW×C) are obtained after adding position embeddings. A set of queries, keys and values (Q, K and V) for RGB and D is calculated using linear projections, for example:

Q_RGB = X_IN^RGB · W_RGB^Q, K_RGB = X_IN^RGB · W_RGB^K, V_RGB = X_IN^RGB · W_RGB^V (and likewise for the D modality);
where W_RGB^Q, W_D^Q ∈ R^(C×D_Q), W_RGB^K, W_D^K ∈ R^(C×D_W) and W_RGB^V, W_D^V ∈ R^(C×D_V) are weight matrices, and in this module D_Q = D_W = D_V = C. Each attention head uses the similarity of Q_(·) and K_(·) to calculate attention weights, which are then multiplied by V_(·) to obtain the output Z_(·). For example, self-attention uses the similarity of Q_RGB and K_RGB to calculate attention weights, which are multiplied by V_RGB to obtain the output Z_saRGB; Z_saD can be obtained similarly. Cross-attention uses the similarity of Q_D and K_RGB to calculate attention weights, which are multiplied by V_RGB to obtain the output Z_caRGB; Z_caD can be obtained similarly.
Here, √d_k (the square root of the key dimension) is a scaling factor that prevents excessively large dot-product results from driving the softmax function into regions that produce small gradients, and thus controls the magnitude and stability of the attention weights. The multi-head form of the self-attention and cross-attention can further improve the performance of the model by attending differently to the features at different locations. The expression for multi-head attention is as follows:
MultiHead(Q, K, V) = Concat(Z_1, ..., Z_h) · W^O, with Z_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V);

where h represents the number of heads, Z_i represents the output of the i-th attention head, and W^O, W_i^Q, W_i^K, W_i^V ∈ R^(C×C) are all projection matrices.
More specifically, the self-attention module is used to build correlations within a sequence, calculating the correlation of each element with the other elements of the sequence, where Q, K and V come from the same input modality; it analyses long-range dependencies and explores context information to further improve the modality-specific features. Taking the input global feature X_IN^RGB as an example, the output self-attention global feature Z_saRGB can be expressed as follows:

Z_saRGB = softmax(Q_RGB · K_RGB^T / √d_k) · V_RGB
the cross-attention module is then used to process associations between different modalities or different inputs to reduce ambiguity, Q is from the different input modalities, and K and V are the same input modalities, so that effective information exchange and fusion between the different modalities is established, information transfer and complementation between the different modalities are promoted, in the module, a query is acquired from another input feature (e.g., Q D ) And keys in self-input features (e.g. K RGB ) A) to calculate the correlation, expressed as follows:
Here, Z_saRGB and Z_saD are the outputs of the self-attention module, Q_RGB, K_RGB and V_RGB are the corresponding intermediate representations of the RGB image features, and Q_D, K_D and V_D are the corresponding intermediate representations of the D image features.
Finally, processing is carried out using an MLP comprising two fully connected feed-forward layers with a GELU activation function in between, which calculates the outputs X_OUT^RGB and X_OUT^D; their dimensions are the same as those of the input feature maps, so they can be added directly as supplementary information to the original modality branches.

Here X_OUT^RGB and X_OUT^D have the same dimensions as the input sequences; the outputs are then reshaped into C×H×W feature maps F_OUT^RGB and F_OUT^D and fed back into each individual modality branch by element-wise summation with the existing feature maps.
Because the computational overhead of processing a high-resolution feature map is very expensive, the high-resolution feature map is down-sampled with average pooling to a fixed resolution of H = W = 8 before being passed as input to the module, which reduces the amount of computation the Transformer performs on high-resolution feature maps; the output is up-sampled back to the original resolution with bilinear interpolation before the element-wise summation with the existing feature map.
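The fusion path described in the preceding paragraphs can be sketched in PyTorch roughly as follows. This is a minimal sketch under stated assumptions, not the patented implementation: nn.MultiheadAttention (which applies its own Q/K/V projections) stands in for the per-modality projections, the learned position embeddings, the summation of the self- and cross-attention outputs before the MLP and the 4x MLP expansion are all assumptions, and `channels` must be divisible by `heads`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransSACA(nn.Module):
    """Sketch of a TransSACA-style fusion block: self-attention and cross-attention
    between the RGB and D branches, an MLP with GELU, and element-wise feedback."""

    def __init__(self, channels, heads=8, pooled_size=8):
        super().__init__()
        self.pooled_size = pooled_size
        self.pos_rgb = nn.Parameter(torch.zeros(pooled_size * pooled_size, channels))
        self.pos_d = nn.Parameter(torch.zeros(pooled_size * pooled_size, channels))
        # Self-attention within each modality and cross-attention between modalities.
        self.sa_rgb = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.sa_d = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.ca_rgb = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.ca_d = nn.MultiheadAttention(channels, heads, batch_first=True)
        # Two fully connected feed-forward layers with GELU in between.
        self.mlp_rgb = nn.Sequential(nn.Linear(channels, 4 * channels), nn.GELU(),
                                     nn.Linear(4 * channels, channels))
        self.mlp_d = nn.Sequential(nn.Linear(channels, 4 * channels), nn.GELU(),
                                   nn.Linear(4 * channels, channels))

    def forward(self, f_rgb, f_d):
        b, c, h, w = f_rgb.shape
        # Down-sample with average pooling to a fixed 8x8 resolution to limit cost.
        x_rgb = F.adaptive_avg_pool2d(f_rgb, self.pooled_size)
        x_d = F.adaptive_avg_pool2d(f_d, self.pooled_size)
        # Flatten to token sequences and add position embeddings.
        x_rgb = x_rgb.flatten(2).transpose(1, 2) + self.pos_rgb
        x_d = x_d.flatten(2).transpose(1, 2) + self.pos_d
        # Self-attention: Q, K, V from the same modality.
        z_sa_rgb, _ = self.sa_rgb(x_rgb, x_rgb, x_rgb)
        z_sa_d, _ = self.sa_d(x_d, x_d, x_d)
        # Cross-attention: the query comes from the other modality (e.g. Q_D with K_RGB, V_RGB).
        z_ca_rgb, _ = self.ca_rgb(x_d, x_rgb, x_rgb)
        z_ca_d, _ = self.ca_d(x_rgb, x_d, x_d)
        # Combine attention outputs and apply the MLP (combination by summation is an assumption).
        out_rgb = self.mlp_rgb(z_sa_rgb + z_ca_rgb)
        out_d = self.mlp_d(z_sa_d + z_ca_d)
        # Reshape to C x 8 x 8, up-sample with bilinear interpolation, and add to each branch.
        out_rgb = out_rgb.transpose(1, 2).reshape(b, c, self.pooled_size, self.pooled_size)
        out_d = out_d.transpose(1, 2).reshape(b, c, self.pooled_size, self.pooled_size)
        f_rgb = f_rgb + F.interpolate(out_rgb, size=(h, w), mode="bilinear", align_corners=False)
        f_d = f_d + F.interpolate(out_d, size=(h, w), mode="bilinear", align_corners=False)
        return f_rgb, f_d
```

In the dual-branch Backbone, such a block would sit after a pair of corresponding feature extraction stages of the RGB and D branches and return the enhanced feature maps of both branches.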
Further, the loss function of the Head portion is a bounding box regression loss function. Wherein the bounding box regression loss function comprises: the sum of the bounding box regression loss, confidence loss, classification loss, and mask regression loss.
Specifically, the Head part introduces a new bounding-box regression loss function, SIoU, which replaces the original CIoU loss function and improves the convergence speed of model training and the inference accuracy. Compared with CIoU, SIoU takes the angle into account and is composed of an angle loss, a distance loss, a shape loss and an overlap loss. The overall loss function includes the sum of the bounding-box regression loss, the confidence loss, the classification loss and the mask regression loss.
In the distance loss function Δ, (b_cx^gt, b_cy^gt) are the center coordinates of the ground-truth box and (b_cx, b_cy) are the center coordinates of the predicted box; C_w and C_h are the width and height of the minimum enclosing rectangle of the ground-truth box and the predicted box, and ρ_x and ρ_y represent the distances between the center coordinates of the two boxes in the x and y directions. In the angle loss function, c_w and c_h are the horizontal and vertical distances between the center points of the ground-truth box and the predicted box, and σ represents the distance between the two center points. In the shape loss function Ω, (w, h) and (w^gt, h^gt) are the width and height of the predicted box and the ground-truth box respectively, and θ controls the degree of attention paid to the shape loss. K represents an output feature map, S^2 and N respectively represent the number of image grid cells in the prediction process and the number of predicted boxes in each grid cell, the coefficient M_kij^obj indicates whether the j-th predicted box of the i-th grid cell of the K-th output feature map is a positive sample, BCE_sig^obj represents a binary cross-entropy loss function, and w_obj and w_cls represent the weights of the positive samples. x_p and x_gt represent the prediction vector and the ground-truth vector respectively; α_box, α_obj, α_cls and α_seg represent the weights of the position error, confidence error, classification error and segmentation error respectively. The segmentation loss function L_seg uses binary cross entropy, where P is the h×w×k matrix of prototype masks and C is the n×k matrix of mask coefficients for the n instances retained by NMS and thresholding; σ here represents the sigmoid function, and the predicted mask M_p is combined with the ground-truth mask M_gt and passed into the binary cross-entropy calculation.
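Collecting the weights named above, the overall loss takes the weighted-sum form sketched below; this is a reconstruction from the description, and the per-term definitions (SIoU bounding-box loss, confidence, classification and mask terms) are not reproduced here.

$$
L = \alpha_{box}\, L_{box} + \alpha_{obj}\, L_{obj} + \alpha_{cls}\, L_{cls} + \alpha_{seg}\, L_{seg}
$$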
Further, after identifying the image units of the plurality of objects to be identified in the photographed object, the method further includes:
step S310, dividing the image unit according to the identification result to obtain a plurality of image data of the object to be identified.
Step S320, the sizes of the image data of the plurality of objects to be identified are adjusted to a preset size.
Step S330, obtaining the type and state information of the object to be identified based on the image data of the object to be identified with the preset size.
In one embodiment of the invention, mAP0.5 and mAP0.5:0.05:0.95 are used to evaluate the segmentation performance of the model. mAP requires Precision and Recall for calculation.
Here, a true positive (TP) means that the IoU between the predicted mask and the ground truth is greater than the prescribed threshold, a false positive (FP) means that the IoU between the predicted mask and the ground truth does not satisfy the prescribed threshold, and a false negative (FN) means that there is no intersection between the predicted mask and the ground truth. The mAP is calculated as follows:
where AP represents the area under the Precision-Recall curve of each class, and mAP0.5 represents the average of the APs of all classes when the IoU threshold is set to 0.5. mAP0.5:0.95 averages the value over the IoU thresholds 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9 and 0.95, which is obviously much stricter than mAP0.5; both values range from 0 to 1.
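A minimal sketch of this evaluation logic; the per-class AP computation (matching, confidence sorting and precision-recall integration) is abstracted behind a hypothetical `average_precision` callable.

```python
import numpy as np

IOU_THRESHOLDS = (0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95)

def precision_recall(tp, fp, fn):
    """Precision and Recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

def mean_ap(average_precision, class_ids, iou_thresholds=IOU_THRESHOLDS):
    """mAP0.5:0.95: average the per-class AP over the IoU thresholds 0.5, 0.55, ..., 0.95.

    `average_precision(cls, thr)` is a hypothetical callable returning the AP of one class
    at one IoU threshold; mAP0.5 corresponds to iou_thresholds=(0.5,).
    """
    per_threshold = [
        np.mean([average_precision(cls, thr) for cls in class_ids])
        for thr in iou_thresholds
    ]
    return float(np.mean(per_threshold))
```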
TABLE 1
As can be seen from Table 1, the mAP0.5 and mAP0.5:0.05:0.95 indices of the proposed method on the power distribution cabinet components are higher than those of the other algorithms.
As can be seen from fig. 5a to fig. 5l, testing on the visible light images and corresponding depth images of 3 different shooting angles of the power distribution cabinet components shows that all of the networks produce good segmentation results in the well-lit scene. In the low-light scene, however, the RGB model under-segments the rotary switch knob (ee_rot_swch_hdl), the load switch handle (ee_load_swchl_hdl) and the door handle (ee_gt_lk_hdl). The CBAM model can segment the door handle, but the rotary switch knob and the load switch handle are still not segmented, and the CFT model performs similarly. The image recognition method based on cross-modal feature fusion of the present invention avoids these problems.
The method solves the recognition and operation difficulties faced by a power distribution cabinet robot in low-light or night-time scenes. The newly designed Transformer modules are densely inserted into a dual-stream network framework; the self-attention mechanism, together with the cross-attention mechanism, captures the relations between different positions in a sequence, which are used to rely on and integrate global context information so that the spatial relations of the components can be understood accurately. By considering the global information around the components, the robot can accurately locate and segment the targets. Finally, the SIoU loss function replaces the original CIoU loss function; on the basis of CIoU, SIoU takes the angle into account, comprising an angle loss, a distance loss, a shape loss and an overlap loss, which improves the training convergence speed and segmentation precision of the model. The model consumes few resources and is easy to deploy on edge devices.
Accordingly, referring to fig. 6, a second aspect of the embodiment of the present invention provides an image recognition system based on cross-modal feature fusion, including:
an image acquisition module 1 for acquiring an RGB image and a depth image of a photographic subject;
The image recognition module 2 is used for recognizing RGB images and depth images based on the cross-modal feature fusion model, recognizing image units of a plurality of targets to be recognized in the shooting object, and acquiring the types and state information of the targets to be recognized according to the image units of the targets to be recognized;
the cross-modal feature fusion model is used for performing feature extraction on the RGB image and the depth image to obtain features of the RGB image and the depth image at multiple levels, and for fusing the complementary semantic information between the RGB and depth image features with a self-attention mechanism, a cross-attention mechanism and a multi-head attention mechanism, so as to fuse the features of multiple scales step by step.
Further, referring to fig. 7, the image recognition module 2 includes:
an image dividing unit 21 for dividing the image unit according to the recognition result to obtain image data of a plurality of objects to be recognized;
an image adjustment unit 22 for adjusting the sizes of the image data of the plurality of objects to be identified to a preset size;
an information acquisition unit 23 for acquiring the kind and state information of the object to be recognized based on the image data of the object to be recognized of a preset size.
Further, referring to fig. 8, the image recognition system based on cross-modal feature fusion further includes: model training module 3, model training module 3 includes:
A historical data acquisition unit 31 for acquiring historical image data of the photographed object under various shooting conditions, the historical image data including: a plurality of historical RGB images of the photographed object and the corresponding historical depth images;
the model recognition training unit 32 is configured to perform recognition training of the target to be recognized on the cross-modal feature fusion model based on historical image data of a preset proportion.
Further, the cross-modal feature fusion model includes: a Backbone part, a Neck part and a Head part;
the Backbone part receives an RGB image and a depth image respectively; the convolution modules extract the features of the RGB image and the depth image at multiple scales, the feature fusion modules perform feature fusion to obtain feature maps of multiple scales, and the feature maps are sent to the Neck part through channel attention modules respectively;
the Neck part performs scale-wise extraction and fusion on the features output by the channel attention modules, and sends the fused features to the Head part;
the Head part determines a segmented region of the object to be identified according to the features.
Further, the Backbone part includes a first branch that receives the RGB image and a second branch that receives the depth image;
The first branch and the second branch are respectively provided with a plurality of corresponding picture feature extraction units, and each picture feature extraction unit includes: a Conv module, a C3 module and/or an SPPF module;
the first branch and the second branch are provided with TransSACA modules corresponding to the picture feature extraction units; each TransSACA module receives the features extracted by the corresponding picture feature extraction units of the first branch and the second branch, fuses these features, and sends the fused features back to the respective branches.
Further, the TransSACA module adopts a multi-modal feature fusion mechanism: its first input is an RGB image convolution feature map and its second input is a D image convolution feature map; the RGB image convolution feature map and the D image convolution feature map are each flattened and rearranged into matrix sequences, and the input sequences of the Transformer module are obtained after position embeddings are added;
based on the input sequences of the Transformer module, the self-attention mechanism uses the similarity of Q_RGB and K_RGB to calculate attention weights, which are multiplied by V_RGB to obtain the output Z_saRGB (Z_saD is obtained analogously); the cross-attention mechanism uses the similarity of Q_D and K_RGB to calculate attention weights, which are multiplied by V_RGB to obtain the output Z_caRGB (Z_caD is obtained analogously);
Processing is then performed by a multi-layer perceptron model comprising two fully connected feed-forward layers with a GELU activation function in between, which calculates the outputs X_OUT^RGB and X_OUT^D; X_OUT^RGB and X_OUT^D have the same dimensions as the input sequences, and the outputs are reshaped into C×H×W feature maps F_OUT^RGB and F_OUT^D and fed back into each individual modality branch by element-wise summation with the existing feature maps.
Further, the loss function of the Head part is a bounding box regression loss function;
wherein the bounding box regression loss function comprises: the sum of the bounding box regression loss, confidence loss, classification loss, and mask regression loss.
Accordingly, a third aspect of the embodiments of the present invention provides an electronic device, including: at least one processor; and a memory coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to cause the at least one processor to perform the image recognition method based on cross-modal feature fusion described above.
Accordingly, a fourth aspect of embodiments of the present invention provides a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the above-described cross-modality feature fusion-based image recognition method.
The embodiments of the invention aim to protect an image recognition method and system based on cross-modal feature fusion, wherein the method comprises the following steps: acquiring an RGB image and a depth image of a photographed object; recognizing the RGB image and the depth image based on the cross-modal feature fusion model, identifying the image units of a plurality of targets to be recognized in the photographed object, and acquiring the type and state information of the targets to be recognized according to their image units. The cross-modal feature fusion model performs feature extraction on the RGB image and the depth image to obtain features of the RGB image and the depth image at multiple levels, and uses a self-attention mechanism, a cross-attention mechanism and a multi-head attention mechanism to fuse the complementary semantic information between the RGB and depth image features, fusing the features of multiple scales step by step. The technical scheme has the following effects:
In order to meet the requirement of segmenting power distribution cabinet components in a dynamic environment, a depth image captured by a depth camera is introduced as a second modality; the features of each modality at different scales are extracted based on a CNN, and complementary fusion between the different modalities is carried out through a Transformer module, which improves the perceptibility, reliability and robustness of the target positioning and segmentation algorithm. The model is small in structure and low in resource consumption, and is therefore easy to deploy on edge devices.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims (10)

1. An image recognition method based on cross-modal feature fusion is characterized by comprising the following steps:
acquiring an RGB image and a depth image of a shooting object;
identifying the RGB image and the depth image based on the cross-modal feature fusion model, identifying a plurality of image units of targets to be identified in the shooting object, and acquiring the type and state information of the targets to be identified according to the image units of the targets to be identified;
the cross-modal feature fusion model performs feature extraction on the RGB image and the depth image, obtains features of multiple levels of the RGB image and the depth image, fuses complementary semantic information between the RGB image and the depth image features by using a self-attention mechanism, a staggered attention mechanism and a multi-head attention mechanism, and fuses the features of multiple scales step by step.
2. The method for identifying images based on cross-modal feature fusion according to claim 1, further comprising, before identifying the RGB image and the depth image based on the cross-modal feature fusion model:
acquiring historical image data of the shooting object under various shooting conditions, wherein the historical image data comprises: a plurality of historical RGB images of the shooting object and corresponding historical depth images;
and performing, based on a preset proportion of the historical image data, recognition training of the target to be recognized on the cross-modal feature fusion model.
3. The method for identifying images based on cross-modal feature fusion as claimed in claim 2, wherein,
the cross-modal feature fusion model comprises: a Backbone portion, a Neck portion, and a Head portion;
the Backbone part receives the RGB image and the depth image respectively, extracts features of the RGB image and the depth image at a plurality of scales through convolution modules, obtains feature maps of a plurality of scales after feature fusion through a plurality of corresponding feature fusion modules, and sends the feature maps to the Neck part through respective channel attention modules;
the Neck part performs scale extraction and fusion processing on the features output by the channel attention modules, and sends the fused features to the Head part;
and the Head part determines the segmentation area of each target to be identified according to the features.
4. The method for identifying images based on cross-modal feature fusion as claimed in claim 3,
the Backbone portion includes a first branch that receives the RGB image and a second branch that receives the depth image;
The first branch and the second branch are respectively provided with a plurality of corresponding picture feature extraction units, and the picture feature extraction units comprise: a Conv module, a C3 module and/or an SPPF module;
the first branch and the second branch are provided with a TransSACA module corresponding to the picture feature extraction unit, the TransSACA module respectively receives the features extracted from the picture feature extraction unit corresponding to the first branch and the second branch, and the features are respectively sent to the corresponding branch after feature fusion.
5. The method for identifying images based on cross-modal feature fusion as claimed in claim 4, wherein,
the TransSACA module adopts a multi-modal feature fusion mechanism: the first input is the RGB image convolution feature map and the second input is the depth (D) image convolution feature map; the RGB image convolution feature map and the D image convolution feature map are flattened and rearranged into matrix sequences respectively, and position embeddings are added to obtain the input sequences of the Transformer module;
based on the input sequences of the Transformer module, the self-attention mechanism uses the similarity of Q_RGB and K_RGB to calculate attention weights, which are then multiplied by V_RGB (and symmetrically for the depth branch) to obtain the outputs Z_saRGB and Z_saD; the cross-attention mechanism uses the similarity of Q_D and K_RGB to calculate attention weights, which are then multiplied by V_RGB (and symmetrically for the other direction) to obtain the outputs Z_caRGB and Z_caD;
the attention outputs are processed by a multi-layer perceptron consisting of a two-layer fully connected feed-forward network with a GELU activation function in between, yielding X_OUT_RGB and X_OUT_D, whose dimensions are the same as those of the input sequences; these outputs are reshaped into C×H×W feature maps F_OUT_RGB and F_OUT_D and fed back into the respective single-modality branches by element-wise summation with the existing feature maps.
6. The method for identifying images based on cross-modal feature fusion as claimed in claim 3,
the loss function of the Head part is a bounding box regression loss function;
wherein the bounding box regression loss function comprises: the sum of the bounding box regression loss, confidence loss, classification loss, and mask regression loss.
7. The cross-modal feature fusion-based image recognition method as claimed in any one of claims 1 to 6, wherein after the recognition of the image units of the plurality of objects to be recognized in the photographic subject, further comprising:
dividing the image unit according to the identification result to obtain image data of a plurality of targets to be identified;
adjusting the sizes of the image data of the plurality of targets to be identified to a preset size;
and acquiring the type and state information of the target to be identified based on the image data of the target to be identified with a preset size.
8. An image recognition system based on cross-modal feature fusion, comprising:
an image acquisition module for acquiring an RGB image and a depth image of a photographic subject;
the image recognition module is used for recognizing the RGB image and the depth image based on the cross-modal feature fusion model, recognizing image units of a plurality of targets to be recognized in the shooting object, and acquiring the type and state information of the targets to be recognized according to the image units of the targets to be recognized;
the cross-modal feature fusion model performs feature extraction on the RGB image and the depth image, obtains features of multiple levels of the RGB image and the depth image, fuses complementary semantic information between the RGB image and the depth image features by using a self-attention mechanism, a staggered attention mechanism and a multi-head attention mechanism, and fuses the features of multiple scales step by step.
9. An electronic device, comprising: at least one processor; and a memory coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the cross-modal feature fusion-based image recognition method of any of claims 1-7.
10. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the cross-modal feature fusion based image recognition method of any of claims 1-7.
CN202311063209.8A 2023-08-22 2023-08-22 Cross-modal feature fusion-based image recognition method and system Active CN117036891B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311063209.8A CN117036891B (en) 2023-08-22 2023-08-22 Cross-modal feature fusion-based image recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311063209.8A CN117036891B (en) 2023-08-22 2023-08-22 Cross-modal feature fusion-based image recognition method and system

Publications (2)

Publication Number Publication Date
CN117036891A true CN117036891A (en) 2023-11-10
CN117036891B CN117036891B (en) 2024-03-29

Family

ID=88624313

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311063209.8A Active CN117036891B (en) 2023-08-22 2023-08-22 Cross-modal feature fusion-based image recognition method and system

Country Status (1)

Country Link
CN (1) CN117036891B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021115159A1 (en) * 2019-12-09 2021-06-17 中兴通讯股份有限公司 Character recognition network model training method, character recognition method, apparatuses, terminal, and computer storage medium therefor
CN113989340A (en) * 2021-10-29 2022-01-28 天津大学 Point cloud registration method based on distribution
CN114419323A (en) * 2022-03-31 2022-04-29 华东交通大学 Cross-modal learning and domain self-adaptive RGBD image semantic segmentation method
CN114494215A (en) * 2022-01-29 2022-05-13 脉得智能科技(无锡)有限公司 Transformer-based thyroid nodule detection method
CN114693952A (en) * 2022-03-24 2022-07-01 安徽理工大学 RGB-D significance target detection method based on multi-modal difference fusion network
CN114973411A (en) * 2022-05-31 2022-08-30 华中师范大学 Self-adaptive evaluation method, system, equipment and storage medium for attitude motion
CN115713679A (en) * 2022-10-13 2023-02-24 北京大学 Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
CN115908789A (en) * 2022-12-09 2023-04-04 大连民族大学 Cross-modal feature fusion and asymptotic decoding saliency target detection method and device
WO2023056889A1 (en) * 2021-10-09 2023-04-13 百果园技术(新加坡)有限公司 Model training and scene recognition method and apparatus, device, and medium
CN116052108A (en) * 2023-02-21 2023-05-02 浙江工商大学 Transformer-based traffic scene small sample target detection method and device
CN116206133A (en) * 2023-04-25 2023-06-02 山东科技大学 RGB-D significance target detection method
CN116310396A (en) * 2023-02-28 2023-06-23 安徽理工大学 RGB-D significance target detection method based on depth quality weighting
CN116385761A (en) * 2023-01-31 2023-07-04 同济大学 3D target detection method integrating RGB and infrared information
CN116452805A (en) * 2023-04-15 2023-07-18 安徽理工大学 Transformer-based RGB-D semantic segmentation method of cross-modal fusion network


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FANG, Qingyun et al.: "Cross-Modality Fusion Transformer for Multispectral Object Detection", arXiv:2111.00273v4, pages 1-10 *
RU, Qingjun et al.: "Cross-Modal Transformer for RGB-D semantic segmentation of production workshop objects", Pattern Recognition, pages 1-12 *
WU, Zongwei et al.: "Transformer Fusion for Indoor RGB-D Semantic Segmentation", Computer Vision and Image Understanding, pages 1-15 *
ZHANG, Fei: "Research on person localization and tracking methods based on multi-modal data" (基于多模态数据的人物定位和跟踪方法研究), CNKI Dissertations and Theses (《CNKI学位》), vol. 2023, no. 02

Also Published As

Publication number Publication date
CN117036891B (en) 2024-03-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant