CN117036891A - Cross-modal feature fusion-based image recognition method and system - Google Patents
- Publication number: CN117036891A
- Application number: CN202311063209.8A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V10/811 — Fusion of classification results where the classifiers operate on different input data, e.g. multi-modal recognition
- G06V10/806 — Fusion of extracted features
- G06V10/80 — Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
- G06V10/40 — Extraction of image or video features
- G06V10/766 — Recognition using regression, e.g. by projecting features on hyperplanes
- G06V10/82 — Recognition using neural networks
- G06N3/045 — Combinations of networks
- G06N3/048 — Activation functions
- G06N3/0499 — Feedforward networks
- G06N3/08 — Learning methods
- Y02T10/40 — Engine management systems
Abstract
The invention discloses an image recognition method and system based on cross-modal feature fusion. The method comprises the following steps: acquiring an RGB image and a depth image of a shooting object; recognizing the RGB image and the depth image with a cross-modal feature fusion model, identifying the image units of a plurality of targets to be recognized in the shooting object, and acquiring the type and state information of each target to be recognized according to its image unit. The cross-modal feature fusion model extracts features at multiple levels from the RGB image and the depth image, fuses the complementary semantic information between the RGB and depth features using self-attention, cross (interleaved) attention and multi-head attention mechanisms, and fuses the multi-scale features step by step. By introducing a depth image captured by a depth camera as a second modality and performing recognition with the improved model, the method meets the target segmentation requirements for power distribution cabinet components in a dynamic environment.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to an image recognition method and system based on cross-modal feature fusion.
Background
The power distribution cabinet is a vital device in the power system: it distributes, controls and protects electric energy, ensures the safe operation of the power system, provides a stable and reliable power supply, and protects circuits and equipment from power faults. Robot technology plays an important role in operating the power distribution cabinet: through precise movement and control, robots perform tasks such as fault detection and component operation, greatly improving working efficiency and safety. Computer vision gives the robot enhanced sensing and recognition capability; with it, the robot can accurately recognize equipment, connectors and the like in the power distribution cabinet and acquire related data and image information, providing important guidance and support for robot operation. However, the working environment of the power distribution cabinet is usually closed and dynamic, and using a robot to replace a worker for daily operations requires the model to cope with the challenges of a dynamic environment, such as shadow occlusion, insufficient light, and low resolution. Under these conditions it is difficult for an algorithm that uses only visible light to achieve high precision, so a depth camera is used to provide both a visible-light image and a depth image; fusing the complementarity of the different modalities improves the perceptibility, reliability and robustness of the target positioning and segmentation algorithm.
With the development of convolutional neural networks, CNN-based two-stream networks have emerged for target detection and segmentation. In past work, modal fusion mechanisms, however designed, were implemented on convolutional neural networks, for example the RGBD image semantic segmentation method based on cross-modal learning and domain adaptation (CN114419323A). A CNN has strong representation and learning capability within a single modality: through convolution operations it effectively captures local features in the input data, handles spatially structured data such as images well, and offers local perception and multi-level feature representation. Compared with a CNN, a Transformer model is based on global attention and the correlation between keys and queries, so it can model long-range dependencies and capture global information. Combining a convolutional neural network with a Transformer allows local and global information to be considered simultaneously, extracts strong feature representations, and addresses the long-range dependency problem; for example, the abdominal CT image multi-organ segmentation method, device and terminal equipment (CN116030259A) combines a CNN with a Transformer and applies them to the field of target detection.
However, among the above solutions, the RGBD image semantic segmentation method based on cross-modal learning and domain adaptation (CN114419323A) uses a convolutional neural network model with good segmentation performance, but the locality of the convolution operation makes it difficult for the model to learn long-range dependencies in the image beyond the receptive field, which limits its ability to handle details such as texture, shape and size changes. Convolutional neural networks may therefore face challenges in image tasks with long-range dependencies and may be unable to capture the global features and context information of the image.
The abdominal CT image multi-organ segmentation method, device and terminal equipment (CN116030259A) uses a vision-Transformer-based model that can model the global image information through a self-attention mechanism, and the overall model improves target segmentation precision through its multi-scale global semantic feature extraction capability. However, the single-modality Transformer model is applied in a single, fixed scene and has limitations when segmenting targets in a real scene: the real-world environment is usually open and dynamic, with shadow occlusion, over- and under-exposure, and low resolution, under which a single-modality segmentation algorithm struggles to reach high segmentation accuracy.
To meet the requirement of segmenting power distribution cabinet components in a dynamic environment, a depth image captured by a depth camera is introduced as a second modality: a CNN extracts the features of each modality at different scales, and a Transformer module performs complementary fusion between the modalities. This improves the perceptibility, reliability and robustness of the target positioning and segmentation algorithm, while the model has a small structure and low resource consumption and is easy to deploy on edge devices.
Disclosure of Invention
Embodiments of the present invention aim to provide an image recognition method and system based on cross-modal feature fusion that meet the requirement of segmenting power distribution cabinet components in a dynamic environment. A depth image captured by a depth camera is introduced as a second modality, a CNN extracts the features of each modality at different scales, and a Transformer module then performs complementary fusion between the modalities; the resulting model has a small structure and low resource consumption and is easy to deploy on edge devices.
In order to solve the technical problem, a first aspect of the embodiments of the present invention provides an image recognition method based on cross-modal feature fusion, including the following steps:
Acquiring an RGB image and a depth image of a shooting object;
recognizing the RGB image and the depth image based on the cross-modal feature fusion model, identifying the image units of a plurality of targets to be recognized in the shooting object, and acquiring the type and state information of each target to be recognized according to its image unit;
the cross-modal feature fusion model performs feature extraction on the RGB image and the depth image to obtain features at multiple levels, fuses the complementary semantic information between the RGB and depth features using self-attention, cross (interleaved) attention and multi-head attention mechanisms, and fuses the multi-scale features step by step.
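The attention-based fusion described above can be illustrated with a minimal numpy sketch; this is not the patent's implementation, and the token count, embedding size and random features are invented for demonstration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
n_tokens, d = 16, 32                          # illustrative sequence length and width
X_rgb = rng.standard_normal((n_tokens, d))    # flattened RGB feature tokens
X_d   = rng.standard_normal((n_tokens, d))    # flattened depth feature tokens

# Self-attention: queries, keys and values all come from the same modality.
Z_sa_rgb = attention(X_rgb, X_rgb, X_rgb)
# Cross (interleaved) attention: depth queries attend to RGB keys/values,
# letting complementary semantic information flow between the modalities.
Z_ca_rgb = attention(X_d, X_rgb, X_rgb)

print(Z_sa_rgb.shape, Z_ca_rgb.shape)
```

A multi-head version would split the width `d` into several subspaces, run this attention once per head, and concatenate the results.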
Further, before the identifying the RGB image and the depth image based on the cross-modal feature fusion model, the method further includes:
acquiring historical image data of the shooting object under various shooting conditions, wherein the historical image data comprises: a plurality of historical RGB images of the shooting object and corresponding historical depth images;
and based on the historical image data of a preset proportion, performing recognition training on the target to be recognized on the cross-modal feature fusion model.
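The "preset proportion" step above can be sketched as a simple train/validation split over paired RGB/depth samples; the file names and the 80/20 ratio are illustrative assumptions, not values from the patent:

```python
import random

def split_dataset(pairs, train_ratio=0.8, seed=42):
    """Split (rgb, depth) sample pairs into training and validation sets."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)        # deterministic shuffle for reproducibility
    k = int(len(pairs) * train_ratio)
    return pairs[:k], pairs[k:]

data = [(f"rgb_{i}.png", f"depth_{i}.png") for i in range(10)]
train, val = split_dataset(data)
print(len(train), len(val))
```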
Further, the cross-modal feature fusion model includes: a Backbone part, a Neck part, and a Head part;
the Backbone part receives the RGB image and the depth image respectively, extracts features at multiple scales from each through convolution modules, obtains multi-scale feature maps after fusion through a plurality of corresponding feature fusion modules, and sends the feature maps to the Neck part through respective channel attention modules;
the Neck part extracts and fuses the multi-scale features output by the channel attention modules, and sends the fused features to the Head part;
and the Head part determines the segmentation area of the object to be identified according to the characteristics.
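The channel attention module is not specified further here; a squeeze-and-excitation style sketch (the random weights and reduction ratio are placeholders, not the patent's parameters) shows the typical way channels are reweighted before the features reach the Neck:

```python
import numpy as np

def channel_attention(F, reduction=4):
    """Squeeze-and-excitation style channel attention over a C x H x W feature map."""
    C = F.shape[0]
    rng = np.random.default_rng(1)
    W1 = rng.standard_normal((C // reduction, C)) * 0.1   # squeeze weights (illustrative)
    W2 = rng.standard_normal((C, C // reduction)) * 0.1   # excite weights (illustrative)
    s = F.mean(axis=(1, 2))                 # global average pool -> (C,)
    h = np.maximum(W1 @ s, 0)               # ReLU
    w = 1.0 / (1.0 + np.exp(-(W2 @ h)))     # sigmoid channel weights in (0, 1)
    return F * w[:, None, None]             # reweight each channel

F = np.ones((8, 4, 4))
out = channel_attention(F)
print(out.shape)
```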
Further, the Backbone part includes a first branch that receives the RGB image and a second branch that receives the depth image;
the first branch and the second branch are respectively provided with a plurality of corresponding picture feature extraction units, and the picture feature extraction units comprise: a Conv module, a C3 module and/or an SPPF module;
the first branch and the second branch are provided with a TransSACA module corresponding to the picture feature extraction unit, the TransSACA module respectively receives the features extracted from the picture feature extraction unit corresponding to the first branch and the second branch, and the features are respectively sent to the corresponding branch after feature fusion.
Further, the TransSACA module adopts a multi-modal feature fusion mechanism: its first input is the RGB image convolution feature map and its second input is the depth (D) image convolution feature map; the two feature maps are each flattened and reshaped into matrix sequences, and position embeddings are added to obtain the input sequences of the Transformer module;
based on the input sequences of the Transformer module, the self-attention mechanism computes attention weights from Q_RGB and K_RGB and multiplies them by V_RGB (and symmetrically for the depth branch) to obtain the outputs Z_saRGB and Z_saD, while the cross-attention mechanism computes attention weights from Q_D and K_RGB and multiplies them by V_RGB (and symmetrically with the roles of the modalities exchanged) to obtain the outputs Z_caRGB and Z_caD;
the sequences are then processed by a multi-layer perceptron consisting of a two-layer fully connected feed-forward network with a GELU activation in between, producing outputs X_OUT_RGB and X_OUT_D whose dimensions match the input sequences; the outputs are reshaped into C × H × W feature maps F_OUT_RGB and F_OUT_D and fed back into each individual modality branch by element-wise summation with the existing feature maps.
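Assuming the description above follows the usual flatten → attend → feed-forward → reshape pipeline, the RGB side of the TransSACA flow can be sketched in numpy; the dimensions, weights and position embedding are illustrative placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def gelu(x):  # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

C, H, W = 8, 4, 4
rng = np.random.default_rng(0)
F_rgb = rng.standard_normal((C, H, W))            # RGB convolution feature map
F_d   = rng.standard_normal((C, H, W))            # depth convolution feature map
pos   = rng.standard_normal((H * W, C)) * 0.02    # position embedding (learned in practice)

# 1. Flatten C x H x W into an (H*W) x C token sequence and add the position embedding.
X_rgb = F_rgb.reshape(C, H * W).T + pos
X_d   = F_d.reshape(C, H * W).T + pos

# 2. Self-attention inside the RGB branch (Q, K, V all from RGB).
Z_sa = attention(X_rgb, X_rgb, X_rgb)
# 3. Cross-attention: depth queries attend to RGB keys/values.
Z_ca = attention(X_d, X_rgb, X_rgb)

# 4. Two-layer fully connected feed-forward network with GELU in between.
W1 = rng.standard_normal((C, 2 * C)) * 0.1
W2 = rng.standard_normal((2 * C, C)) * 0.1
X_out = gelu((Z_sa + Z_ca) @ W1) @ W2             # same shape as the input sequence

# 5. Reshape back to C x H x W and fuse into the branch by element summation.
F_out = X_out.T.reshape(C, H, W)
F_fused = F_rgb + F_out
print(F_fused.shape)
```

The depth branch would mirror these steps with the modality roles exchanged.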
Further, the loss function of the Head part is a composite loss;
wherein the composite loss is the sum of the bounding box regression loss, the confidence loss, the classification loss, and the mask regression loss.
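The exact form of each term is not given here (detection heads of this kind often use an IoU-based box loss and binary cross-entropy for the rest); a toy sketch with an L1 box term and BCE placeholders shows how the four losses could be summed:

```python
import numpy as np

def bce(p, y, eps=1e-7):
    # Binary cross-entropy, averaged over elements.
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def box_l1(pred, target):
    # Placeholder box regression term (a real model might use CIoU instead).
    return float(np.abs(pred - target).mean())

# Illustrative predictions/targets for a single anchor.
box_loss  = box_l1(np.array([0.1, 0.1, 0.5, 0.5]), np.array([0.0, 0.0, 0.5, 0.5]))
conf_loss = bce(np.array([0.9]), np.array([1.0]))
cls_loss  = bce(np.array([0.8, 0.1]), np.array([1.0, 0.0]))
mask_loss = bce(np.full(16, 0.7), np.ones(16))

total = box_loss + conf_loss + cls_loss + mask_loss
print(round(total, 4))
```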
Further, after identifying the image units of the plurality of targets to be recognized in the shooting object, the method further includes:
segmenting the image units according to the recognition result to obtain image data of the plurality of targets to be recognized;
resizing the image data of the plurality of targets to be recognized to a preset size;
and acquiring the type and state information of each target to be recognized based on its image data at the preset size.
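The segmentation-then-resize step can be sketched with a crop plus a nearest-neighbour resize; the box coordinates and the 8×8 preset size are illustrative:

```python
import numpy as np

def crop_and_resize(img, box, size):
    """Crop an (H, W, C) image to box = (y0, y1, x0, x1), then resize to size x size."""
    y0, y1, x0, x1 = box
    crop = img[y0:y1, x0:x1]
    h, w = crop.shape[:2]
    ys = (np.arange(size) * h // size).clip(0, h - 1)   # nearest source rows
    xs = (np.arange(size) * w // size).clip(0, w - 1)   # nearest source cols
    return crop[ys][:, xs]

img = np.arange(20 * 30 * 3).reshape(20, 30, 3)         # stand-in for a camera frame
unit = crop_and_resize(img, (2, 12, 5, 25), 8)          # one recognized component region
print(unit.shape)
```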
Accordingly, a second aspect of the embodiments of the present invention provides an image recognition system based on cross-modal feature fusion, including:
an image acquisition module for acquiring an RGB image and a depth image of a photographic subject;
the image recognition module is used for recognizing the RGB image and the depth image based on the cross-modal feature fusion model, identifying the image units of a plurality of targets to be recognized in the shooting object, and acquiring the type and state information of each target to be recognized according to its image unit;
the cross-modal feature fusion model performs feature extraction on the RGB image and the depth image to obtain features at multiple levels, fuses the complementary semantic information between the RGB and depth features using self-attention, cross (interleaved) attention and multi-head attention mechanisms, and fuses the multi-scale features step by step.
Further, the image recognition system based on cross-modal feature fusion further comprises: a model training module, the model training module comprising:
a history data acquiring unit configured to acquire history image data of the photographic subject under various photographic conditions, the history image data including: a plurality of historical RGB images of the shooting object and corresponding historical depth images;
the model identification training unit is used for carrying out identification training on the target to be identified on the cross-modal feature fusion model based on the historical image data of the preset proportion.
Further, the cross-modal feature fusion model includes: a Backbone part, a Neck part, and a Head part;
the Backbone part receives the RGB image and the depth image respectively, extracts features at multiple scales from each through convolution modules, obtains multi-scale feature maps after fusion through a plurality of corresponding feature fusion modules, and sends the feature maps to the Neck part through respective channel attention modules;
the Neck part extracts and fuses the multi-scale features output by the channel attention modules, and sends the fused features to the Head part;
And the Head part determines the segmentation area of the object to be identified according to the characteristics.
Further, the Backbone part includes a first branch that receives the RGB image and a second branch that receives the depth image;
the first branch and the second branch are respectively provided with a plurality of corresponding picture feature extraction units, and the picture feature extraction units comprise: a Conv module, a C3 module and/or an SPPF module;
the first branch and the second branch are provided with a TransSACA module corresponding to the picture feature extraction unit, the TransSACA module respectively receives the features extracted from the picture feature extraction unit corresponding to the first branch and the second branch, and the features are respectively sent to the corresponding branch after feature fusion.
Further, the TransSACA module adopts a multi-modal feature fusion mechanism: its first input is the RGB image convolution feature map and its second input is the depth (D) image convolution feature map; the two feature maps are each flattened and reshaped into matrix sequences, and position embeddings are added to obtain the input sequences of the Transformer module;
based on the input sequences of the Transformer module, the self-attention mechanism computes attention weights from Q_RGB and K_RGB and multiplies them by V_RGB (and symmetrically for the depth branch) to obtain the outputs Z_saRGB and Z_saD, while the cross-attention mechanism computes attention weights from Q_D and K_RGB and multiplies them by V_RGB (and symmetrically with the roles of the modalities exchanged) to obtain the outputs Z_caRGB and Z_caD;
the sequences are then processed by a multi-layer perceptron consisting of a two-layer fully connected feed-forward network with a GELU activation in between, producing outputs X_OUT_RGB and X_OUT_D whose dimensions match the input sequences; the outputs are reshaped into C × H × W feature maps F_OUT_RGB and F_OUT_D and fed back into each individual modality branch by element-wise summation with the existing feature maps.
Further, the loss function of the Head part is a composite loss;
wherein the composite loss is the sum of the bounding box regression loss, the confidence loss, the classification loss, and the mask regression loss.
Further, the image recognition module includes:
the image segmentation unit is used for segmenting the image unit according to the identification result to obtain image data of a plurality of targets to be identified;
the image adjusting unit is used for adjusting the sizes of the image data of the plurality of targets to be identified to a preset size;
an information acquisition unit for acquiring the type and state information of each target to be recognized based on its image data at the preset size.
Accordingly, a third aspect of the embodiments of the present invention provides an electronic device, including: at least one processor; and a memory coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the above image recognition method based on cross-modal feature fusion.
Accordingly, a fourth aspect of embodiments of the present invention provides a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the above-described cross-modality feature fusion-based image recognition method.
The technical scheme provided by the embodiment of the invention has the following beneficial technical effects:
to meet the requirement of segmenting power distribution cabinet components in a dynamic environment, a depth image captured by a depth camera is introduced as a second modality: a CNN extracts the features of each modality at different scales, and a Transformer module performs complementary fusion between the modalities. This improves the perceptibility, reliability and robustness of the target positioning and segmentation algorithm, while the model has a small structure and low resource consumption and is easy to deploy on edge devices.
Drawings
FIG. 1 is a flowchart of an image recognition method based on cross-modal feature fusion provided by an embodiment of the invention;
fig. 2 is a flowchart for identifying components of a power distribution cabinet according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a cross-modal feature fusion model provided by an embodiment of the present invention;
FIG. 4 is a flow chart of a TransSACA module provided by an embodiment of the present invention;
FIG. 5a is a segmentation result at viewing angle 1 using RGB-only input (prior art);
FIG. 5b is a segmentation result at viewing angle 2 using RGB-only input (prior art);
FIG. 5c is a segmentation result at viewing angle 3 using RGB-only input (prior art);
FIG. 5d is a segmentation result at viewing angle 1 using the CBAM method (prior art);
FIG. 5e is a segmentation result at viewing angle 2 using the CBAM method (prior art);
FIG. 5f is a segmentation result at viewing angle 3 using the CBAM method (prior art);
FIG. 5g is a segmentation result at viewing angle 1 using the CFT method (prior art);
FIG. 5h is a segmentation result at viewing angle 2 using the CFT method (prior art);
FIG. 5i is a segmentation result at viewing angle 3 using the CFT method (prior art);
FIG. 5j is a segmentation result at viewing angle 1 using the cross-modal feature fusion method of the present invention;
FIG. 5k is a segmentation result at viewing angle 2 using the cross-modal feature fusion method of the present invention;
FIG. 5l is a segmentation result at viewing angle 3 using the cross-modal feature fusion method of the present invention;
FIG. 6 is a block diagram of an image recognition system based on cross-modality feature fusion provided by an embodiment of the present invention;
FIG. 7 is a block diagram of an image recognition module provided by an embodiment of the present invention;
FIG. 8 is a block diagram of a model training module provided by an embodiment of the present invention.
Reference numerals:
1. image acquisition module; 2. image recognition module; 21. image segmentation unit; 22. image adjustment unit; 23. information acquisition unit; 3. model training module; 31. historical data acquisition unit; 32. model recognition training unit.
Detailed Description
The objects, technical solutions and advantages of the present invention will become more apparent by the following detailed description of the present invention with reference to the accompanying drawings. It should be understood that the description is only illustrative and is not intended to limit the scope of the invention. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the present invention.
Referring to fig. 1 and 2, a first aspect of the embodiment of the present invention provides an image recognition method based on cross-modal feature fusion, which includes the following steps:
Step S100, an RGB image and a depth image of a photographing object (i.e., a power distribution cabinet) are acquired.
In an alternative form of the embodiment of the present invention, the identifiable power distribution cabinet components include: touch screen, temperature controller, emergency stop (scram) switch, red self-locking switch, yellow self-locking switch, rocker switch, high-voltage terminal board, air switch handle, air switch base, changeover switch handle, changeover switch base, load switch 1 handle, load switch 1 base, load switch 2 handle, load switch 2 base, door lock handle, lock cylinder, voltmeter, indicator light, rotary switch handle, rotary switch base, green self-reset switch, white self-reset switch.
Step S300, identifying RGB images and depth images based on the cross-modal feature fusion model, identifying image units of a plurality of targets to be identified (namely power distribution cabinet components) in the shooting object, and acquiring the types and state information of the targets to be identified according to the image units of the targets to be identified.
The cross-modal feature fusion model is used for carrying out feature extraction on the RGB image and the depth image, obtaining features of multiple levels of the RGB image and the depth image, and utilizing a self-attention mechanism, a staggered attention mechanism and a multi-head attention mechanism to fuse complementary semantic information between the features of the RGB image and the depth image, so as to fuse the features of multiple scales step by step.
Further, before identifying the RGB image and the depth image based on the cross-modal feature fusion model in step S300, the method further includes:
Step S210, acquiring historical image data of the shooting object under various shooting conditions, where the historical image data includes: several historical RGB images of the shooting object and corresponding historical depth images.
Step S220, based on historical image data of a preset proportion, recognition training of the target to be recognized is conducted on the cross-modal feature fusion model.
A large number of historical photographs of the power distribution cabinet components are acquired in advance and preprocessed into input pictures of uniform size to construct a data set; the historical photographs are randomly shuffled and split 8:1:1 into a training set, a validation set, and a test set, and the model is trained. The dual-modality branches extract image features of the RGB and D modalities at different scales. A multi-scale Transformer segmentation model is built to fuse images at different scales, and the output feature map is fed back to the original branch so as to enhance the features of that branch. During model training and parameter tuning, the loss function is optimized: the SIoU loss function replaces the original CIoU loss function, which improves the training convergence speed and the segmentation accuracy. The model is trained according to a preset scheme to obtain the weight file of the converged model.
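The 8:1:1 random split described above can be sketched as follows; the file-name pattern, seed, and dataset size are illustrative assumptions, not details from this disclosure:

```python
import random

def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Randomly shuffle RGB-D sample pairs and split them 8:1:1
    into training, validation, and test subsets."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)       # reproducible shuffle
    n = len(shuffled)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# hypothetical RGB/depth file pairs
pairs = [(f"rgb_{i:04d}.png", f"depth_{i:04d}.png") for i in range(100)]
train, val, test = split_dataset(pairs)
```

Each RGB image stays paired with its depth image because the pair, not the individual file, is the unit being shuffled.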
Specifically, referring to fig. 3, the cross-modal feature fusion model includes: a Backbone portion, a Neck portion, and a Head portion. The Backbone portion receives the RGB image and the depth image respectively, extracts features of the RGB image and the depth image at multiple scales through convolution modules, obtains feature maps at multiple scales after feature fusion through a plurality of corresponding feature fusion modules, and sends the feature maps to the Neck portion through channel attention modules respectively; the Neck portion extracts the features output by the channel attention modules, performs fusion processing across scales, and sends the fused features to the Head portion; the Head portion determines the segmented region of the object to be identified according to the features.
Further, the Backbone portion includes a first branch that receives the RGB image and a second branch that receives the depth image; the first branch and the second branch are each provided with a plurality of corresponding picture feature extraction units, and each picture feature extraction unit includes: a Conv module, a C3 module and/or an SPPF module; the first branch and the second branch are provided with a TransSACA module corresponding to each picture feature extraction unit; the TransSACA module receives the features extracted by the corresponding picture feature extraction units of the first branch and the second branch respectively, and after fusion the features are sent back to the corresponding branches respectively.
The Conv module consists of a convolution layer, a normalization layer, and an activation function: local spatial information is extracted by the convolution operation, the feature value distribution is normalized by the BN layer, and nonlinear transformation capability is introduced by the activation function, thereby realizing conversion and extraction of the input features. The C3 module improves feature extraction by increasing the convolution depth and the receptive field. The SPPF module performs pooling operations of different sizes on the input feature maps to obtain a group of feature maps of different sizes, concatenates them, and reduces the dimension through a fully connected layer to obtain a feature vector of fixed size.
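The Conv module's pipeline (convolution, then batch normalization, then activation) can be sketched on a single-channel map; SiLU is used here as an illustrative activation and the valid-padding loop convolution is a simplification, not the patented implementation:

```python
import numpy as np

def conv_bn_silu(x, kernel, eps=1e-5):
    """Sketch of a Conv module: 2-D convolution (valid padding),
    batch normalization over the output map, then SiLU activation."""
    kh, kw = kernel.shape
    h, w = x.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):               # sliding-window convolution
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    out = (out - out.mean()) / np.sqrt(out.var() + eps)   # BN over one map
    return out * (1.0 / (1.0 + np.exp(-out)))             # SiLU: x * sigmoid(x)

x = np.random.default_rng(0).normal(size=(8, 8))
y = conv_bn_silu(x, np.ones((3, 3)) / 9.0)      # 3x3 mean filter as example kernel
```

A real Conv module learns the kernel and BN statistics per channel; this sketch only shows the order of the three operations.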
The Neck portion adopts an FPN feature pyramid: feature maps of different scales are fused through upsampling and downsampling operations to generate a multi-scale feature pyramid. Top-down, fusion of features at different levels is realized through upsampling and fusion with coarser-granularity feature maps; bottom-up, the feature maps at different levels are then fused through convolution layers.
Further, referring to fig. 4, the TransSACA module adopts a multi-modal feature fusion mechanism. The first input is the RGB image convolution feature map and the second input is the D image convolution feature map; each is flattened and reshaped into a matrix sequence, and the input sequence of the Transformer module is obtained after adding the position embedding. Based on the input sequence of the Transformer module, the self-attention mechanism uses the dot product of Q_RGB and K_RGB to calculate the attention weight, which is then multiplied by V_RGB to obtain the output Z_saRGB (and likewise Z_saD); the cross-attention mechanism uses the dot product of Q_D and K_RGB to calculate the attention weight, which is then multiplied by V_RGB to obtain the output Z_caRGB (and likewise Z_caD). Processing based on a multi-layer perceptron model then follows, comprising a two-layer fully connected feed-forward network with a GELU activation function in between, computing the outputs X_OUT^RGB and X_OUT^D, whose dimensions are the same as those of the input sequence; the outputs are reshaped into C×H×W feature mappings F_OUT^RGB and F_OUT^D and fed back into each individual modality branch using element-wise summation with the existing feature mappings.
Specifically, the TransSACA module is a multi-modal feature fusion mechanism that uses the self-attention and cross-attention of a Transformer to combine the global contexts of the RGB modality and the D modality, exploiting their complementarity. Each branch of the module receives as input a sequence of discrete tokens, each token comprising a feature vector representation; the feature vector is supplemented by a position code to incorporate a position-induced bias. As shown in the figure, F_IN^RGB ∈ R^(C×H×W) is the RGB image convolution feature map and F_IN^D ∈ R^(C×H×W) is the D image convolution feature map, where C denotes the number of channels, H the picture height, and W the picture width; they are obtained by convolutional feature extraction from the RGB map and the D map, respectively:
F_IN^RGB = Φ_RGB(I_RGB), F_IN^D = Φ_D(I_D);
where I_RGB and I_D are the input RGB map and D map respectively, and Φ_RGB and Φ_D are the convolution modules applied to generate the feature mappings of the input images of the different modalities. Each feature map is then flattened and reshaped into a matrix sequence; after adding the position embedding, the Transformer input sequences X_IN^RGB ∈ R^(HW×C) and X_IN^D ∈ R^(HW×C) are obtained. A set of queries, keys, and values (Q, K, and V) for RGB and D is computed using linear projections, for example Q_RGB = X_IN^RGB · W_RGB^Q, K_RGB = X_IN^RGB · W_RGB^K, V_RGB = X_IN^RGB · W_RGB^V (and similarly for D);
where W_RGB^Q, W_D^Q ∈ R^(C×D_Q), W_RGB^K, W_D^K ∈ R^(C×D_K), W_RGB^V, W_D^V ∈ R^(C×D_V) are weight matrices, and in this module D_Q = D_K = D_V = C. Each attention head uses the dot product of Q_(·) and K_(·) to calculate the attention weight, which is then multiplied by V_(·) to obtain the output Z_(·). For example, self-attention uses the dot product of Q_RGB and K_RGB to calculate the attention weight, which is then multiplied by V_RGB to obtain the output Z_saRGB; Z_saD is obtained similarly. Cross-attention uses the dot product of Q_D and K_RGB to calculate the attention weight, which is then multiplied by V_RGB to obtain the output Z_caRGB; Z_caD is obtained similarly.
Here the number of the elements is the number,is a scaling factor to prevent excessive results of dot product generation from softmaThe x-function produces a smaller gradient for controlling the magnitude and stability of the attention weights, and the multi-headed attention of self-attention and cross-attention can further improve the performance of the model by focusing differently on the features at different locations. The expression of multi-headed attention is as follows:
MultiHead(Q, K, V) = Concat(Z_1, …, Z_h) · W^O, where Z_i = Attention(Q · W_i^Q, K · W_i^K, V · W_i^V);
where h denotes the number of heads, Z_i denotes the attention output of the i-th head, and W^O, W_i^Q, W_i^K, W_i^V ∈ R^(C×C) are projection matrices.
More specifically, the self-attention module is used to build correlations inside a sequence, calculating the correlation of each element with the other elements of the sequence; here Q, K, and V come from the same input modality, which analyzes long-range dependencies and explores context information to further improve the modality-specific characteristics. Taking the input global feature X_IN^RGB as an example, the output self-attention global feature Z_saRGB can be expressed as follows: Z_saRGB = softmax(Q_RGB · K_RGB^T / √D_K) · V_RGB.
The cross-attention module is then used to process associations between different modalities or different inputs to reduce ambiguity: Q comes from the other input modality while K and V come from the same input modality, so that effective information exchange and fusion between the different modalities is established, and information transfer and complementation between the different modalities are promoted. In this module, a query obtained from the other input feature (e.g. Q_D) and a key in the self input feature (e.g. K_RGB) are used to calculate the correlation, expressed as follows: Z_caRGB = softmax(Q_D · K_RGB^T / √D_K) · V_RGB (and Z_caD analogously, with Q_RGB, K_D, and V_D).
Here, Z_saRGB and Z_saD are the outputs of the self-attention module; Q_RGB, K_RGB, and V_RGB are the related intermediate representations of the RGB image features, and Q_D, K_D, and V_D are the related intermediate representations of the D image features.
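The self-attention and cross-attention computations above can be sketched in numpy; the dimensions, random weights, and single-head form are illustrative assumptions rather than the patented configuration:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (L_q, L_k) logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
HW, C = 64, 16                                # flattened H*W tokens, C channels
X_rgb = rng.normal(size=(HW, C))              # X_IN^RGB token sequence
X_d = rng.normal(size=(HW, C))                # X_IN^D token sequence
W_q, W_k, W_v = (rng.normal(size=(C, C)) for _ in range(3))
Q_rgb, K_rgb, V_rgb = X_rgb @ W_q, X_rgb @ W_k, X_rgb @ W_v
Q_d = X_d @ W_q

Z_sa_rgb = attention(Q_rgb, K_rgb, V_rgb)     # self-attention on RGB tokens
Z_ca_rgb = attention(Q_d, K_rgb, V_rgb)       # cross-attention: depth queries, RGB keys/values
```

Both outputs keep the (HW, C) token-sequence shape, which is what allows them to be reshaped back to C×H×W and summed into the modality branches.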
Finally, processing is carried out using an MLP comprising a two-layer fully connected feed-forward network with a GELU activation function in between, which computes the outputs X_OUT^RGB and X_OUT^D; their dimensions are the same as those of the input feature map, so they are added directly as supplemental information to the original modality branches.
Here, X_OUT^RGB and X_OUT^D have the same dimensions as the input sequence; the outputs are then reshaped into C×H×W feature maps F_OUT^RGB and F_OUT^D and fed back into each individual modality branch using element-wise summation with the existing feature mappings.
Because processing a high-resolution feature map is computationally very expensive, to reduce the computation of the Transformer on high-resolution feature maps, the high-resolution feature map is downsampled by average pooling to a fixed resolution of H = W = 8 and then passed as input to the module; the output is upsampled back to the original resolution using bilinear interpolation before the element-wise summation with the existing feature map.
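This downsample-then-restore step can be sketched as follows, assuming the map's sides are divisible by 8; the pooling and interpolation here are plain numpy stand-ins for the framework operations:

```python
import numpy as np

def avg_pool_to(x, out_h=8, out_w=8):
    """Average-pool a (H, W) map to a fixed (out_h, out_w) resolution
    (H and W are assumed divisible by the target size here)."""
    h, w = x.shape
    return x.reshape(out_h, h // out_h, out_w, w // out_w).mean(axis=(1, 3))

def bilinear_upsample(x, out_h, out_w):
    """Upsample a (h, w) map to (out_h, out_w) by bilinear interpolation."""
    h, w = x.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.clip(ys.astype(int), 0, h - 2)
    x0 = np.clip(xs.astype(int), 0, w - 2)
    dy = (ys - y0)[:, None]                    # fractional row offsets
    dx = (xs - x0)[None, :]                    # fractional column offsets
    a = x[y0][:, x0]; b = x[y0][:, x0 + 1]
    c = x[y0 + 1][:, x0]; d = x[y0 + 1][:, x0 + 1]
    return (a * (1 - dy) * (1 - dx) + b * (1 - dy) * dx
            + c * dy * (1 - dx) + d * dy * dx)

feat = np.arange(32 * 32, dtype=float).reshape(32, 32)
small = avg_pool_to(feat)                      # 8x8 input to the Transformer module
restored = bilinear_upsample(small, 32, 32)    # back to original resolution
```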
Further, the loss function of the Head portion is a bounding box regression loss function. Wherein the bounding box regression loss function comprises: the sum of the bounding box regression loss, confidence loss, classification loss, and mask regression loss.
Specifically, the Head portion introduces the new bounding box regression loss function SIoU, replacing the original CIoU loss function, which improves the convergence speed of model training and the inference accuracy. Compared with CIoU, SIoU takes the angle problem into account and is composed of an angle loss, a distance loss, a shape loss, and an overlap loss. The overall loss function includes the sum of the bounding box regression loss, the confidence loss, the classification loss, and the mask regression loss.
In the distance loss function Δ, (b_cx^gt, b_cy^gt) are the center coordinates of the ground-truth box and (b_cx, b_cy) are the center coordinates of the predicted box; C_w and C_h are the width and height of the minimum bounding rectangle of the ground-truth box and the predicted box, and ρ_x and ρ_y represent the distances between the center coordinates of the two boxes along each axis. In the angle loss function, c_w and c_h are the horizontal and vertical offsets between the center points of the ground-truth box and the predicted box, and σ denotes the distance between the two center points. In the shape loss function Ω, (w, h) and (w^gt, h^gt) are the width and height of the predicted box and the ground-truth box respectively, and θ controls the degree of attention paid to the shape loss. K indexes the output feature maps; S² and N denote the number of image grid cells in the prediction process and the number of predicted boxes in each cell, respectively; the coefficient M_kij^obj indicates whether the j-th predicted box of the i-th cell in the K-th output feature map is a positive sample; BCE_sig^obj denotes a binary cross-entropy loss function, and w_obj and w_cls denote the positive-sample weights. x_p and x_gt denote the prediction vector and the ground-truth vector respectively; α_box, α_obj, α_cls, and α_seg denote the weights of the position error, confidence error, classification error, and segmentation error, respectively. The segmentation loss function L_seg uses binary cross-entropy, where P is the h×w×k matrix of prototype masks and C is the n×k matrix of mask coefficients for the n instances retained after NMS and thresholding; σ denotes the sigmoid function, and the predicted mask M_p is combined with the ground-truth mask M_gt and passed into the binary cross-entropy calculation.
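A sketch of the SIoU bounding-box loss following the published SIoU formulation (angle, distance, shape, and overlap terms); the (x1, y1, x2, y2) box format and θ = 4 are assumptions, not values given in this disclosure:

```python
import math

def siou_loss(pred, gt, theta=4.0):
    """Sketch of SIoU: 1 - IoU + (distance cost + shape cost) / 2,
    with the distance cost modulated by the angle cost."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1
    # overlap cost (IoU)
    ix = max(0.0, min(px2, gx2) - max(px1, gx1))
    iy = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = ix * iy
    iou = inter / (pw * ph + gw * gh - inter)
    # minimum enclosing rectangle of the two boxes
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    # angle cost from the box centers
    dx = abs((gx1 + gx2) / 2 - (px1 + px2) / 2)
    dy = abs((gy1 + gy2) / 2 - (py1 + py2) / 2)
    sigma = math.hypot(dx, dy) + 1e-9
    angle = 1 - 2 * math.sin(math.asin(dy / sigma) - math.pi / 4) ** 2
    # distance cost, modulated by the angle cost
    gamma = 2 - angle
    delta = ((1 - math.exp(-gamma * (dx / cw) ** 2))
             + (1 - math.exp(-gamma * (dy / ch) ** 2)))
    # shape cost
    omega = ((1 - math.exp(-abs(pw - gw) / max(pw, gw))) ** theta
             + (1 - math.exp(-abs(ph - gh) / max(ph, gh))) ** theta)
    return 1 - iou + (delta + omega) / 2

loss_same = siou_loss((0, 0, 10, 10), (0, 0, 10, 10))   # identical boxes
loss_off = siou_loss((2, 2, 12, 12), (0, 0, 10, 10))    # shifted box
```

Identical boxes give a loss of 0, and any offset raises the loss, matching the role of a bounding-box regression target.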
Further, after identifying the image units of the plurality of objects to be identified in the photographed object, the method further includes:
step S310, dividing the image unit according to the identification result to obtain a plurality of image data of the object to be identified.
Step S320, the sizes of the image data of the plurality of objects to be identified are adjusted to a preset size.
Step S330, obtaining the type and state information of the object to be identified based on the image data of the object to be identified with the preset size.
In one embodiment of the invention, mAP0.5 and mAP0.5:0.95 (IoU step 0.05) are used to evaluate the segmentation performance of the model. Computing mAP requires Precision and Recall.
Here, True Positive (TP) means that the IoU between the predicted mask and the ground truth is greater than the prescribed threshold, False Positive (FP) means that the IoU between the predicted mask and the ground truth does not satisfy the prescribed threshold, and False Negative (FN) means that there is no intersection between the predicted mask and the ground truth. The mAP is calculated as follows:
where AP represents the area under the Precision-Recall curve of each class, and mAP0.5 represents the average of the APs over all classes when the IoU threshold is set to 0.5. mAP0.5:0.95 averages over the IoU thresholds 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, and 0.95, which is obviously much more stringent than mAP0.5; both values range from 0 to 1.
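The AP and mAP computation described here can be sketched as follows; the all-points interpolation of the precision-recall curve is one common convention, assumed rather than specified by this disclosure:

```python
def voc_ap(recalls, precisions):
    """Area under a precision-recall curve (all-points interpolation)."""
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # make the precision envelope monotonically non-increasing
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1]
               for i in range(len(mrec) - 1))

def mean_ap(ap_per_threshold):
    """mAP: mean AP over classes, then over IoU thresholds
    (0.5:0.05:0.95 gives ten thresholds for mAP0.5:0.95)."""
    per_thr = [sum(aps) / len(aps) for aps in ap_per_threshold]
    return sum(per_thr) / len(per_thr)

# toy PR curve: recall 0.5 at precision 1.0, recall 1.0 at precision 0.5
ap50 = voc_ap([0.5, 1.0], [1.0, 0.5])
```

For mAP0.5:0.95, `mean_ap` would be fed one list of per-class APs for each of the ten IoU thresholds.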
TABLE 1
As can be seen from Table 1, on the power distribution cabinet components the proposed method scores higher than the other algorithms on both mAP0.5 and mAP0.5:0.95.
As can be seen from figs. 5a to 5l, testing the visible light images and the corresponding depth images of 3 different shooting angles of the power distribution cabinet components shows that all the networks segment well in the visible light scene. However, in the low light scenario, the RGB mode fails to segment the rotary switch knob (ee_rot_swch_hdl), the load switch handle (ee_load_swchl_hdl), and the door handle (ee_gt_lk_hdl). The CBAM mode can segment the door handle but misses the rotary switch knob and the load switch handle; the CFT mode also misses them. The image recognition method based on cross-modal feature fusion of the present invention avoids these problems.
The method solves the recognition and operation difficulties that a power distribution cabinet robot faces in low-illumination or night scenes. The newly designed Transformer modules are densely inserted into a dual-stream network framework; the self-attention mechanism cooperates with the staggered attention mechanism to capture the relations between different positions in a sequence, model long-range dependencies, and integrate global context information, so that the spatial relations of the components can be accurately understood. By considering the global information around the components, the robot can accurately locate and segment the targets. Finally, the SIoU loss function replaces the original CIoU loss function; SIoU adds the angle problem on the basis of CIoU, comprising angle loss, distance loss, shape loss, and overlap loss, which improves the training convergence speed and segmentation precision of the model. The model has low resource consumption and is easy to deploy on edge devices.
Accordingly, referring to fig. 6, a second aspect of the embodiment of the present invention provides an image recognition system based on cross-modal feature fusion, including:
an image acquisition module 1 for acquiring an RGB image and a depth image of a photographic subject;
The image recognition module 2 is used for recognizing RGB images and depth images based on the cross-modal feature fusion model, recognizing image units of a plurality of targets to be recognized in the shooting object, and acquiring the types and state information of the targets to be recognized according to the image units of the targets to be recognized;
the cross-modal feature fusion model is used for carrying out feature extraction on the RGB image and the depth image, obtaining features of multiple levels of the RGB image and the depth image, and utilizing a self-attention mechanism, a staggered attention mechanism and a multi-head attention mechanism to fuse complementary semantic information between the features of the RGB image and the depth image, so as to fuse the features of multiple scales step by step.
Further, referring to fig. 7, the image recognition module 2 includes:
an image segmentation unit 21 for segmenting the image units according to the recognition result to obtain image data of a plurality of objects to be recognized;
an image adjustment unit 22 for adjusting the sizes of the image data of the plurality of objects to be identified to a preset size;
an information acquisition unit 23 for acquiring the kind and state information of the object to be recognized based on the image data of the object to be recognized of a preset size.
Further, referring to fig. 8, the image recognition system based on cross-modal feature fusion further includes: model training module 3, model training module 3 includes:
a historical data acquisition unit 31 for acquiring historical image data of the shooting object under various shooting conditions, the historical image data including: a plurality of historical RGB images of the shooting object and corresponding historical depth images;
the model recognition training unit 32 is configured to perform recognition training of the target to be recognized on the cross-modal feature fusion model based on the historical image data of a preset proportion.
Further, the cross-modal feature fusion model includes: a Backbone portion, a Neck portion, and a Head portion;
the Backbone portion receives the RGB image and the depth image respectively; the convolution modules extract features of the RGB image and the depth image at multiple scales, the feature fusion modules perform feature fusion to obtain feature maps at multiple scales, and the feature maps are sent to the Neck portion through the channel attention modules respectively;
the Neck portion extracts the features output by the channel attention modules, performs fusion processing across scales, and sends the fused features to the Head portion;
the Head part determines a segmented region of the object to be identified according to the features.
Further, the Backbone portion includes a first branch that receives the RGB image and a second branch that receives the depth image;
the first branch and the second branch are each provided with a plurality of corresponding picture feature extraction units, and each picture feature extraction unit includes: a Conv module, a C3 module and/or an SPPF module;
the first branch and the second branch are provided with a TransSACA module corresponding to each picture feature extraction unit; the TransSACA module receives the features extracted by the corresponding picture feature extraction units of the first branch and the second branch respectively, and after fusion the features are sent back to the corresponding branches respectively.
Further, the TransSACA module adopts a multi-modal feature fusion mechanism: the first input is the RGB image convolution feature map and the second input is the D image convolution feature map; each is flattened and reshaped into a matrix sequence, and the input sequence of the Transformer module is obtained after adding the position embedding;
based on the input sequence of the Transformer module, the self-attention mechanism uses the dot product of Q_RGB and K_RGB to calculate the attention weight, which is then multiplied by V_RGB to obtain the output Z_saRGB (and likewise Z_saD); the cross-attention mechanism uses the dot product of Q_D and K_RGB to calculate the attention weight, which is then multiplied by V_RGB to obtain the output Z_caRGB (and likewise Z_caD);
processing based on a multi-layer perceptron model then follows, comprising a two-layer fully connected feed-forward network with a GELU activation function in between, computing the outputs X_OUT^RGB and X_OUT^D, whose dimensions are the same as those of the input sequence; the outputs are reshaped into C×H×W feature mappings F_OUT^RGB and F_OUT^D and fed back into each individual modality branch using element-wise summation with the existing feature mappings.
Further, the loss function of the Head part is a bounding box regression loss function;
wherein the bounding box regression loss function comprises: the sum of the bounding box regression loss, confidence loss, classification loss, and mask regression loss.
Accordingly, a third aspect of the embodiment of the present invention provides an electronic device, including: at least one processor; and a memory coupled to the at least one processor; the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the above image recognition method based on cross-modal feature fusion.
Accordingly, a fourth aspect of embodiments of the present invention provides a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the above-described cross-modality feature fusion-based image recognition method.
The embodiment of the invention aims to protect an image recognition method and system based on cross-modal feature fusion, wherein the method comprises the following steps: acquiring an RGB image and a depth image of a shooting object; identifying RGB images and depth images based on the cross-modal feature fusion model, identifying image units of a plurality of targets to be identified in a shooting object, and acquiring the types and state information of the targets to be identified according to the image units of the targets to be identified; the cross-modal feature fusion model is used for carrying out feature extraction on the RGB image and the depth image, obtaining features of multiple levels of the RGB image and the depth image, and utilizing a self-attention mechanism, a staggered attention mechanism and a multi-head attention mechanism to fuse complementary semantic information between the features of the RGB image and the depth image, so as to fuse the features of multiple scales step by step. The technical scheme has the following effects:
In order to meet the requirement of segmenting the targets of power distribution cabinet components in a dynamic environment, a depth image captured by a depth camera is introduced as another modality; features of each modality at different scales are extracted based on a CNN, and complementary fusion between the different modalities is carried out through the Transformer module, so that the perception capability, reliability, and robustness of the target positioning and segmentation algorithm are improved. The model has a small structure and low resource consumption, and is easy to deploy on edge devices.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.
Claims (10)
1. An image recognition method based on cross-modal feature fusion is characterized by comprising the following steps:
acquiring an RGB image and a depth image of a shooting object;
identifying the RGB image and the depth image based on the cross-modal feature fusion model, identifying a plurality of image units of targets to be identified in the shooting object, and acquiring the type and state information of the targets to be identified according to the image units of the targets to be identified;
the cross-modal feature fusion model performs feature extraction on the RGB image and the depth image, obtains features of multiple levels of the RGB image and the depth image, fuses complementary semantic information between the RGB image and the depth image features by using a self-attention mechanism, a staggered attention mechanism and a multi-head attention mechanism, and fuses the features of multiple scales step by step.
2. The method for identifying images based on cross-modal feature fusion according to claim 1, further comprising, before identifying the RGB image and the depth image based on the cross-modal feature fusion model:
acquiring historical image data of the shooting object under various shooting conditions, wherein the historical image data comprises: a plurality of historical RGB images of the shooting object and corresponding historical depth images;
And based on the historical image data of a preset proportion, performing recognition training on the target to be recognized on the cross-modal feature fusion model.
3. The method for identifying images based on cross-modal feature fusion as claimed in claim 2, wherein,
the cross-modal feature fusion model comprises: a Backbone portion, a Neck portion, and a Head portion;
the back bone part receives the RGB image and the depth image respectively, extracts the characteristics of a plurality of scales of the RGB image and the depth image through a convolution module, obtains characteristic diagrams of a plurality of scales after characteristic fusion through a plurality of corresponding characteristic fusion modules, and sends the characteristic diagrams to the Neck part through a channel attention module respectively;
the Neck part extracts and performs the fusion processing on the scale of the characteristics output by the channel attention module, and sends the characteristics after the fusion processing to the Head part;
and the Head part determines the segmentation area of the object to be identified according to the characteristics.
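The claim names a channel attention module between the Backbone and the Neck without disclosing its internal form. The sketch below is a minimal squeeze-and-excitation-style channel attention over a (C, H, W) feature map — an illustrative stand-in with fixed random weights, not the patented design:

```python
import numpy as np

def channel_attention(feat: np.ndarray, reduction: int = 4) -> np.ndarray:
    """Reweight the channels of a (C, H, W) feature map.

    Squeeze-and-excitation-style gate: global average pool, a small
    two-layer bottleneck, then a per-channel sigmoid scale. The weights
    here are fixed random values purely for illustration.
    """
    c, _, _ = feat.shape
    rng = np.random.default_rng(0)
    w1 = rng.standard_normal((c // reduction, c)) * 0.1
    w2 = rng.standard_normal((c, c // reduction)) * 0.1
    squeeze = feat.mean(axis=(1, 2))              # (C,) global average pool
    hidden = np.maximum(w1 @ squeeze, 0.0)        # ReLU bottleneck
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid gate in (0, 1)
    return feat * scale[:, None, None]            # per-channel reweighting

gated = channel_attention(np.ones((8, 4, 4)))     # same shape as the input
```

Because the gate is strictly between 0 and 1, the module can only attenuate channels, never amplify them — a common design choice that keeps the fused features bounded.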
4. The method for identifying images based on cross-modal feature fusion as claimed in claim 3,
the Backbone part includes a first branch that receives the RGB image and a second branch that receives the depth image;
The first branch and the second branch are respectively provided with a plurality of corresponding picture feature extraction units, and the picture feature extraction units comprise: a Conv module, a C3 module and/or an SPPF module;
the first branch and the second branch are provided with TransSACA modules corresponding to the picture feature extraction units; each TransSACA module receives the features extracted by the corresponding picture feature extraction units of the first branch and the second branch respectively, and after feature fusion sends the fused features back to the corresponding branches.
5. The method for identifying images based on cross-modal feature fusion as claimed in claim 4, wherein,
the TransSACA module adopts a multi-modal feature fusion mechanism: its first input is an RGB image convolution feature map and its second input is a depth image (D image) convolution feature map; the RGB image convolution feature map and the D image convolution feature map are respectively flattened and reshaped into matrix sequences, and position embeddings are added to obtain the input sequences of the Transformer module;
based on the input sequences of the Transformer module, the self-attention mechanism uses the similarity of Q_RGB and K_RGB (and, correspondingly, of Q_D and K_D) to calculate attention weights, which are multiplied by V_RGB (respectively V_D) to obtain the outputs Z_saRGB and Z_saD; the cross-attention mechanism uses the similarity of Q_D and K_RGB to calculate attention weights, which are multiplied by V_RGB (and, correspondingly, Q_RGB with K_D and V_D) to obtain the outputs Z_caRGB and Z_caD;
the results are then processed by a multi-layer perceptron model comprising a two-layer fully connected feed-forward network with a GELU activation function between the layers, calculating the outputs X_OUT_RGB and X_OUT_D; X_OUT_RGB and X_OUT_D have the same dimensions as the input sequences, and are reshaped into feature mappings F_OUT_RGB and F_OUT_D of size C×H×W, which are fed back into each individual modality branch by element-wise summation with the existing feature mappings.
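The self- and cross-attention steps of this claim reduce to scaled dot-product attention with Q, K and V drawn from the same or the opposite modality. A toy NumPy sketch follows; the sequence length and embedding size are illustrative values, not figures from the patent:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    # scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
x_rgb = rng.standard_normal((4, 8))   # flattened RGB tokens (seq_len, dim)
x_d = rng.standard_normal((4, 8))     # flattened depth tokens

# self-attention: Q, K and V all come from the same modality
z_sa_rgb = attention(x_rgb, x_rgb, x_rgb)
z_sa_d = attention(x_d, x_d, x_d)

# cross-attention: the query comes from the opposite modality
z_ca_rgb = attention(x_d, x_rgb, x_rgb)   # Q_D with K_RGB, V_RGB
z_ca_d = attention(x_rgb, x_d, x_d)       # Q_RGB with K_D, V_D
```

The outputs keep the input sequence shape, which is what allows the later element-wise summation back into each modality branch.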
6. The method for identifying images based on cross-modal feature fusion as claimed in claim 3,
the loss function of the Head part is a bounding box regression loss function;
wherein the bounding box regression loss function comprises: the sum of the bounding box regression loss, confidence loss, classification loss, and mask regression loss.
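The claim states only that the Head loss is the sum of four terms; it does not fix the form of each term. The sketch below uses common stand-ins (1 − IoU for the box term, binary cross-entropy for the confidence, classification and mask terms) purely for illustration:

```python
import numpy as np

def bce(p: np.ndarray, y: np.ndarray) -> float:
    """Binary cross-entropy; an illustrative choice, not the patent's formula."""
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean())

def iou_loss(pred, gt) -> float:
    """1 - IoU for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return 1.0 - inter / (area(pred) + area(gt) - inter)

def head_loss(pred_box, gt_box, obj_p, obj_y, cls_p, cls_y, mask_p, mask_y):
    # total = box regression + confidence + classification + mask regression
    return (iou_loss(pred_box, gt_box) + bce(obj_p, obj_y)
            + bce(cls_p, cls_y) + bce(mask_p, mask_y))

loss = head_loss((0, 0, 2, 2), (1, 1, 3, 3),
                 np.array([0.9]), np.array([1.0]),   # confidence
                 np.array([0.9]), np.array([1.0]),   # classification
                 np.array([0.1]), np.array([0.0]))   # mask
```

In practice the four terms are usually weighted; the claim only specifies a sum, so no weights are shown here.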
7. The cross-modal feature fusion-based image recognition method as claimed in any one of claims 1 to 6, further comprising, after identifying the image units of the plurality of targets to be identified in the shooting object:
dividing the image unit according to the identification result to obtain image data of a plurality of targets to be identified;
adjusting the sizes of the image data of the plurality of targets to be identified to a preset size;
and acquiring the type and state information of the target to be identified based on the image data of the target to be identified with a preset size.
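The resize step in this claim can be sketched with a plain nearest-neighbour resize; the 64×64 target is an illustrative stand-in for the unspecified preset size:

```python
import numpy as np

def resize_nearest(img: np.ndarray, size=(64, 64)) -> np.ndarray:
    """Nearest-neighbour resize of an (H, W, C) crop to a preset size."""
    h, w = img.shape[:2]
    rows = np.arange(size[0]) * h // size[0]   # source row per target row
    cols = np.arange(size[1]) * w // size[1]   # source column per target column
    return img[rows][:, cols]

# crops of differing sizes become one uniform batch for classification
crops = [np.ones((30, 20, 3)), np.ones((50, 70, 3))]
batch = np.stack([resize_nearest(c) for c in crops])
```

Stacking only works once every crop has the same preset size, which is why the resize precedes the type-and-state step.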
8. An image recognition system based on cross-modal feature fusion, comprising:
an image acquisition module for acquiring an RGB image and a depth image of a shooting object;
the image recognition module is used for recognizing the RGB image and the depth image based on the cross-modal feature fusion model, recognizing image units of a plurality of targets to be recognized in the shooting object, and acquiring the type and state information of the targets to be recognized according to the image units of the targets to be recognized;
wherein the cross-modal feature fusion model performs feature extraction on the RGB image and the depth image to obtain features of the RGB image and the depth image at a plurality of levels, fuses complementary semantic information between the RGB image features and the depth image features using a self-attention mechanism, a cross-attention mechanism and a multi-head attention mechanism, and fuses the features of a plurality of scales step by step.
9. An electronic device, comprising: at least one processor; and a memory coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the cross-modal feature fusion-based image recognition method of any one of claims 1-7.
10. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the cross-modal feature fusion based image recognition method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311063209.8A CN117036891B (en) | 2023-08-22 | 2023-08-22 | Cross-modal feature fusion-based image recognition method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117036891A true CN117036891A (en) | 2023-11-10 |
CN117036891B CN117036891B (en) | 2024-03-29 |
Family
ID=88624313
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311063209.8A Active CN117036891B (en) | 2023-08-22 | 2023-08-22 | Cross-modal feature fusion-based image recognition method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117036891B (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021115159A1 (en) * | 2019-12-09 | 2021-06-17 | 中兴通讯股份有限公司 | Character recognition network model training method, character recognition method, apparatuses, terminal, and computer storage medium therefor |
CN113989340A (en) * | 2021-10-29 | 2022-01-28 | 天津大学 | Point cloud registration method based on distribution |
CN114419323A (en) * | 2022-03-31 | 2022-04-29 | 华东交通大学 | Cross-modal learning and domain self-adaptive RGBD image semantic segmentation method |
CN114494215A (en) * | 2022-01-29 | 2022-05-13 | 脉得智能科技(无锡)有限公司 | Transformer-based thyroid nodule detection method |
CN114693952A (en) * | 2022-03-24 | 2022-07-01 | 安徽理工大学 | RGB-D significance target detection method based on multi-modal difference fusion network |
CN114973411A (en) * | 2022-05-31 | 2022-08-30 | 华中师范大学 | Self-adaptive evaluation method, system, equipment and storage medium for attitude motion |
CN115713679A (en) * | 2022-10-13 | 2023-02-24 | 北京大学 | Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map |
CN115908789A (en) * | 2022-12-09 | 2023-04-04 | 大连民族大学 | Cross-modal feature fusion and asymptotic decoding saliency target detection method and device |
WO2023056889A1 (en) * | 2021-10-09 | 2023-04-13 | 百果园技术(新加坡)有限公司 | Model training and scene recognition method and apparatus, device, and medium |
CN116052108A (en) * | 2023-02-21 | 2023-05-02 | 浙江工商大学 | Transformer-based traffic scene small sample target detection method and device |
CN116206133A (en) * | 2023-04-25 | 2023-06-02 | 山东科技大学 | RGB-D significance target detection method |
CN116310396A (en) * | 2023-02-28 | 2023-06-23 | 安徽理工大学 | RGB-D significance target detection method based on depth quality weighting |
CN116385761A (en) * | 2023-01-31 | 2023-07-04 | 同济大学 | 3D target detection method integrating RGB and infrared information |
CN116452805A (en) * | 2023-04-15 | 2023-07-18 | 安徽理工大学 | Transformer-based RGB-D semantic segmentation method of cross-modal fusion network |
Non-Patent Citations (4)
Title |
---|
FANG QINGYUN et al.: "Cross-Modality Fusion Transformer for Multispectral Object Detection", 《ARXIV:2111.00273V4》, pages 1 - 10 *
QINGJUN RU et al.: "Cross-Modal Transformer for RGB-D semantic segmentation of production workshop objects", 《PATTERN RECOGNITION》, pages 1 - 12 *
ZONGWEI WU et al.: "Transformer Fusion for Indoor RGB-D Semantic Segmentation", 《COMPUTER VISION AND IMAGE UNDERSTANDING》, pages 1 - 15 *
ZHANG FEI: "Research on person localization and tracking methods based on multimodal data", 《CNKI Dissertations》, vol. 2023, no. 02 *
Also Published As
Publication number | Publication date |
---|---|
CN117036891B (en) | 2024-03-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20200272806A1 (en) | Real-Time Tracking of Facial Features in Unconstrained Video | |
CN114666564B (en) | Method for synthesizing virtual viewpoint image based on implicit neural scene representation | |
CN110599395A (en) | Target image generation method, device, server and storage medium | |
CN112036260B (en) | Expression recognition method and system for multi-scale sub-block aggregation in natural environment | |
CN112001859A (en) | Method and system for repairing face image | |
JP2021163503A (en) | Three-dimensional pose estimation by two-dimensional camera | |
CN113095106A (en) | Human body posture estimation method and device | |
CN114783024A (en) | Face recognition system of gauze mask is worn in public place based on YOLOv5 | |
CN111062263A (en) | Method, device, computer device and storage medium for hand pose estimation | |
CN112200157A (en) | Human body 3D posture recognition method and system for reducing image background interference | |
CN112580434B (en) | Face false detection optimization method and system based on depth camera and face detection equipment | |
CN111160291A (en) | Human eye detection method based on depth information and CNN | |
Zheng et al. | Feater: An efficient network for human reconstruction via feature map-based transformer | |
CN116958420A (en) | High-precision modeling method for three-dimensional face of digital human teacher | |
CN114494594B (en) | Deep learning-based astronaut operation equipment state identification method | |
CN106845555A (en) | Image matching method and image matching apparatus based on Bayer format | |
CN113065506B (en) | Human body posture recognition method and system | |
JP2021176078A (en) | Deep layer learning and feature detection through vector field estimation | |
JP2021163502A (en) | Three-dimensional pose estimation by multiple two-dimensional cameras | |
CN116681687B (en) | Wire detection method and device based on computer vision and computer equipment | |
CN117036891B (en) | Cross-modal feature fusion-based image recognition method and system | |
KR20210018114A (en) | Cross-domain metric learning system and method | |
CN114863487A (en) | One-stage multi-person human body detection and posture estimation method based on quadratic regression | |
CN107403145A (en) | Image characteristic points positioning method and device | |
Agusta et al. | Field seeding algorithm for people counting using kinect depth image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||