CN115965786A - Occlusion target identification method based on local semantic perception attention neural network


Info

Publication number
CN115965786A
CN115965786A (application CN202310018475.2A)
Authority
CN
China
Prior art keywords
feature
feature map
local semantic
network
semantic
Prior art date
Legal status
Pending
Application number
CN202310018475.2A
Other languages
Chinese (zh)
Inventor
毛建旭
易俊飞
王耀南
张辉
曾凯
陶梓铭
钟杭
刘彩苹
朱青
刘敏
Current Assignee
Hunan University
Original Assignee
Hunan University
Priority date
Filing date
Publication date
Application filed by Hunan University
Priority to CN202310018475.2A
Publication of CN115965786A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method for identifying occluded targets based on a local semantic perception attention neural network. The method acquires data for network training, validation, and testing; preprocesses and annotates the data; and builds training, validation, and test datasets according to a preset split ratio. A local semantic perception attention-enhanced neural network is constructed and trained on the training set, with backpropagation driven by a preset network loss function, and the network is validated on the validation set. The test set is then fed into the trained network to obtain the confidence and position of each occluded target, and the final output is determined as the recognition result using a non-maximum suppression algorithm. By perceiving and enhancing the semantics of occluded targets, the method improves the model's semantic recognition of such targets and ultimately improves a robot's ability to recognize occluded targets.

Description

Occlusion target identification method based on local semantic perception attention neural network
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a method for identifying an occluded target based on a local semantic perception attention neural network.
Background
With the continued economic development of China, intelligent robots have become an important part of daily life, for example power-line inspection robots and meal-delivery robots. These robots usually carry a vision sensor to perceive and recognize the external environment and then autonomously complete preset tasks. In environmental perception, target recognition is the basis of autonomous robot operation. During target recognition, occlusion makes it difficult for the robot to determine the exact position and category of a target, greatly degrading recognition of occluded targets. The reasons for this poor performance include: 1) occlusion weakens the target's visible features, impairing feature extraction; 2) the occluding region usually belongs to the background, which biases the robot's understanding of the target so that background features are mistaken for target features, causing false detections. In summary, to address the reduced recognition capability of robots when targets are occluded, it is necessary to study occlusion target identification methods based on local semantic perception attention neural networks.
Disclosure of Invention
In view of these technical problems, the present invention provides a method for identifying occluded targets based on a local semantic perception attention neural network.
The technical solution adopted by the present invention to solve these problems is as follows:
A method for identifying occluded targets based on a local semantic perception attention neural network comprises the following steps:
S100: acquire image data of the region under test, preprocess the data, annotate the position and category of each occluded target in the image data, and construct training, validation, and test datasets according to a preset split ratio;
S200: build a local semantic perception attention-enhanced neural network comprising a residual network with local semantic perception attention enhancement, a feature pyramid network, and a detection head. The residual network extracts depth feature maps at different resolutions; the feature pyramid network fuses these maps to obtain fused depth feature maps at different resolutions; and the detection head predicts the position and category of the occluded target from the fused maps;
S300: train the local semantic perception attention-enhanced neural network on the constructed training set, backpropagating through it according to the validation set and a preset network loss function and updating the network weights; after a preset number of training epochs, the trained network is obtained;
S400: feed the test set into the trained local semantic perception attention-enhanced neural network to obtain the confidence and position of each occluded target, and determine the final output as the target recognition result by combining the confidences and positions with a non-maximum suppression algorithm.
Preferably, the residual network with local semantic perception attention enhancement in S200 comprises, connected in sequence, a first residual block, the local semantic perception attention enhancement network, a second residual block, a third residual block, and a fourth residual block.
The first residual block extracts semantic features from an input training-set picture to obtain a first feature map, which is input to the local semantic perception attention enhancement network. That network strengthens the network's attention to local feature blocks carrying the same semantic information in the first feature map, obtains a locally semantic-enhanced feature map, and sends it to the second residual block. The second residual block extracts semantic features from the locally semantic-enhanced feature map to obtain a second feature map; the third residual block extracts semantic features from the second feature map to obtain a third feature map; and the fourth residual block extracts semantic features from the third feature map to obtain a fourth feature map. The second, third, and fourth feature maps are input to the feature pyramid network.
Preferably, the local semantic perception attention enhancement network in S200 strengthens the network's attention to feature blocks with the same local semantic information in the first feature map and obtains the locally semantic-enhanced feature map as follows:
S221: apply a sliding-window operation to the first feature map F to obtain equally sized multi-dimensional local semantic blocks F1, and transpose F1 to obtain multi-dimensional local semantic blocks F2;
S222: multiply F1 and F2 as matrices to obtain the Gram matrix G:
G = F1 ⊙ F2
where ⊙ denotes matrix multiplication and each element G_ij of the Gram matrix G represents the similarity between local semantic blocks i and j;
S223: normalize the Gram matrix to obtain the similarity matrix G_n:
G_n = G / ||G||_2
where ||G||_2 denotes the two-norm of G;
S224: average the similarity matrix G_n row-wise to obtain the weights w of the local semantic blocks, multiply the multi-dimensional semantic blocks F1 element-wise by the weight coefficients w to obtain locally semantic-enhanced features, restore these features via an inverse sliding-window operation to a local semantic feature F' with the same size as the input feature map F, and feed F' into a dynamic rectified linear function for adjustment to obtain the occlusion-target feature map with local semantic enhancement.
Preferably, feeding the local semantic feature F' into the dynamic rectified linear function for adjustment in S224 to obtain the locally semantic-enhanced occlusion-target feature map comprises:
S2241: apply global average pooling to the local semantic feature F' to obtain a global feature g;
S2242: successively apply convolution-ReLU and convolution-Sigmoid operations to the global feature g to obtain the coefficients C of the dynamic rectified linear function, limited to the numerical range [-0.5, 0.5]:
C = Sigmoid(conv2(R(conv1(g)))) - 0.5
where conv1 and conv2 each denote a 1x1 convolution, R denotes ReLU, and the Sigmoid function limits the convolution output to [0, 1];
S2243: split the coefficients C along the channel dimension into linear coefficients a1, b1, a2, and b2; the final DyReLU output is:
O = Max(F' * (a1*2 + 1) + b1, F' * (a2*2) + b2)
where Max denotes the element-wise maximum, * denotes element-wise multiplication, and O is the locally semantic-enhanced occlusion-target feature map output by DyReLU.
Preferably, the feature pyramid network in S200 fuses depth feature maps of different resolutions and obtains fused feature maps of different resolutions as follows:
S225: reduce the dimensionality of the second, third, and fourth feature maps with 1x1 convolutions to obtain dimension-reduced feature maps;
S226: fuse the dimension-reduced second, third, and fourth feature maps from top to bottom: bilinearly interpolate the upper-layer features to the size of the lower-layer features, then add the two feature maps element-wise to fuse their semantic information, yielding the fused second, third, and fourth feature maps at their respective resolutions. Finally, apply a 3x3 convolution to the fused fourth feature map to obtain a fifth feature map and a further 3x3 convolution to the fifth feature map to obtain a sixth feature map; the fused feature maps at different resolutions are thus the fused second, third, and fourth feature maps together with the fifth and sixth feature maps.
Preferably, the detection head comprises a classification branch, a regression branch, and an IoU-prediction branch. The classification branch comprises four 3x3 convolution layers and one 1x1 convolution layer; the fused features at different resolutions pass through these five layers to output category predictions and obtain classification scores. The regression branch comprises four 3x3 convolution layers and one 1x1 convolution layer, through which the fused features at different resolutions output predictions of the occluded target's position. The IoU-prediction branch comprises one 1x1 convolution layer attached after the last 3x3 convolution layer of the regression branch; its output is the predicted intersection-over-union between the regression branch's result and the ground truth.
Preferably, the network loss function preset in S300 comprises a classification loss, a regression loss, and a semantic consistency loss.
The classification loss and regression loss are the FL (focal) loss and the GIoU loss, respectively:
FL(p, y) = -y(1-p)^γ · log(p) - (1-y) · p^γ · log(1-p)
GIoU = IoU - (A_c - U) / A_c
where y is the classification label, p the predicted classification score, γ a hyperparameter, IoU the intersection-over-union between the ground-truth box A and the predicted box B, C the minimum enclosing shape of A and B, A_c the area of C, and U the area of the union of A and B.
The Pull Loss is selected as the semantic consistency loss:
L_pull = -ln(1 - N_t + IoU(b_max, b_m)) · s_m
where L_pull denotes the Pull Loss, N_t a preset threshold, IoU(·,·) the intersection-over-union, b_max the detection box corresponding to the maximum classification score, b_m the ground-truth box corresponding to that maximum score, and s_m the maximum score predicted by the network.
Preferably, determining the final output as the target recognition result in S400, according to the confidence and position of the occluded target combined with a non-maximum suppression algorithm, comprises:
S410: sort the predictions by the occluded target's confidence score and save the prediction with the highest score into the algorithm's output list;
S420: compute the intersection-over-union between each remaining prediction and the highest-scoring prediction, suppress predictions whose IoU is greater than or equal to a preset IoU threshold, and retain those whose IoU is below the threshold;
S430: repeat S410-S420 on the retained predictions until no prediction remains whose IoU exceeds the preset threshold, obtaining the finally retained predictions;
S440: select the prediction with the highest confidence score among the finally retained predictions as the target recognition result.
Preferably, the method further comprises, after S400:
S500: compute the recognition accuracy of the local semantic perception attention-enhanced neural network and quantitatively analyze its recognition performance;
S600: visualize the network's outputs with a visualization tool, visually compare the visualized results to analyze the network's detection performance, and qualitatively analyze its recognition performance.
Preferably, S500 comprises:
computing the precision and recall of the local semantic perception attention-enhanced neural network:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
where TP is the number of correctly predicted target boxes, FP the number of boxes predicting wrong targets, and FN the number of targets wrongly predicted as background.
According to the method for identifying occluded targets based on a local semantic perception attention neural network, the designed local semantic perception attention enhances feature blocks with the same semantics in feature space; feature maps of different resolutions are then fused through the feature pyramid; and finally the preset loss function constrains the convergence of the neural network. The method can effectively improve the recognition of occluded targets, advance autonomous inspection in the power industry, and bring economic and social benefits.
Drawings
Fig. 1 is a flowchart of a method for identifying an occluded target based on a local semantic perception attention neural network according to an embodiment of the present invention;
Fig. 2 is a schematic framework diagram of the occlusion target identification method based on a local semantic perception attention neural network according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the residual network based on local semantic perception attention enhancement according to an embodiment of the present invention;
Fig. 4 is the display interface of the visualization tool according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention is further described in detail below with reference to the accompanying drawings.
In one embodiment, as shown in figs. 1 and 2, the method for identifying occluded targets based on a local semantic perception attention neural network comprises the following steps:
S100: acquire image data of the region under test, preprocess the data, annotate the position and category of each occluded target in the image data, and construct training, validation, and test datasets according to a preset split ratio.
Specifically, an inspection worker flies an unmanned aerial vehicle along the power lines of the region under test to capture aerial images of the lines, then uses the dedicated annotation tool Labelme to label the position and category of each occluded target in the images, constructing the training, validation, and test datasets (ratio: 8:…). In this embodiment, the occluded target is a bird's nest.
The data preprocessing mainly converts the pictures into data convenient for network training, including resizing, random flipping, random cropping, and normalization, to improve the network's training efficiency.
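As a concrete illustration, a minimal preprocessing pipeline covering these operations might look like the following torchvision sketch; the crop size and the ImageNet normalization statistics are assumptions, not values taken from the patent.

```python
import torchvision.transforms as T

# Hypothetical preprocessing pipeline: resize, random flip, random crop,
# normalize. Concrete parameter values here are assumed for illustration.
train_transform = T.Compose([
    T.Resize((800, 1333)),                          # resize to the network input scale
    T.RandomHorizontalFlip(p=0.5),                  # random flip (rate 0.5, as in the embodiment)
    T.RandomCrop((760, 1280), pad_if_needed=True),  # random crop (size assumed)
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],         # ImageNet statistics (assumed)
                std=[0.229, 0.224, 0.225]),
])
```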
S200: building a local semantic perception attention enhancement neural network, wherein the network comprises a residual error network based on local semantic perception attention enhancement, a feature pyramid network and a detection head network, and the residual error network based on local semantic perception attention enhancement is used for extracting depth feature maps with different resolutions; the characteristic pyramid network is used for fusing the depth characteristic maps with different resolutions to obtain the fused depth characteristic maps with different resolutions; and the detection head network is used for predicting the position and the category of the occluded target according to the fused depth feature maps with different resolutions.
In one embodiment, as shown in fig. 3, the residual network with local semantic perception attention enhancement in S200 comprises, connected in sequence, a first residual block, the local semantic perception attention enhancement network, a second residual block, a third residual block, and a fourth residual block.
The first residual block extracts semantic features from an input training-set picture to obtain a first feature map, which is input to the local semantic perception attention enhancement network. That network strengthens the network's attention to local feature blocks carrying the same semantic information in the first feature map, obtains a locally semantic-enhanced feature map, and sends it to the second residual block. The second residual block extracts semantic features from the locally semantic-enhanced feature map to obtain a second feature map; the third residual block extracts semantic features from the second feature map to obtain a third feature map; and the fourth residual block extracts semantic features from the third feature map to obtain a fourth feature map. The second, third, and fourth feature maps are input to the feature pyramid network.
Specifically, the local semantic perception attention enhancement network is placed between the first and second residual layers, the semantic-perception-attention-based residual network comprising this attention network and the residual network. The residual network contains four residual blocks that extract the semantic features of the occluded target while effectively preventing vanishing or exploding gradients; the local semantic perception attention network strengthens the network's attention to feature blocks with the same semantic information and obtains a feature map with enhanced occluded-target semantics, improving detection when the target is occluded by a power line.
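A minimal sketch of this wiring is given below, assuming a ResNet-style backbone whose four residual stages are supplied externally; the stage names and the attention-module interface are illustrative assumptions, not the patent's implementation.

```python
import torch.nn as nn

class SemanticAwareResNet(nn.Module):
    """Sketch: backbone with the local semantic perception attention module
    inserted between the first and second residual stages (see the
    LocalSemanticAttention sketch further below)."""
    def __init__(self, stage1, stage2, stage3, stage4, attention):
        super().__init__()
        self.stage1, self.stage2 = stage1, stage2
        self.stage3, self.stage4 = stage3, stage4
        self.attention = attention           # local semantic perception attention

    def forward(self, x):
        f1 = self.attention(self.stage1(x))  # first feature map, semantically enhanced
        f2 = self.stage2(f1)                 # second feature map -> FPN input
        f3 = self.stage3(f2)                 # third feature map  -> FPN input
        f4 = self.stage4(f3)                 # fourth feature map -> FPN input
        return f2, f3, f4
```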
As shown in fig. 3, in one embodiment the local semantic perception attention enhancement network in S200 strengthens the network's attention to feature blocks with the same local semantic information in the first feature map and obtains the locally semantic-enhanced feature map as follows:
S221: apply a sliding-window operation to the first feature map F to obtain equally sized multi-dimensional local semantic blocks F1, and transpose F1 to obtain multi-dimensional local semantic blocks F2;
S222: multiply F1 and F2 as matrices to obtain the Gram matrix G:
G = F1 ⊙ F2
where ⊙ denotes matrix multiplication and each element G_ij of the Gram matrix G represents the similarity between local semantic blocks i and j;
S223: normalize the Gram matrix to obtain the similarity matrix G_n:
G_n = G / ||G||_2
where ||G||_2 denotes the two-norm of G;
S224: average the similarity matrix G_n row-wise to obtain the weights w of the local semantic blocks, multiply the multi-dimensional semantic blocks F1 element-wise by the weight coefficients w to obtain locally semantic-enhanced features, restore these features via an inverse sliding-window operation to a local semantic feature F' with the same size as the input feature map F, and feed F' into a dynamic rectified linear function for adjustment to obtain the occlusion-target feature map with local semantic enhancement.
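The PyTorch sketch below illustrates S221-S224 under stated assumptions: the sliding window is implemented with unfold, the two-norm of G is taken as its Frobenius norm, and the inverse sliding window is implemented with fold averaged by overlap counts; the patent does not fix these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSemanticAttention(nn.Module):
    """Sketch of S221-S224: re-weight local patches by their mean Gram
    similarity. Kernel size and stride are assumptions."""
    def __init__(self, kernel=3, stride=1):
        super().__init__()
        self.kernel, self.stride = kernel, stride

    def forward(self, x):                          # x: (B, C, H, W), feature map F
        B, C, H, W = x.shape
        pad = self.kernel // 2
        # S221: sliding window -> equally sized local semantic blocks F1
        f1 = F.unfold(x, self.kernel, padding=pad, stride=self.stride)   # (B, C*k*k, N)
        f1 = f1.transpose(1, 2)                    # (B, N, C*k*k)
        # S222: Gram matrix G = F1 @ F1^T; G[i, j] ~ similarity of blocks i, j
        g = torch.bmm(f1, f1.transpose(1, 2))      # (B, N, N)
        # S223: normalize by the two-norm of G (Frobenius norm assumed)
        g_n = g / g.flatten(1).norm(p=2, dim=1).view(B, 1, 1).clamp_min(1e-6)
        # S224: row-wise mean -> block weights w; re-weight blocks element-wise
        w = g_n.mean(dim=2, keepdim=True)          # (B, N, 1)
        f1 = f1 * w
        # inverse sliding window (fold), averaged by patch-overlap counts
        out = F.fold(f1.transpose(1, 2), (H, W), self.kernel,
                     padding=pad, stride=self.stride)
        ones = F.unfold(torch.ones_like(x), self.kernel, padding=pad, stride=self.stride)
        counts = F.fold(ones, (H, W), self.kernel, padding=pad, stride=self.stride)
        return out / counts                        # local semantic feature F'
```

The returned F' would then pass through the dynamic rectified linear adjustment described next.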
In one embodiment, feeding the local semantic feature F' into the dynamic rectified linear function for adjustment in S224 to obtain the locally semantic-enhanced occlusion-target feature map comprises:
S2241: apply global average pooling to the local semantic feature F' to obtain a global feature g;
S2242: successively apply convolution-ReLU and convolution-Sigmoid operations to the global feature g to obtain the coefficients C of the dynamic rectified linear function, limited to the numerical range [-0.5, 0.5]:
C = Sigmoid(conv2(R(conv1(g)))) - 0.5
where conv1 and conv2 each denote a 1x1 convolution, R denotes ReLU, and the Sigmoid function limits the convolution output to [0, 1];
S2243: split the coefficients C along the channel dimension into linear coefficients a1, b1, a2, and b2; the final DyReLU output is:
O = Max(F' * (a1*2 + 1) + b1, F' * (a2*2) + b2)
where Max denotes the element-wise maximum, * denotes element-wise multiplication, and O is the locally semantic-enhanced occlusion-target feature map output by DyReLU.
Specifically, this dynamic adjustment yields output features whose occluded-target semantics are more salient, improving the network's semantic perception; the features are then fed into the subsequent residual layers to complete extraction of the occluded target's semantic features.
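A sketch of S2241-S2243 follows; the assumption that conv1 keeps the channel count and conv2 quadruples it (so that C splits into the four coefficient groups) is ours, as the patent does not specify layer widths.

```python
import torch
import torch.nn as nn

class DyReLUAdjust(nn.Module):
    """Sketch of the dynamic rectified linear adjustment (S2241-S2243)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 1)      # 1x1 conv before ReLU
        self.conv2 = nn.Conv2d(channels, 4 * channels, 1)  # 1x1 conv -> a1, b1, a2, b2
        self.relu = nn.ReLU()

    def forward(self, f_prime):                            # f_prime: (B, C, H, W), F'
        g = f_prime.mean(dim=(2, 3), keepdim=True)         # S2241: global average pooling
        c = torch.sigmoid(self.conv2(self.relu(self.conv1(g)))) - 0.5  # S2242: C in [-0.5, 0.5]
        a1, b1, a2, b2 = torch.chunk(c, 4, dim=1)          # S2243: split along channels
        # O = Max(F' * (2*a1 + 1) + b1, F' * (2*a2) + b2), element-wise
        return torch.maximum(f_prime * (2 * a1 + 1) + b1,
                             f_prime * (2 * a2) + b2)
```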
Finally, the outputs of the second, third, and fourth residual layers of the residual network with local semantic perception attention enhancement serve as the inputs to the feature pyramid network.
In one embodiment, the feature pyramid network in S200 fuses the depth feature maps of different resolutions and obtains fused feature maps of different resolutions as follows:
S225: reduce the dimensionality of the second, third, and fourth feature maps with 1x1 convolutions to obtain dimension-reduced feature maps;
S226: fuse the dimension-reduced second, third, and fourth feature maps from top to bottom: bilinearly interpolate the upper-layer features to the size of the lower-layer features, then add the two feature maps element-wise to fuse their semantic information, yielding the fused second, third, and fourth feature maps at their respective resolutions. Finally, apply a 3x3 convolution to the fused fourth feature map to obtain a fifth feature map and a further 3x3 convolution to the fifth feature map to obtain a sixth feature map; the fused feature maps at different resolutions are thus the fused second, third, and fourth feature maps together with the fifth and sixth feature maps.
Specifically, in this embodiment the feature pyramid network fuses the three input feature maps of different resolutions. The features are first reduced to 256 dimensions by 1x1 convolutions, cutting the model's computational load. They are then fused from top to bottom by element-wise addition, with bilinear interpolation keeping the sizes of the different-resolution features consistent before the addition so that the low-level features carry high-level semantic information; the feature pyramid thereby further strengthens the network's semantic perception. The lowest-resolution feature map then passes through two successive 3x3 convolution layers to produce two further feature maps, and the resulting five feature maps serve as the input to the detection head.
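A compact sketch of this fusion is below. The input channel counts (512/1024/2048) and the stride-2 setting of the two extra 3x3 convolutions are assumptions; the patent specifies only the 256-dimensional 1x1 laterals, bilinear interpolation with element-wise addition, and the two extra 3x3 convolutions.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Sketch of S225-S226: lateral 1x1 convs, top-down bilinear fusion,
    two extra 3x3 convs producing the fifth and sixth maps."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.conv5 = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)
        self.conv6 = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)

    def forward(self, c2, c3, c4):
        p2, p3, p4 = (lat(c) for lat, c in zip(self.lateral, (c2, c3, c4)))
        # top-down: upsample the upper map to the lower map's size, add element-wise
        p3 = p3 + F.interpolate(p4, size=p3.shape[-2:], mode='bilinear', align_corners=False)
        p2 = p2 + F.interpolate(p3, size=p2.shape[-2:], mode='bilinear', align_corners=False)
        p5 = self.conv5(p4)          # fifth feature map (3x3 conv on the fused fourth map)
        p6 = self.conv6(p5)          # sixth feature map (3x3 conv on the fifth map)
        return p2, p3, p4, p5, p6    # five fused maps fed to the detection head
```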
In one embodiment, the detection head comprises a classification branch, a regression branch, and an IoU-prediction branch. The classification branch comprises four 3x3 convolution layers and one 1x1 convolution layer; the fused features at different resolutions pass through these five layers to output category predictions and obtain classification scores. The regression branch comprises four 3x3 convolution layers and one 1x1 convolution layer, through which the fused features at different resolutions output predictions of the occluded target's position. The IoU-prediction branch comprises one 1x1 convolution layer attached after the last 3x3 convolution layer of the regression branch; its output is the predicted intersection-over-union between the regression branch's result and the ground truth.
Specifically, the classification branch predicts whether a target belongs to the bird's-nest class; the regression branch predicts the coordinates of the bird's nest in the original image; and the IoU-prediction branch predicts the IoU between the localization branch's result and the ground truth. In the model testing stage, the decision score is obtained by multiplying the predicted IoU by the classification score, which improves the model's ability to recognize bird's nests.
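A sketch of the three-branch head follows; the 256-channel width matches the FPN sketch above, and the single bird's-nest class is this embodiment's setting.

```python
import torch.nn as nn

def conv_tower(channels, depth=4):
    """depth 3x3 conv-ReLU layers, the shared structure of both branches."""
    layers = []
    for _ in range(depth):
        layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
    return nn.Sequential(*layers)

class DetectionHead(nn.Module):
    """Sketch: classification (4x 3x3 + 1x1), regression (4x 3x3 + 1x1, four
    box values), and an IoU-prediction 1x1 conv on the last regression feature."""
    def __init__(self, channels=256, num_classes=1):
        super().__init__()
        self.cls_tower = conv_tower(channels)
        self.reg_tower = conv_tower(channels)
        self.cls_out = nn.Conv2d(channels, num_classes, 1)  # classification scores
        self.reg_out = nn.Conv2d(channels, 4, 1)            # box position prediction
        self.iou_out = nn.Conv2d(channels, 1, 1)            # predicted IoU with ground truth

    def forward(self, feat):
        reg_feat = self.reg_tower(feat)   # output of the regression branch's last 3x3 conv
        return (self.cls_out(self.cls_tower(feat)),
                self.reg_out(reg_feat),
                self.iou_out(reg_feat))
```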
S300: train the semantic perception attention neural network on the bird's-nest training set, backpropagate through it according to the bird's-nest validation set and the preset network loss function, update the network weights, and obtain the trained semantic perception attention neural network after the preset number of training epochs.
Specifically, after the network completes the preset training epochs, the model configuration file, log file, and training weights from the training process are stored together for result analysis and network model optimization.
Further, the training server was a GTX 3090 with 24 GB of video memory, CUDA version 11.1, and PyTorch version 1.8.0. In addition, pictures were uniformly resized to 1333 x 800 as input, the random flip rate was set to 0.5, the batch size to 4, and the number of GPU worker threads to 4; training ran for 12 epochs with a learning rate of 0.01, the learning-rate decay schedule reduced the rate to 0.1 times its current value at epochs 8 and 11, and the optimizer was SGD.
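These settings translate into roughly the following optimizer and schedule; the momentum and weight-decay values are assumptions, since the patent does not state them.

```python
import torch

# Placeholder module standing in for the full network; the data pipeline
# (batch size 4, 1333 x 800 inputs) is omitted here.
model = torch.nn.Conv2d(3, 1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)  # momentum/decay assumed
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[8, 11], gamma=0.1)

for epoch in range(12):          # 12 training epochs
    # ... one pass over the training set: loss.backward(), optimizer.step() ...
    scheduler.step()             # lr x0.1 at epochs 8 and 11
```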
In one embodiment, the network loss function preset in S300 comprises a classification loss, a regression loss, and a semantic consistency loss.
The classification loss and regression loss are the FL (focal) loss and the GIoU loss, respectively:
FL(p, y) = -y(1-p)^γ · log(p) - (1-y) · p^γ · log(1-p)
GIoU = IoU - (A_c - U) / A_c
where y is the classification label, p the predicted classification score, γ a hyperparameter, IoU the intersection-over-union between the ground-truth box A and the predicted box B, C the minimum enclosing shape of A and B, A_c the area of C, and U the area of the union of A and B.
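Hedged reference implementations of these two formulas are sketched below; the (x1, y1, x2, y2) box format is an assumption, and the GIoU loss is then commonly taken as 1 - GIoU.

```python
import torch

def focal_loss(p, y, gamma=2.0):
    """FL(p, y) = -y(1-p)^γ log p - (1-y) p^γ log(1-p); p are probabilities.
    gamma=2.0 is a common default, assumed here."""
    p = p.clamp(1e-6, 1 - 1e-6)                       # numerical safety
    return -(y * (1 - p) ** gamma * p.log()
             + (1 - y) * p ** gamma * (1 - p).log())

def giou(box_a, box_b):
    """GIoU = IoU - (A_c - U) / A_c for (x1, y1, x2, y2) boxes."""
    lt = torch.maximum(box_a[..., :2], box_b[..., :2])
    rb = torch.minimum(box_a[..., 2:], box_b[..., 2:])
    inter = (rb - lt).clamp(min=0).prod(-1)           # intersection area
    area_a = (box_a[..., 2:] - box_a[..., :2]).prod(-1)
    area_b = (box_b[..., 2:] - box_b[..., :2]).prod(-1)
    union = area_a + area_b - inter                   # U
    iou = inter / union.clamp(min=1e-6)
    lt_c = torch.minimum(box_a[..., :2], box_b[..., :2])
    rb_c = torch.maximum(box_a[..., 2:], box_b[..., 2:])
    area_c = (rb_c - lt_c).clamp(min=0).prod(-1)      # A_c, minimum enclosing box
    return iou - (area_c - union) / area_c.clamp(min=1e-6)
```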
The Pull Loss is selected as the semantic consistency loss:
L_pull = -ln(1 - N_t + IoU(b_max, b_m)) · s_m
where L_pull denotes the Pull Loss, N_t a preset threshold, IoU(·,·) the intersection-over-union, b_max the detection box corresponding to the maximum classification score, b_m the ground-truth box corresponding to that maximum score, and s_m the maximum score predicted by the network.
Specifically, the semantic consistency loss shrinks the distance between the detection box corresponding to the highest prediction score and the true annotated box, since detection boxes of the same target should be semantically consistent. Because an occluded target's semantics are consistent, this consistency can constrain erroneous detection boxes of the same target, suppressing the retention of wrong prediction boxes by the non-maximum suppression algorithm during training and improving detection of occluded targets.
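A sketch of the Pull Loss under the formula above, reusing plain IoU; the threshold value N_t = 0.5 is an assumption.

```python
import torch

def pull_loss(b_max, b_gt, s_max, n_t=0.5):
    """L_pull = -ln(1 - N_t + IoU(b_max, b_m)) * s_m.
    b_max: highest-scoring detection box, b_gt: its matched ground-truth box,
    s_max: the maximum predicted score; n_t = 0.5 is assumed."""
    lt = torch.maximum(b_max[..., :2], b_gt[..., :2])
    rb = torch.minimum(b_max[..., 2:], b_gt[..., 2:])
    inter = (rb - lt).clamp(min=0).prod(-1)
    union = ((b_max[..., 2:] - b_max[..., :2]).prod(-1)
             + (b_gt[..., 2:] - b_gt[..., :2]).prod(-1) - inter)
    iou = inter / union.clamp(min=1e-6)
    return -torch.log((1.0 - n_t + iou).clamp(min=1e-6)) * s_max
```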
S400: feed the dataset into the trained local semantic perception attention-enhanced neural network to obtain the confidence and position of each occluded target, and determine the final output as the target recognition result by combining the confidences and positions with a non-maximum suppression algorithm.
Specifically, the trained weight file is loaded into the neural network model; the detection head's outputs serve as the model's recognition results, giving the confidence and position of each occluded target; and the best outputs are screened out by the non-maximum suppression algorithm as the model's recognition result.
In one embodiment, determining the final output as the target recognition result in S400, according to the confidence and position of the occluded target combined with a non-maximum suppression algorithm, comprises:
S410: sort the predictions by the occluded target's confidence score and save the prediction with the highest score into the algorithm's output list;
S420: compute the intersection-over-union between each remaining prediction and the highest-scoring prediction, suppress predictions whose IoU is greater than or equal to a preset IoU threshold, and retain those whose IoU is below the threshold;
S430: repeat S410-S420 on the retained predictions until no prediction remains whose IoU exceeds the preset threshold, obtaining the finally retained predictions;
S440: select the prediction with the highest confidence score among the finally retained predictions as the target recognition result.
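A self-contained sketch of this greedy procedure in plain PyTorch; the (x1, y1, x2, y2) box format and the 0.5 threshold are assumptions.

```python
import torch

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression (S410-S430).
    boxes: (N, 4) tensor in (x1, y1, x2, y2); returns indices of kept boxes."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(int(i))                        # S410: keep the top-scoring box
        if order.numel() == 1:
            break
        rest = order[1:]
        lt = torch.maximum(boxes[i, :2], boxes[rest, :2])
        rb = torch.minimum(boxes[i, 2:], boxes[rest, 2:])
        inter = (rb - lt).clamp(min=0).prod(-1)
        area_i = (boxes[i, 2:] - boxes[i, :2]).prod(-1)
        area_r = (boxes[rest, 2:] - boxes[rest, :2]).prod(-1)
        iou = inter / (area_i + area_r - inter).clamp(min=1e-6)
        order = rest[iou < iou_thresh]             # S420: suppress heavy overlaps
    return keep                                    # S440: keep[0] is the top result
```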
As shown in fig. 3, in one embodiment the method further comprises, after S400:
S500: compute the recognition accuracy of the local semantic perception attention-enhanced neural network and quantitatively analyze its recognition performance.
In one embodiment, S500 comprises:
computing the precision and recall of the local semantic perception attention-enhanced neural network:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
where TP is the number of correctly predicted target boxes, FP the number of boxes predicting wrong targets, and FN the number of targets wrongly predicted as background.
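The two formulas reduce to a few lines; how TP, FP, and FN are counted (typically by matching predictions to labels at a fixed IoU threshold) is left to the evaluator and assumed here.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN).
    tp/fp/fn are integer counts from matching detections to annotations."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# e.g. precision_recall(90, 10, 20) -> (0.9, 0.8181...)
```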
S600: visualize the network's outputs with a visualization tool, visually analyze the network's detection performance by comparing the visualized results, and qualitatively analyze its recognition performance.
Specifically, a visualization tool performs qualitative analysis of the detection results: it maps the model's predictions and the ground-truth annotations onto the original image for further qualitative analysis of detection performance, as shown in fig. 4. The tool has the following functions: 1) it controls the display of different detection results through confidence and IoU thresholds, enabling analysis of model performance and a more intuitive display of recognition results such as bird's nests on power lines; 2) it displays the differences between the current detections and the ground truth, including the numbers of ground-truth boxes and detection boxes and the detection accuracy; inaccurate detection boxes are drawn in red, accurate ones in green, and ground-truth annotations in yellow. The visualization tool thus enables qualitative analysis of the model, helps improve the detection method in a targeted way, and presents detection results more intuitively.
According to the method for identifying occluded targets based on a local semantic perception attention neural network, the designed local semantic perception attention enhancement network strengthens feature blocks with the same semantics in feature space; feature maps of different resolutions are fused through the feature pyramid; and the classification, regression, and semantic consistency losses constrain the model's convergence. Quantitative and qualitative analysis of the model's detection results is of practical significance for subsequent algorithm optimization. The method can effectively improve the recognition of occluded targets, advance autonomous inspection in the power industry, and bring economic and social benefits.
The method for identifying occluded targets based on a local semantic perception attention neural network provided by the present invention has been described in detail above. Specific examples are used herein to explain the principles and embodiments of the invention and to assist in understanding its core concepts. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the invention, and such improvements and modifications also fall within the scope of the claims of the present invention.

Claims (10)

1. A method for identifying occluded targets based on a local semantic perception attention neural network, characterized by comprising the following steps:
S100: acquiring image data of a region under test, preprocessing the image data, annotating the position and category of each occluded target in the image data, and constructing training, validation, and test datasets according to a preset split ratio;
S200: building a local semantic perception attention-enhanced neural network comprising a residual network with local semantic perception attention enhancement, a feature pyramid network, and a detection head, wherein the residual network extracts depth feature maps at different resolutions, the feature pyramid network fuses the depth feature maps at different resolutions to obtain fused depth feature maps at different resolutions, and the detection head predicts the position and category of the occluded target from the fused depth feature maps;
S300: training the local semantic perception attention-enhanced neural network on the constructed training set, backpropagating through the network according to the validation set and a preset network loss function, updating the network weights, and obtaining the trained network after a preset number of training epochs;
S400: feeding the test set into the trained local semantic perception attention-enhanced neural network to obtain the confidence and position of each occluded target, and determining the final output as the target recognition result by combining the confidences and positions with a non-maximum suppression algorithm.
2. The method according to claim 1, wherein the residual network with local semantic perception attention enhancement in S200 comprises, connected in sequence, a first residual block, the local semantic perception attention enhancement network, a second residual block, a third residual block, and a fourth residual block,
wherein the first residual block extracts semantic features from an input training-set picture to obtain a first feature map and inputs it to the local semantic perception attention enhancement network; the local semantic perception attention enhancement network strengthens the network's attention to local feature blocks carrying the same semantic information in the first feature map, obtains a locally semantic-enhanced feature map, and sends it to the second residual block; the second residual block extracts semantic features from the locally semantic-enhanced feature map to obtain a second feature map; the third residual block extracts semantic features from the second feature map to obtain a third feature map; the fourth residual block extracts semantic features from the third feature map to obtain a fourth feature map; and the second, third, and fourth feature maps are input to the feature pyramid network.
3. The method according to claim 2, wherein the local semantic perception attention enhancement network in S200 strengthens the network's attention to feature blocks with the same local semantic information in the first feature map and obtains the locally semantic-enhanced feature map by:
S221: applying a sliding-window operation to the first feature map F to obtain equally sized multi-dimensional local semantic blocks F1, and transposing F1 to obtain multi-dimensional local semantic blocks F2;
S222: multiplying F1 and F2 as matrices to obtain a Gram matrix G:
G = F1 ⊙ F2
where ⊙ denotes matrix multiplication and each element G_ij of the Gram matrix G represents the similarity between local semantic blocks i and j;
S223: normalizing the Gram matrix to obtain a similarity matrix G_n:
G_n = G / ||G||_2
where ||G||_2 denotes the two-norm of G;
S224: averaging the similarity matrix G_n row-wise to obtain the weights w of the local semantic blocks, multiplying the multi-dimensional semantic blocks F1 element-wise by the weight coefficients w to obtain locally semantic-enhanced features, restoring these features via an inverse sliding-window operation to a local semantic feature F' with the same size as the input feature F, and feeding F' into a dynamic rectified linear function for adjustment to obtain the occlusion-target feature map with local semantic enhancement.
4. The method according to claim 3, wherein feeding the local semantic feature F' into the dynamic rectified linear function for adjustment in S224 to obtain the locally semantic-enhanced occlusion-target feature map comprises:
S2241: applying global average pooling to the local semantic feature F' to obtain a global feature g;
S2242: successively applying convolution-ReLU and convolution-Sigmoid operations to the global feature g to obtain coefficients C of the dynamic rectified linear function, limited to the numerical range [-0.5, 0.5]:
C = Sigmoid(conv2(R(conv1(g)))) - 0.5
where conv1 and conv2 each denote a 1x1 convolution, R denotes ReLU, and the Sigmoid function limits the convolution output to [0, 1];
S2243: splitting the coefficients C along the channel dimension into linear coefficients a1, b1, a2, and b2, the final DyReLU output being:
O = Max(F' * (a1*2 + 1) + b1, F' * (a2*2) + b2)
where Max denotes the element-wise maximum, * denotes element-wise multiplication, and O is the locally semantic-enhanced occlusion-target feature map output by DyReLU.
5. The method according to claim 4, wherein the feature pyramid network in S200 fuses depth feature maps of different resolutions and obtains fused feature maps of different resolutions by:
S225: reducing the dimensionality of the second, third, and fourth feature maps with 1x1 convolutions to obtain dimension-reduced feature maps;
S226: fusing the dimension-reduced second, third, and fourth feature maps from top to bottom, bilinearly interpolating the upper-layer features to the size of the lower-layer features and adding the two feature maps element-wise to fuse their semantic information, thereby obtaining the fused second, third, and fourth feature maps at different resolutions; and finally applying a 3x3 convolution to the fused fourth feature map to obtain a fifth feature map and a further 3x3 convolution to the fifth feature map to obtain a sixth feature map, the fused feature maps at different resolutions being the fused second, third, and fourth feature maps together with the fifth and sixth feature maps.
6. The method of claim 5, wherein the detection head comprises a classification branch, a regression branch, and an IoU-prediction branch; the classification branch comprises four 3x3 convolution layers and one 1x1 convolution layer, the fused features at different resolutions passing through these five layers to output category predictions and obtain classification scores; the regression branch comprises four 3x3 convolution layers and one 1x1 convolution layer, through which the fused features at different resolutions output predictions of the occluded target's position; and the IoU-prediction branch comprises one 1x1 convolution layer attached after the last 3x3 convolution layer of the regression branch, its output being the predicted intersection-over-union between the regression branch's result and the ground truth.
7. The method according to claim 6, wherein the network loss function preset in S300 comprises a classification loss, a regression loss, and a semantic consistency loss;
the classification loss and regression loss are the FL loss and GIoU loss, respectively:
FL(p, y) = -y(1-p)^γ · log(p) - (1-y) · p^γ · log(1-p)
GIoU = IoU - (A_c - U) / A_c
where y is the classification label, p the predicted classification score, γ a hyperparameter, IoU the intersection-over-union between the ground-truth box A and the predicted box B, C the minimum enclosing shape of A and B, A_c the area of C, and U the area of the union of A and B;
the Pull Loss is selected as the semantic consistency loss:
L_pull = -ln(1 - N_t + IoU(b_max, b_m)) · s_m
where L_pull denotes the Pull Loss, N_t a preset threshold, IoU(·,·) the intersection-over-union, b_max the detection box corresponding to the maximum classification score, b_m the ground-truth box corresponding to that maximum score, and s_m the maximum score predicted by the network.
8. The method of claim 7, wherein determining the final output as the target recognition result in S400, according to the confidence and position of the occluded target combined with a non-maximum suppression algorithm, comprises:
S410: sorting the predictions by the occluded target's confidence score and saving the prediction with the highest score into the algorithm's output list;
S420: computing the intersection-over-union between each remaining prediction and the highest-scoring prediction, suppressing predictions whose IoU is greater than or equal to a preset IoU threshold, and retaining those whose IoU is below the threshold;
S430: repeating S410-S420 on the retained predictions until no prediction remains whose IoU exceeds the preset threshold, to obtain the finally retained predictions;
S440: selecting the prediction with the highest confidence score among the finally retained predictions as the target recognition result.
9. The method of claim 8, further comprising, after S400:
S500: computing the recognition accuracy of the local semantic perception attention-enhanced neural network and quantitatively analyzing its recognition performance;
S600: visualizing the network's outputs with a visualization tool, visually comparing the visualized results to analyze the network's detection performance, and qualitatively analyzing its recognition performance.
10. The method of claim 9, wherein S500 comprises:
computing the precision and recall of the local semantic perception attention-enhanced neural network:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
where TP is the number of correctly predicted target boxes, FP the number of boxes predicting wrong targets, and FN the number of targets wrongly predicted as background.
CN202310018475.2A 2023-01-06 2023-01-06 Occlusion target identification method based on local semantic perception attention neural network Pending CN115965786A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310018475.2A CN115965786A (en) 2023-01-06 2023-01-06 Occlusion target identification method based on local semantic perception attention neural network


Publications (1)

Publication Number Publication Date
CN115965786A true CN115965786A (en) 2023-04-14

Family

ID=87357867


Country Status (1)

Country Link
CN (1) CN115965786A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116503289A (en) * 2023-06-20 2023-07-28 北京天工异彩影视科技有限公司 Visual special effect application processing method and system
CN116503289B (en) * 2023-06-20 2024-01-09 北京天工异彩影视科技有限公司 Visual special effect application processing method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination