CN117422975A - Image identification recognition method, device and equipment based on YOLOv3 - Google Patents

Image identification recognition method, device and equipment based on YOLOv3

Info

Publication number
CN117422975A
Authority
CN
China
Prior art keywords
feature
identification
feature map
layer
spatial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311378736.8A
Other languages
Chinese (zh)
Inventor
王君至
沈大勇
姚锋
张忠山
王涛
程力
王沛
闫俊刚
潘雨
杜永浩
陈英武
吕济民
陈宇宁
陈盈果
刘晓路
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202311378736.8A priority Critical patent/CN117422975A/en
Publication of CN117422975A publication Critical patent/CN117422975A/en
Pending legal-status Critical Current


Classifications

    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/08: Learning methods
    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/454: Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/763: Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 2201/09: Recognition of logos


Abstract

The application relates to a YOLOv3-based method, device and equipment for recognizing identifiers in images. The method comprises the following steps: an identification recognition model is constructed, the model comprising a feature extraction unit, a feature pyramid and a spatial feature fusion unit connected in sequence; each feature extraction module in the feature extraction unit comprises a spatial attention layer and a residual block and performs spatial feature extraction on an input feature map, outputting a spatial feature map; the feature pyramid enhances each input spatial feature map to obtain enhanced feature maps; the spatial feature fusion unit performs spatial feature fusion on the enhanced feature maps to obtain the corresponding fused feature maps; the fused feature maps are input into the detection head of the recognition model, which outputs the corresponding recognition results; and the trained recognition model is used to recognize the identifiers in an image to be detected. The method can accurately recognize identifier information in images.

Description

Image identification recognition method, device and equipment based on YOLOv3
Technical Field
The present disclosure relates to the field of image recognition technologies, and in particular, to a YOLOv3-based image identification recognition method, device, and equipment.
Background
Identification image recognition is widely used in many fields, including autonomous driving, object recognition and medical image analysis, where recognizing the identifiers in an image can support automated processing. Identification image recognition methods aim to detect and understand identifiers, symbols or specific patterns in an image by means of computer vision and image processing techniques. Template matching is a basic recognition method that compares a known image template with the image to be recognized: when the template closely matches a region of the image, the identifier is recognized. This approach is suitable for simple identifiers but cannot effectively recognize the identifiers in complex identification images. With machine learning techniques, a model can be trained to automatically extract features from the image and perform identifier recognition, which is very effective for complex identification images.
However, conventional methods cannot effectively recognize densely packed identifiers in an image; errors arise in target localization, so the identifier positions are not sufficiently accurate; and in images with small, dense identifiers, the identifiers may be occluded by other objects, in which case conventional methods cannot correctly detect and recognize them, so automated processing of the identifier information cannot be achieved.
Disclosure of Invention
Accordingly, there is a need to provide a YOLOv3-based image identification recognition method, device and equipment to solve the above technical problems.
A YOLOv3-based image identification recognition method, the method comprising:
acquiring a plurality of preprocessed identification image samples, each identification image sample having its identifiers marked with labels;
constructing an identification recognition model; the identification recognition model adopts a YOLOv3 model as a basic framework; the identification model comprises a feature extraction unit, a feature pyramid and a spatial feature fusion unit which are connected in sequence; each feature extraction module in the feature extraction unit comprises a spatial attention layer and a residual block, and is used for carrying out spatial feature extraction on an input feature map and outputting a spatial feature map; the feature pyramid is used for carrying out enhancement processing on each input space feature map to obtain an enhanced feature map; the spatial feature fusion unit is used for carrying out spatial feature fusion on each enhancement feature map to obtain a corresponding fusion feature map; the fusion feature map is input into a detection head of the identification recognition model to output a corresponding identification recognition result;
training the identification recognition model according to the label of the identification image sample and a prediction result obtained by inputting the identification image sample into the identification recognition model to obtain a trained identification recognition model;
inputting the identification image to be detected into the trained identification recognition model to obtain an identification recognition result of the identification image to be detected, and entering the identifier information according to the recognition result.
In one embodiment, the method further comprises: clustering the bounding boxes corresponding to the identifier regions on each identification image sample to obtain prior boxes; and guiding the training of the identification recognition model according to the prior boxes.
In one embodiment, the method further comprises: the identification recognition model comprises a residual network; the residual network comprises an upper-layer residual block, a middle-upper-layer residual block and the feature extraction unit connected in sequence; and the feature extraction unit comprises a middle-layer feature extraction module, a middle-lower-layer feature extraction module and a bottom-layer feature extraction module connected in sequence.
In one embodiment, the method further comprises: performing convolution on the bottom-layer spatial feature map output by the bottom-layer feature extraction module multiple times and outputting a bottom-layer enhanced feature map; convolving and up-sampling the bottom-layer enhanced feature map, splicing it with the middle-lower-layer spatial feature map output by the middle-lower-layer feature extraction module, and outputting a middle-lower-layer enhanced feature map; and convolving and up-sampling the middle-lower-layer enhanced feature map, splicing it with the middle-layer spatial feature map output by the middle-layer feature extraction module, and outputting a middle-layer enhanced feature map.
In one embodiment, the method further comprises: and adjusting the sizes of the enhancement feature images corresponding to the other two layers according to the sizes of the enhancement feature images of the target layer, and fusing the enhancement feature images with the same size according to the weight of each layer of enhancement feature images to obtain a fused feature image corresponding to the target layer.
In one embodiment, the method further comprises: the loss function of the identification recognition model is GIoU.
A YOLOv3-based image identification recognition device, the device comprising:
the sample acquisition module is used for acquiring an identification image sample; the identification image sample is marked by a label;
the model construction module is used for constructing an identification recognition model; the identification recognition model adopts a YOLOv3 model as a basic framework; the identification model comprises a feature extraction unit, a feature pyramid and a spatial feature fusion unit which are connected in sequence; each feature extraction module in the feature extraction unit comprises a spatial attention layer and a residual block, and is used for carrying out spatial feature extraction on an input feature map and outputting a spatial feature map; the feature pyramid is used for carrying out enhancement processing on each input space feature map to obtain an enhanced feature map; the spatial feature fusion unit is used for carrying out spatial feature fusion on each enhancement feature map to obtain a corresponding fusion feature map; the fusion feature map is input into a detection head of the identification recognition model to output a corresponding identification recognition result;
the model training module is used for training the identification recognition model according to the label of the identification image sample and a prediction result obtained by inputting the identification image sample into the identification recognition model to obtain a trained identification recognition model;
the identification recognition module is used for inputting the identification image to be detected into the trained identification recognition model to obtain an identification recognition result of the identification image to be detected.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring an identification image sample; the identification image sample is marked by a label;
constructing an identification recognition model; the identification recognition model adopts a YOLOv3 model as a basic framework; the identification model comprises a feature extraction unit, a feature pyramid and a spatial feature fusion unit which are connected in sequence; each feature extraction module in the feature extraction unit comprises a spatial attention layer and a residual block, and is used for carrying out spatial feature extraction on an input feature map and outputting a spatial feature map; the feature pyramid is used for carrying out enhancement processing on each input space feature map to obtain an enhanced feature map; the spatial feature fusion unit is used for carrying out spatial feature fusion on each enhancement feature map to obtain a corresponding fusion feature map; the fusion feature map is input into a detection head of the identification recognition model to output a corresponding identification recognition result;
training the identification recognition model according to the label of the identification image sample and a prediction result obtained by inputting the identification image sample into the identification recognition model to obtain a trained identification recognition model;
inputting the identification image to be detected into the trained identification recognition model to obtain an identification recognition result of the identification image to be detected.
In the YOLOv3-based image identification recognition method, device and equipment, identification image samples are acquired and labelled so that an identification recognition model can be trained. The identification recognition model adopts the YOLOv3 model as its basic framework and comprises a feature extraction unit, a feature pyramid and a spatial feature fusion unit connected in sequence. Each feature extraction module in the feature extraction unit comprises a spatial attention layer and a residual block, performs spatial feature extraction on an input feature map and outputs a spatial feature map; by detecting the key features of the identifier region, the feature extraction unit reduces errors in target localization and identifies the identifier positions accurately. The feature pyramid enhances each input spatial feature map to obtain enhanced feature maps, and the spatial feature fusion unit performs spatial feature fusion on the enhanced feature maps to obtain the corresponding fused feature maps; the adaptive spatial feature fusion of multiple enhanced feature maps filters out useless information and retains useful information, thereby improving the accuracy of the predicted identifier information. The fused feature maps are input into the detection head of the identification recognition model to output the corresponding recognition results, and the trained model is used to recognize the identifiers in the image to be detected. The embodiment of the invention can achieve efficient recognition of the identifier information on identification images.
Drawings
FIG. 1 is a flow chart of a YOLOv3-based image identification recognition method in one embodiment;
FIG. 2 is a schematic diagram of the structure of an identification recognition model in one embodiment;
FIG. 3 is a block diagram of an image identification recognition device based on YOLOv3 in one embodiment;
fig. 4 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, there is provided a YOLOv3-based image identification recognition method, including the steps of:
step 102, acquiring an identification image sample.
The identification image samples are marked by labels: the samples contain identifiers, and the identifier regions on each sample are marked to obtain a training set. The identification image may be a personnel photo image, such as a work badge or certificate photo; in large organizations such as enterprises and factories, the badge or certificate photo carries information such as the person's grade, authority and job type, so personnel identity information can be obtained by recognizing the identifiers on the badge or certificate photo, which further enables automatic collection of basic personnel information.
And 104, constructing an identification recognition model.
The identification recognition model adopts a YOLOv3 model as a basic framework; the identification model comprises a feature extraction unit, a feature pyramid and a spatial feature fusion unit which are connected in sequence; each feature extraction module in the feature extraction unit comprises a spatial attention layer and a residual block, and is used for carrying out spatial feature extraction on an input feature map and outputting a spatial feature map; the feature pyramid is used for carrying out enhancement processing on each input space feature map to obtain an enhanced feature map; the spatial feature fusion unit is used for carrying out spatial feature fusion on each enhancement feature map to obtain a corresponding fusion feature map; the fusion feature map is input into a detection head of the identification model to output a corresponding identification result. The identification recognition model can be used for rapidly and efficiently recognizing the identification information, and the recognition speed can reach the real-time recognition standard.
Step 106, training the identification recognition model according to the label of the identification image sample and a prediction result obtained by inputting the identification image sample into the identification recognition model to obtain a trained identification recognition model;
and step 108, inputting the identification image to be detected into a trained identification recognition model to obtain an identification recognition result of the identification image to be detected.
By recognizing the identifiers on the identification image to be detected, the identifier information can be entered automatically, and data processing and data analysis can then be performed using the entered information.
In the YOLOv3-based image identification recognition method, identification image samples are acquired and labelled so that the identification recognition model can be trained. The model adopts the YOLOv3 model as its basic framework and comprises a feature extraction unit, a feature pyramid and a spatial feature fusion unit connected in sequence. Each feature extraction module in the feature extraction unit comprises a spatial attention layer and a residual block, performs spatial feature extraction on an input feature map and outputs a spatial feature map; the feature extraction unit detects the key features of the identifier region, which reduces errors in target localization and locates the identifier accurately. The feature pyramid enhances each input spatial feature map to obtain enhanced feature maps, and the spatial feature fusion unit fuses the enhanced feature maps to obtain the corresponding fused feature maps, filtering out useless information and retaining useful information and thereby improving the accuracy of the predicted identifier information. The fused feature maps are input into the detection head of the identification recognition model to output the corresponding recognition results, and the trained model recognizes the identifiers in the image to be detected. The embodiment of the invention can achieve efficient recognition of the identifier information on identification images.
In one embodiment, the method further comprises: obtaining the bounding boxes corresponding to the identifier regions on each identification image sample in the training set and clustering them to obtain prior boxes; and guiding the training of the identification recognition model according to the prior boxes. In this embodiment, the initial sizes for YOLOv3 bounding box prediction are obtained by K-Means clustering: the widths and heights of all bounding boxes in the annotated dataset are passed as input to the K-Means clustering algorithm, and the best 9 cluster centres are found. The 9 clusters correspond to the 3 feature maps of different scales output by YOLOv3, with 3 anchor boxes per feature map, giving 9 anchor boxes in total. One image generates 10647 anchor boxes altogether.
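For illustration, the following is a minimal sketch of how anchor-box clustering with K-Means could be implemented; the function names (iou_wh, kmeans_anchors) and the use of a 1 - IoU distance are assumptions made for this sketch, not the patent's exact procedure.

```python
# Hypothetical sketch of anchor-box clustering with K-Means, as described above.
# `boxes` is assumed to be an (N, 2) array of (width, height) pairs taken from
# the annotated bounding boxes; 9 clusters give 3 anchors per detection scale.
import numpy as np

def iou_wh(box, clusters):
    # IoU between one (w, h) pair and each cluster centre, both anchored at the origin.
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the cluster with the highest IoU (lowest 1 - IoU distance).
        dists = np.stack([1.0 - iou_wh(b, clusters) for b in boxes])
        nearest = dists.argmin(axis=1)
        for c in range(k):
            members = boxes[nearest == c]
            if len(members) > 0:
                clusters[c] = members.mean(axis=0)   # the median could also be used
    return clusters[np.argsort(clusters.prod(axis=1))]  # sort by area, small to large
```

Called as, for example, kmeans_anchors(wh_array, k=9) on an (N, 2) array of annotated box widths and heights, it would return 9 anchor sizes sorted from small to large, which can then be split 3 per detection scale.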
The bounding box prediction of YOLOv3 predicts positions directly relative to the anchor boxes; the position, size and confidence of the bounding box are computed with a coordinate-offset formula:
b_x = σ(t_x) + c_x,  b_y = σ(t_y) + c_y,  b_w = p_w · e^(t_w),  b_h = p_h · e^(t_h)
where t_x, t_y, t_w, t_h are the model's predicted outputs, c_x, c_y are the grid-cell coordinates, p_w, p_h are the dimensions of the prior (anchor) box, and b_x, b_y, b_w, b_h are the predicted centre coordinates and size of the bounding box. Confidence = Pr(Object) · IoU.
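A minimal sketch of the decoding step above is given below, assuming raw outputs for a single anchor; the function name decode_box and its argument layout are illustrative only.

```python
# Decode one anchor's raw outputs (t_x, t_y, t_w, t_h) into box centre and size,
# given the grid-cell offset (c_x, c_y) and the prior (anchor) size (p_w, p_h).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_box(t, cell, prior):
    t_x, t_y, t_w, t_h = t
    c_x, c_y = cell
    p_w, p_h = prior
    b_x = sigmoid(t_x) + c_x          # centre x within the grid
    b_y = sigmoid(t_y) + c_y          # centre y within the grid
    b_w = p_w * math.exp(t_w)         # width scaled from the prior
    b_h = p_h * math.exp(t_h)         # height scaled from the prior
    return b_x, b_y, b_w, b_h
```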
In the final output, the 10647 anchor boxes generated by the network are filtered by a set of rules to obtain the recognition result. Anchor boxes with low scores are first removed by setting an objectness-score threshold. The problem of multiple anchor boxes detecting the same object is then solved with non-maximum suppression (NMS). NMS screens and sorts the detections to remove overlapping boxes. In the first step, all detections are sorted by confidence score. In the second step, the box with the highest confidence is selected and compared with the remaining boxes, and the indices of all boxes whose overlap area exceeds a preset threshold are determined. In the third step, the boxes found in the second step are screened and compared again, lower-scoring boxes are removed and higher-scoring boxes are retained, making the result more accurate. In the fourth step, the second and third steps are repeated on the remaining detections until the overlap areas of all boxes have been computed and screened.
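The following sketch illustrates the score-threshold plus NMS filtering described above, assuming detections are given as corner-coordinate boxes with per-box confidence scores; the threshold values and helper names are illustrative.

```python
# Illustrative confidence-threshold + NMS filtering on (x1, y1, x2, y2) boxes.
import numpy as np

def box_iou(box, boxes):
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, score_thr=0.3, iou_thr=0.5):
    keep_mask = scores >= score_thr                 # step 0: drop low-score anchors
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = scores.argsort()[::-1]                  # step 1: sort by confidence
    keep = []
    while order.size > 0:
        best = order[0]                             # step 2: take the highest-scoring box
        keep.append(best)
        ious = box_iou(boxes[best], boxes[order[1:]])
        order = order[1:][ious <= iou_thr]          # steps 3-4: drop overlaps, repeat
    return boxes[keep], scores[keep]
```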
In one embodiment, the loss function of the identification recognition model is GIoU. In this embodiment, identifiers are often occluded by other objects or by the background, so the intersection between bounding boxes is small or absent and the IoU cannot be computed accurately. Noise data tend to lie very close to, or even coincide with, the target object, so noise occurring during training strongly interferes with the IoU computation. The invention therefore adopts GIoU (Generalized IoU) as the loss function of YOLOv3, enabling the model to accurately detect and recognize the identifiers in small-size, densely marked images and improving model accuracy. GIoU is defined as
GIoU = IoU - |C \ (A ∪ B)| / |C|
where A and B are the two boxes and C is the smallest enclosing rectangle of the two boxes. GIoU takes the aspect ratio and area difference of the two bounding boxes into account and introduces a normalization term that corrects IoU, so it better represents the similarity between bounding boxes in the recognition task. GIoU considers not only the overlapping area but also the non-overlapping region between the two boxes, and therefore better reflects their degree of overlap.
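A small sketch of the GIoU computation above for axis-aligned boxes given as (x1, y1, x2, y2); the helper name giou is illustrative, and the corresponding loss term would be 1 - GIoU.

```python
# GIoU of two axis-aligned boxes A and B; C is their smallest enclosing rectangle.
def giou(a, b):
    # intersection of A and B
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # smallest enclosing box C
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    return iou - (area_c - union) / area_c

# The loss used for training would then be 1 - giou(pred_box, target_box).
```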
In one embodiment, as shown in fig. 2, a schematic structural diagram of an identification model is provided, where the identification model includes an input layer, a backbone network, a neck network and a detection head, the backbone network includes a residual network, the residual network includes a convolution layer, an upper residual block, and a feature extraction unit, which are sequentially connected, and the neck network includes a feature pyramid and a spatial feature fusion unit.
In one embodiment, the identification recognition model includes a residual network; the residual network comprises an upper-layer residual block, a middle-upper-layer residual block and the feature extraction unit connected in sequence; the feature extraction unit comprises a middle-layer feature extraction module, a middle-lower-layer feature extraction module and a bottom-layer feature extraction module connected in sequence. In this embodiment, the middle-layer, middle-lower-layer and bottom-layer feature extraction modules each contain a spatial attention layer and a residual block. Coordinate Attention (an attention mechanism) highlights important feature regions by modelling the positional information of pixels along the spatial coordinates. Coordinate Attention encodes the spatial coordinates of the pixels of the identifier region into an encoding vector, computes the corresponding weights from the coordinate values of each dimension, and obtains the weighted output by element-wise multiplication of the weights and the encoded vector. In this way, the network is helped to better detect the key features of the identifier region, which reduces errors in target localization, locates the identifier accurately and improves model performance. Specifically, an input feature map is obtained (the feature map output by the preceding residual block), the position coordinates of each pixel are encoded into a vector, the encoded vector and the input feature map are multiplied element by element to obtain a weighted feature map, and a pooling operation is applied to the weighted feature map to obtain the output feature map. The feature extraction unit of the invention highlights important features using spatial coordinate information and performs well in target detection. Because interactions between channels need not be considered, it accelerates model training and inference and is computationally lightweight.
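As one plausible reading of the spatial attention layer described above, the following PyTorch sketch implements a coordinate-attention-style module; the class name, the reduction ratio and the use of mean pooling along each axis are assumptions for this sketch, not the patent's exact design.

```python
# Minimal coordinate-attention-style spatial attention layer (illustrative only).
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        # Encode position along each spatial axis by pooling over the other axis.
        x_h = x.mean(dim=3, keepdim=True)                      # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).transpose(2, 3)      # (n, c, w, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                  # per-row weights
        a_w = torch.sigmoid(self.conv_w(y_w.transpose(2, 3)))  # per-column weights
        return x * a_h * a_w                                   # element-wise re-weighting
```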
In one embodiment, performing the enhancement processing on each input spatial feature map to obtain the enhanced feature maps includes: performing convolution on the bottom-layer spatial feature map output by the bottom-layer feature extraction module multiple times and outputting a bottom-layer enhanced feature map; convolving and up-sampling the bottom-layer enhanced feature map, splicing it with the middle-lower-layer spatial feature map output by the middle-lower-layer feature extraction module, and outputting a middle-lower-layer enhanced feature map; and convolving and up-sampling the middle-lower-layer enhanced feature map, splicing it with the middle-layer spatial feature map output by the middle-layer feature extraction module, and outputting a middle-layer enhanced feature map.
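The enhancement path described above could look roughly like the following PyTorch sketch; the channel widths, number of convolutions and module names are assumptions for illustration only.

```python
# Illustrative pyramid "neck": convolve the bottom map, then convolve + upsample
# and concatenate with the next shallower spatial feature map, twice.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidNeck(nn.Module):
    def __init__(self, c_mid=256, c_midlow=512, c_bottom=1024):
        super().__init__()
        self.bottom_convs = nn.Sequential(
            *[nn.Conv2d(c_bottom, c_bottom, 3, padding=1) for _ in range(3)])
        self.reduce_bottom = nn.Conv2d(c_bottom, c_midlow, 1)
        self.reduce_midlow = nn.Conv2d(c_midlow + c_midlow, c_mid, 1)

    def forward(self, f_mid, f_midlow, f_bottom):
        e_bottom = self.bottom_convs(f_bottom)                        # bottom-layer enhanced map
        up = F.interpolate(self.reduce_bottom(e_bottom), scale_factor=2)
        e_midlow = torch.cat([up, f_midlow], dim=1)                   # middle-lower enhanced map
        up2 = F.interpolate(self.reduce_midlow(e_midlow), scale_factor=2)
        e_mid = torch.cat([up2, f_mid], dim=1)                        # middle-layer enhanced map
        return e_mid, e_midlow, e_bottom
```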
In one embodiment, performing spatial feature fusion on each enhanced feature map to obtain the corresponding fused feature map includes: adjusting the sizes of the enhanced feature maps of the other two layers according to the size of the enhanced feature map of the target layer, and fusing the size-matched enhanced feature maps according to the weight of each layer's enhanced feature map to obtain the fused feature map of the target layer. In this embodiment, the spatial feature fusion unit includes a first, a second and a third spatial feature fusion module, each of which is an ASFF (Adaptive Spatial Feature Fusion) module used to fuse feature maps from different scales and different scenes and thereby improve image recognition performance. The ASFF module includes an adaptive pooling sub-module, a feature selection sub-module and a channel control sub-module. The ASFF procedure takes the multi-scale enhanced feature maps as input: each enhanced feature map is adaptively pooled and resized; the weight of each feature map is computed to select the most representative one; the channels of the selected feature map are weighted according to those weights to obtain feature vectors, a step that also uses a channel control mechanism to dynamically adjust the channel weights within the feature map; and the feature vectors are concatenated or fused to form the corresponding fused feature map. By performing adaptive spatial feature fusion on the multiple enhanced feature maps, the spatial feature fusion unit strengthens the expressive power of the network, makes full use of feature information at different scales while avoiding the feature imbalance caused by large differences in feature proportions, and filters out useless information while retaining useful information, thereby improving the accuracy of information entry from personnel photos.
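A simplified sketch of adaptive spatial feature fusion for one target level is shown below; it assumes the three enhanced maps have already been projected to a common channel width and uses a softmax over per-level weight maps, which is an approximation of the ASFF module described above rather than its exact design.

```python
# Simplified per-level ASFF-style fusion: resize the other maps to the target size,
# predict a per-pixel weight for each level, and take the weighted sum.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFFuse(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # one 1x1 conv per input level produces a scalar weight map
        self.weight_convs = nn.ModuleList([nn.Conv2d(channels, 1, 1) for _ in range(3)])

    def forward(self, target, others):
        # Resize the other two enhanced maps to the target level's spatial size.
        maps = [target] + [F.interpolate(o, size=target.shape[-2:], mode="nearest")
                           for o in others]
        logits = torch.cat([conv(m) for conv, m in zip(self.weight_convs, maps)], dim=1)
        weights = torch.softmax(logits, dim=1)                  # per-pixel level weights
        fused = sum(weights[:, i:i + 1] * m for i, m in enumerate(maps))
        return fused
```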
In one embodiment, the invention evaluates each generation of the trained model using mAP (mean Average Precision) to assess the detection performance. The mAP is computed as
mAP = (1/n) · Σ_{i=1}^{n} AP_i
where n is the number of categories and AP_i is the average precision of the i-th category, representing how well the model detects targets of that category; the larger the value, the better the detection. The average precision of each category is computed as
AP = Σ_{r=1}^{R} P(r) · Δrec(r)
where R is the number of targets actually present in the category, P(r) is the precision when the first r predictions are considered correct, and rec(r) is the recall over the first r predictions. When computing the AP of each category, the detections are first sorted by confidence from high to low, then the Precision and Recall values of each detection box are computed, and the AP is derived from them. The Precision and Recall values here are computed from the GIoU between the predicted and ground-truth bounding boxes.
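The per-class AP and the mAP could be computed roughly as in the sketch below, where matched is a boolean array marking whether each detection matches a ground-truth box under the chosen overlap criterion; the function names are illustrative.

```python
# Sketch of the per-class AP and mAP computation outlined above.
import numpy as np

def average_precision(scores, matched, num_gt):
    # scores: detection confidences; matched: boolean array (True = true positive);
    # num_gt: number of ground-truth targets in this class.
    order = np.argsort(scores)[::-1]               # sort detections by confidence
    tp = np.cumsum(matched[order])
    fp = np.cumsum(~matched[order])
    precision = tp / (tp + fp)
    recall = tp / num_gt
    # sum precision over recall increments (discrete form of the AP integral)
    return float(np.sum(precision * np.diff(np.concatenate(([0.0], recall)))))

def mean_average_precision(aps):
    return float(np.mean(aps))                     # mAP = average of per-class APs
```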
It should be understood that, although the steps in the flowchart of fig. 1 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and the order in which these sub-steps or stages are performed is not necessarily sequential; they may be performed in turn or alternately with at least a portion of the other steps or of the sub-steps or stages of the other steps.
In one embodiment, as shown in fig. 3, there is provided a YOLOv 3-based image identification recognition apparatus, including: a sample acquisition module 302, a model construction module 304, a model training module 306, and an information entry module 308, wherein:
a sample acquiring module 302, configured to acquire a plurality of preprocessed identification image samples, each identification image sample having its identifiers marked with labels;
the model construction module 304 is configured to construct an identification recognition model; the identification recognition model adopts a YOLOv3 model as a basic framework; the identification model comprises a feature extraction unit, a feature pyramid and a spatial feature fusion unit which are connected in sequence; each feature extraction module in the feature extraction unit comprises a spatial attention layer and a residual block, and is used for carrying out spatial feature extraction on an input feature map and outputting a spatial feature map; the feature pyramid is used for carrying out enhancement processing on each input space feature map to obtain an enhanced feature map; the spatial feature fusion unit is used for carrying out spatial feature fusion on each enhancement feature map to obtain a corresponding fusion feature map; inputting the fusion feature map into a detection head of the identification model to output a corresponding identification result;
the model training module 306 is configured to train the identification recognition model according to the label of the identification image sample and a prediction result obtained by inputting the identification image sample into the identification recognition model, so as to obtain a trained identification recognition model;
the information input module 308 is configured to input the identification image to be detected into the trained identification recognition model, obtain an identification recognition result of the identification image to be detected, and perform information input of the identification image according to the identification recognition result.
In one embodiment, the method further comprises: clustering the bounding boxes corresponding to the identifier regions on each identification image sample to obtain prior boxes; and guiding the training of the identification recognition model according to the prior boxes.
In one embodiment, the method further comprises: the identification recognition model comprises a residual network; the residual network comprises an upper-layer residual block, a middle-upper-layer residual block and a feature extraction unit connected in sequence; and the feature extraction unit comprises a middle-layer feature extraction module, a middle-lower-layer feature extraction module and a bottom-layer feature extraction module connected in sequence.
In one embodiment, the method further comprises: performing convolution on the bottom-layer spatial feature map output by the bottom-layer feature extraction module multiple times and outputting a bottom-layer enhanced feature map; convolving and up-sampling the bottom-layer enhanced feature map, splicing it with the middle-lower-layer spatial feature map output by the middle-lower-layer feature extraction module, and outputting a middle-lower-layer enhanced feature map; and convolving and up-sampling the middle-lower-layer enhanced feature map, splicing it with the middle-layer spatial feature map output by the middle-layer feature extraction module, and outputting a middle-layer enhanced feature map.
In one embodiment, the method further comprises: and adjusting the sizes of the enhancement feature images corresponding to the other two layers according to the sizes of the enhancement feature images of the target layer, and fusing the enhancement feature images with the same size according to the weight of each layer of enhancement feature images to obtain a fused feature image corresponding to the target layer.
In one embodiment, the method further comprises: the loss function identifying the model is identified as GIoU.
For specific limitations of the YOLOv3-based image identification recognition device, reference may be made to the above limitations of the YOLOv3-based image identification recognition method, which are not repeated here. Each of the above modules in the YOLOv3-based image identification recognition device may be implemented in whole or in part by software, hardware or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in software form in a memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a YOLOv 3-based image identification recognition method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the structures shown in FIG. 4 are block diagrams only and do not constitute a limitation of the computer device on which the present aspects apply, and that a particular computer device may include more or less components than those shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment a computer device is provided comprising a memory storing a computer program and a processor implementing the steps of the method of the above embodiments when the computer program is executed.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above embodiments merely represent several implementations of the present application; their description is specific and detailed, but they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the concept of the present application, and these would fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (10)

1. A YOLOv3-based image identification recognition method, the method comprising:
acquiring an identification image sample; the identification image sample is marked by a label;
constructing an identification recognition model; the identification recognition model adopts a YOLOv3 model as a basic framework; the identification model comprises a feature extraction unit, a feature pyramid and a spatial feature fusion unit which are connected in sequence; each feature extraction module in the feature extraction unit comprises a spatial attention layer and a residual block, and is used for carrying out spatial feature extraction on an input feature map and outputting a spatial feature map; the feature pyramid is used for carrying out enhancement processing on each input space feature map to obtain an enhanced feature map; the spatial feature fusion unit is used for carrying out spatial feature fusion on each enhancement feature map to obtain a corresponding fusion feature map; the fusion feature map is input into a detection head of the identification recognition model to output a corresponding identification recognition result;
training the identification recognition model according to the label of the identification image sample and a prediction result obtained by inputting the identification image sample into the identification recognition model to obtain a trained identification recognition model;
inputting the identification image to be detected into the trained identification recognition model to obtain an identification recognition result of the identification image to be detected.
2. The method of claim 1, wherein the identification recognition model comprises a residual network; the residual network comprises an upper-layer residual block, a middle-upper-layer residual block and the feature extraction unit connected in sequence; and the feature extraction unit comprises a middle-layer feature extraction module, a middle-lower-layer feature extraction module and a bottom-layer feature extraction module connected in sequence.
3. The method according to claim 1 or 2, wherein the enhancing each spatial feature map of the input to obtain an enhanced feature map comprises:
performing convolution processing on the bottom layer space feature map output by the bottom layer feature extraction module for multiple times, and outputting a bottom layer enhancement feature map;
convolving and up-sampling the bottom-layer enhanced feature map, splicing it with the middle-lower-layer spatial feature map output by the middle-lower-layer feature extraction module, and outputting a middle-lower-layer enhanced feature map;
and convolving and up-sampling the middle-lower-layer enhanced feature map, splicing it with the middle-layer spatial feature map output by the middle-layer feature extraction module, and outputting a middle-layer enhanced feature map.
4. The method of claim 3, wherein the performing spatial feature fusion on each enhancement feature map to obtain a corresponding fusion feature map comprises:
and adjusting the sizes of the enhancement feature images corresponding to the other two layers according to the sizes of the enhancement feature images of the target layer, and fusing the enhancement feature images with the same size according to the weight of each layer of enhancement feature images to obtain a fused feature image corresponding to the target layer.
5. The method of claim 4, wherein the loss function of the identity recognition model is GIoU.
6. The method according to claim 1, wherein the method further comprises:
clustering the bounding boxes corresponding to the identifier regions on each identification image sample to obtain prior boxes;
and guiding the training of the identification recognition model according to the prior boxes.
7. An image identification recognition device based on YOLOv3, the device comprising:
the sample acquisition module is used for acquiring an identification image sample; the identification image sample is marked by a label;
the model construction module is used for constructing an identification recognition model; the identification recognition model adopts a YOLOv3 model as a basic framework; the identification model comprises a feature extraction unit, a feature pyramid and a spatial feature fusion unit which are connected in sequence; each feature extraction module in the feature extraction unit comprises a spatial attention layer and a residual block, and is used for carrying out spatial feature extraction on an input feature map and outputting a spatial feature map; the feature pyramid is used for carrying out enhancement processing on each input space feature map to obtain an enhanced feature map; the spatial feature fusion unit is used for carrying out spatial feature fusion on each enhancement feature map to obtain a corresponding fusion feature map; the fusion feature map is input into a detection head of the identification recognition model to output a corresponding identification recognition result;
the model training module is used for training the identification recognition model according to the label of the identification image sample and a prediction result obtained by inputting the identification image sample into the identification recognition model to obtain a trained identification recognition model;
the identification recognition module is used for inputting the identification image to be detected into the trained identification recognition model to obtain an identification recognition result of the identification image to be detected.
8. The apparatus of claim 7, wherein the model building module is further configured such that the identification recognition model comprises a residual network; the residual network comprises an upper-layer residual block, a middle-upper-layer residual block and the feature extraction unit connected in sequence; and the feature extraction unit comprises a middle-layer feature extraction module, a middle-lower-layer feature extraction module and a bottom-layer feature extraction module connected in sequence.
9. The apparatus of claim 7, wherein the model building module is further configured to perform convolution on the bottom-layer spatial feature map output by the bottom-layer feature extraction module multiple times and output a bottom-layer enhanced feature map; convolve and up-sample the bottom-layer enhanced feature map, splice it with the middle-lower-layer spatial feature map output by the middle-lower-layer feature extraction module, and output a middle-lower-layer enhanced feature map; and convolve and up-sample the middle-lower-layer enhanced feature map, splice it with the middle-layer spatial feature map output by the middle-layer feature extraction module, and output a middle-layer enhanced feature map.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
CN202311378736.8A 2023-10-23 2023-10-23 Image identification recognition method, device and equipment based on YOLOv3 Pending CN117422975A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311378736.8A CN117422975A (en) 2023-10-23 2023-10-23 Image identification recognition method, device and equipment based on YOLOv3

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311378736.8A CN117422975A (en) 2023-10-23 2023-10-23 Image identification recognition method, device and equipment based on YOLOv3

Publications (1)

Publication Number Publication Date
CN117422975A true CN117422975A (en) 2024-01-19

Family

ID=89527879

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311378736.8A Pending CN117422975A (en) 2023-10-23 2023-10-23 Image identification recognition method, device and equipment based on YOLOv3

Country Status (1)

Country Link
CN (1) CN117422975A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination