CN111428875A - Image recognition method and device and corresponding model training method and device - Google Patents


Info

Publication number
CN111428875A
CN111428875A
Authority
CN
China
Prior art keywords
training
image
fuzzy
network
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010165382.9A
Other languages
Chinese (zh)
Inventor
张珂
罗钧峰
范铭源
魏晓明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority application: CN202010165382.9A
Publication: CN111428875A
Legal status: Pending

Classifications

    • G06N 3/08 — Learning methods (G Physics › G06 Computing; Calculating or Counting › G06N Computing arrangements based on specific computational models › G06N 3/00 Computing arrangements based on biological models › G06N 3/02 Neural networks)
    • G06F 18/24 — Classification techniques (G06F Electric digital data processing › G06F 18/00 Pattern recognition › G06F 18/20 Analysing)
    • G06F 18/253 — Fusion techniques of extracted features (G06F 18/25 Fusion techniques)
    • G06N 3/045 — Combinations of networks (G06N 3/04 Architecture, e.g. interconnection topology)


Abstract

The application discloses an image recognition method and apparatus, and a corresponding model training method and apparatus. The training method of the image recognition model comprises the following steps: extracting a multi-scale feature map of a training image with a backbone network; determining candidate regions in the training image with a region generation network, based on the multi-scale feature map; predicting the effective targets contained in the training image with an effective target prediction branch network, based on the multi-scale feature map and the candidate regions; predicting the fuzzy (blurred) targets contained in the training image with a fuzzy target prediction branch network, based on the candidate regions; calculating a model loss value from the annotation information of the training image and the prediction results; and updating the parameters of the image recognition model according to the model loss value, or ending training. The scheme trains an end-to-end image recognition model that can effectively recognize whether fuzzy targets exist in live-action images such as road-acquisition imagery, improving the recognition accuracy and recall of effective targets.

Description

Image recognition method and device and corresponding model training method and device
Technical Field
The application relates to the field of computer vision, in particular to an image recognition method and device and a corresponding model training method and device.
Background
Neural networks can effectively identify targets in images, providing a technical basis for scenarios such as autonomous driving. However, many images are affected by environmental and other factors when captured, and many targets appear blurred in them, which poses a challenge to image recognition. A solution is therefore needed for recognizing blurred targets in images that balances accuracy and efficiency.
Disclosure of Invention
In view of the above, the present application is proposed to provide an image recognition method and apparatus, and a corresponding model training method and apparatus, that overcome or at least partially solve the above problems.
According to a first aspect of the present application, there is provided a training method for an image recognition model, wherein the image recognition model includes a backbone network, a region generation network, an effective target prediction branch network and a fuzzy target prediction branch network, and the method comprises a plurality of iterated training phases, each training phase comprising: extracting a multi-scale feature map of a training image with the backbone network; determining candidate regions in the training image with the region generation network, based on the multi-scale feature map; predicting the effective targets contained in the training image with the effective target prediction branch network, based on the multi-scale feature map and the candidate regions; predicting the fuzzy targets contained in the training image with the fuzzy target prediction branch network, based on the candidate regions; calculating a model loss value from the annotation information of the training image and the prediction results of the effective target prediction branch network and the fuzzy target prediction branch network; and updating the parameters of the image recognition model according to the model loss value, or ending training.
Optionally, the backbone network includes a cascaded multi-scale feature extraction network and a multi-scale feature fusion network; the extracting of the multi-scale feature map of the training image according to the backbone network comprises: extracting an image feature map of the training image under multiple scales according to the multi-scale feature extraction network; and according to the multi-scale feature fusion network, carrying out feature fusion processing on the image feature map of the training image under multiple scales to obtain the multi-scale feature map of the training image.
Optionally, the multi-scale feature fusion network is specifically a feature pyramid network (FPN), and extracting the image feature maps of the training image at multiple scales comprises: extracting a plurality of image feature maps of successively decreasing scale in a bottom-up manner, obtaining an image feature map pyramid of the training image; and performing the feature fusion processing on the image feature maps of the training image at multiple scales according to the multi-scale feature fusion network comprises: performing top-down processing on the image feature map pyramid using the FPN.
Optionally, the determining of candidate regions in the training image according to the region generation network based on the multi-scale feature map comprises: generating, by the region generation network, anchor samples based on the multi-scale feature map, according to a preset number and/or preset aspect ratios of anchors; and determining the candidate regions according to the confidence of the anchor samples.
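As an illustrative sketch only (not part of the claimed solution; all names hypothetical), anchor boxes at preset aspect ratios can be generated around each location, and the highest-confidence anchor samples kept as candidate regions:

```python
def make_anchors(cx, cy, scale, ratios=(0.5, 1.0, 2.0)):
    """Anchor boxes (x1, y1, x2, y2) centred at (cx, cy).

    Each box has area scale**2; `ratios` are height/width aspect ratios.
    """
    boxes = []
    for r in ratios:
        w = scale / r ** 0.5
        h = scale * r ** 0.5
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

def top_candidates(anchors, scores, k):
    """Keep the k anchor samples with the highest confidence scores."""
    order = sorted(range(len(anchors)), key=scores.__getitem__, reverse=True)
    return [anchors[i] for i in order[:k]]
```

In practice the confidence scores come from the region generation network's objectness head, and overlapping candidates would additionally be suppressed.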
Optionally, the predicting, based on the multi-scale feature map and the candidate region, the effective target included in the training image according to the effective target prediction branch network includes: generating a feature map of a candidate region based on the multi-scale feature map and the candidate region; and obtaining the position regression prediction result of the effective target and the classification prediction result of the effective target by the effective target prediction branch network according to the feature map of the candidate region.
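The position regression output can be parameterised as offsets of the target box relative to the candidate box. A minimal sketch of the widely used Faster R-CNN style encoding is given below as an assumption — the patent does not specify the parameterisation:

```python
import math

def encode_box(anchor, gt):
    """Regression targets (tx, ty, tw, th) of ground-truth box `gt`
    relative to `anchor`; both boxes are (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = anchor
    gx1, gy1, gx2, gy2 = gt
    aw, ah = ax2 - ax1, ay2 - ay1
    gw, gh = gx2 - gx1, gy2 - gy1
    tx = (gx1 + gw / 2 - (ax1 + aw / 2)) / aw   # centre offset, scale-normalised
    ty = (gy1 + gh / 2 - (ay1 + ah / 2)) / ah
    return tx, ty, math.log(gw / aw), math.log(gh / ah)
```

A perfectly matching candidate box yields all-zero targets, which keeps the regression head's task well conditioned.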
Optionally, the predicting, based on the candidate regions, of the fuzzy targets contained in the training image according to the fuzzy target prediction branch network comprises: performing, by the fuzzy target prediction branch network, a binary classification prediction of whether the image of a candidate region contains a fuzzy target.
Optionally, the performing, by the fuzzy target prediction branch network, of the binary classification prediction of whether the image of a candidate region contains a fuzzy target comprises: extracting a fuzzy feature map of the image of the candidate region with a plurality of residual modules connected in series in the fuzzy target prediction branch network, performing global average pooling on the extracted fuzzy feature map to obtain a feature vector with one element per channel of the fuzzy feature map, and performing the binary prediction of whether a fuzzy target is contained according to the feature vector.
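The pooling and binary head can be sketched as follows (the residual modules are omitted and all names are hypothetical; this is an illustration, not the patented implementation). Note that global average pooling of a C-channel map produces a length-C vector, one mean per channel:

```python
import math
import numpy as np

def global_avg_pool(feature_map):
    """(C, H, W) fuzzy feature map -> length-C vector: one mean per channel."""
    return feature_map.mean(axis=(1, 2))

def predict_blur(feature_map, w, b):
    """Binary prediction of whether the candidate region contains a fuzzy
    target, using a linear head over the globally pooled feature vector."""
    logit = float(global_avg_pool(feature_map) @ w) + b
    prob = 1.0 / (1.0 + math.exp(-logit))   # sigmoid over the single logit
    return prob > 0.5
```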
Optionally, the calculating a model loss value according to the labeling information of the training image and the prediction results of the effective target prediction branch network and the fuzzy target prediction branch network includes: respectively calculating a position regression loss value of the effective target, a category cross entropy loss value of the effective target and a category cross entropy loss value of the fuzzy target according to the labeling information of the training image and the prediction results of the effective target prediction branch network and the fuzzy target prediction branch network; and determining a model loss value according to the position regression loss value of the effective target, the class cross entropy loss value of the effective target and the class cross entropy loss value of the fuzzy target.
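The combination of the three loss terms can be sketched as a weighted sum (an illustrative assumption — the patent does not specify the weighting):

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy loss of one prediction; `probs` sums to 1."""
    return -math.log(probs[label])

def model_loss(reg_loss, valid_probs, valid_label, blur_probs, blur_label,
               weights=(1.0, 1.0, 1.0)):
    """Model loss = position regression loss of the effective target,
    plus the class cross-entropy losses of the effective target and of
    the (binary) fuzzy target. Equal weights are an assumption."""
    wr, wv, wb = weights
    return (wr * reg_loss
            + wv * cross_entropy(valid_probs, valid_label)
            + wb * cross_entropy(blur_probs, blur_label))
```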
According to a second aspect of the present application, there is provided an image recognition method comprising: acquiring an image to be recognized; and recognizing the image to be recognized using an image recognition model trained by the above training method, obtaining an image recognition result, wherein the image recognition result comprises at least one of: the position of a fuzzy target, the position of an effective target, and the category of an effective target.
According to a third aspect of the present application, there is provided a training apparatus for an image recognition model, wherein the image recognition model includes a backbone network, a region generation network, an effective target prediction branch network and a fuzzy target prediction branch network, and the apparatus is configured to perform a plurality of iterated training phases, the apparatus comprising: a multi-scale feature map extraction unit for extracting a multi-scale feature map of a training image with the backbone network; a candidate region determining unit for determining candidate regions in the training image with the region generation network, based on the multi-scale feature map; an effective target prediction unit for predicting the effective targets contained in the training image with the effective target prediction branch network, based on the multi-scale feature map and the candidate regions; a fuzzy target prediction unit for predicting the fuzzy targets contained in the training image with the fuzzy target prediction branch network, based on the candidate regions; and a training control unit for calculating a loss function value from the annotation information of the training image and the prediction results of the effective target prediction branch network and the fuzzy target prediction branch network, and updating the parameters of the image recognition model according to the loss function value, or ending training.
Optionally, the backbone network includes a cascaded multi-scale feature extraction network and a multi-scale feature fusion network; the multi-scale feature map extraction unit is used for extracting the image feature maps of the training images under multiple scales according to the multi-scale feature extraction network; and according to the multi-scale feature fusion network, carrying out feature fusion processing on the image feature map of the training image under multiple scales to obtain the multi-scale feature map of the training image.
Optionally, the multi-scale feature fusion network is specifically a feature pyramid network FPN; the multi-scale feature map extraction unit is used for extracting a plurality of image feature maps with sequentially reduced scales in a bottom-up mode to obtain an image feature map pyramid of the training image; the performing of feature fusion processing on the image feature map of the training image under multiple scales according to the multi-scale feature fusion network comprises: and utilizing the FPN to perform top-down processing on the image feature map pyramid.
Optionally, the candidate region determining unit is configured to generate, by the region generation network, anchor point samples based on the multi-scale feature map according to a preset number and/or ratio of anchor points; candidate regions are determined based on the confidence of the anchor samples.
Optionally, the effective target prediction unit is configured to generate a feature map of a candidate region based on the multi-scale feature map and the candidate region; and obtaining the position regression prediction result of the effective target and the classification prediction result of the effective target by the effective target prediction branch network according to the feature map of the candidate region.
Optionally, the fuzzy object prediction unit, configured to predict the fuzzy object included in the training image according to the fuzzy object prediction branch network, includes: and performing two-classification prediction on whether the image of the candidate region contains a fuzzy target or not by the fuzzy target prediction branch network.
Optionally, the fuzzy target prediction unit is configured to extract a fuzzy feature map of the image of the candidate region with a plurality of residual modules connected in series in the fuzzy target prediction branch network, perform global average pooling on the extracted fuzzy feature map to obtain a feature vector with one element per channel of the fuzzy feature map, and perform a binary classification prediction of whether a fuzzy target is contained according to the feature vector.
Optionally, the training control unit is configured to calculate a position regression loss value of the effective target, a category cross entropy loss value of the effective target, and a category cross entropy loss value of the fuzzy target according to the labeling information of the training image and the prediction results of the effective target prediction branch network and the fuzzy target prediction branch network; and determining a model loss value according to the position regression loss value of the effective target, the class cross entropy loss value of the effective target and the class cross entropy loss value of the fuzzy target.
According to a fourth aspect of the present application, there is provided an image recognition apparatus configured to: acquire an image to be recognized; and recognize the image to be recognized using an image recognition model trained by the above training apparatus for an image recognition model, obtaining an image recognition result, wherein the image recognition result comprises at least one of: the position of a fuzzy target, the position of an effective target, and the category of an effective target.
According to a fifth aspect of the present application, there is provided an electronic device comprising: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to perform a method of training an image recognition model as defined in any one of the above, or cause the processor to perform a method of image recognition as defined in any one of the above.
According to a sixth aspect of the present application, there is provided a computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of training an image recognition model as defined in any of the above, or implement the method of image recognition as defined in any of the above.
From the above, the technical solution of the present application provides an image recognition model comprising a backbone network, a region generation network, an effective target prediction branch network and a fuzzy target prediction branch network, together with a method for training it. Specifically, after the backbone network extracts a multi-scale feature map of a training image, candidate regions in the training image are determined by the region generation network based on the multi-scale feature map, and two branches then process them separately: the effective target prediction branch network predicts the effective targets contained in the training image, based on the multi-scale feature map and the candidate regions, while the fuzzy target prediction branch network predicts the fuzzy targets contained in the training image, based on the candidate regions. A model loss value is calculated from the annotation information of the training image and the prediction results of the two branch networks, and the parameters of the image recognition model are updated according to the model loss value, or training ends. The scheme trains an end-to-end image recognition model that can effectively recognize whether fuzzy targets exist in live-action images such as road-acquisition imagery, improving the recognition accuracy and recall of effective targets.
The foregoing is only an overview of the technical solutions of the present application. In order that the technical means of the present application may be more clearly understood and implemented according to the content of this description, and in order to make the above and other objects, features and advantages of the present application more apparent, a detailed description of the application follows.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 illustrates a flow diagram of a method for training an image recognition model according to an embodiment of the present application;
FIG. 2 shows a schematic flow diagram of an image recognition method according to an embodiment of the present application;
FIG. 3 shows a detailed flow diagram of an image recognition method according to an embodiment of the present application;
FIG. 4 illustrates an image recognition effect graph according to one embodiment of the present application;
FIG. 5 is a schematic structural diagram of an apparatus for training an image recognition model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an image recognition apparatus according to an embodiment of the present application;
FIG. 7 shows a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 8 shows a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Taking automated map production as an example, computer vision algorithms detect and recognize traffic signs on roads, so that road elements such as pedestrian crosswalks, speed limits and go-straight arrows are automatically identified in ordinary street views. The recognition results are delivered as production data to the map's back-end warehousing process and, after manual review, go online to serve user navigation.
However, the performance of traffic sign detection and recognition is strongly affected by the natural environment at capture time: occlusion by trees or vehicles, wear, weather and similar factors commonly distort or blur the image in natural street views, and a captured target may also be too small to be recognized.
Although the category of a fuzzy target is hard to distinguish from a single frame, if the images are continuous (such as road-acquisition imagery), the fuzzy target can be confirmed manually, or by other means, from the related preceding and following frames. That is, if the image recognition model can identify which targets are fuzzy, those targets can be handled specifically in a subsequent manual review process.
Two possible solutions are given below. The training images come from actual acquisition, and therefore contain both effective targets (whose position and category can both be annotated) and fuzzy targets (whose approximate position can be annotated from a single frame, but whose category cannot, and may itself be considered "fuzzy"). The image recognition model adopts a two-stage design, whose performance is more robust than a one-stage model, particularly for recognizing fine-grained targets in road images. The first stage performs target detection, i.e. target position regression and coarse classification, and can use Faster R-CNN; the second stage performs fine-grained classification of targets and can be implemented with a ResNet-family network such as ResNet-101.
In the first scheme, only effective targets are annotated in the training images; fuzzy targets are not handled. After the detection stage, targets can be recalled, but fuzzy targets are recalled along with effective ones. The second-stage network then finely classifies all targets, and because fuzzy targets are present, each is assigned some specific category (whereas its true category should be "fuzzy"), so the overall accuracy falls short of expectations.
In the second scheme, both effective targets and fuzzy targets are annotated in the training images. After the detection stage, targets can be recalled; fuzzy targets are recalled along with effective ones. A fuzzy judgment module then decides whether each recalled target is fuzzy or effective: a fuzzy target goes directly to subsequent manual review, while an effective target is finely classified by the second-stage network. Accuracy is thus improved.
Considering the lightweight requirement on the image recognition model, the fuzzy judgment module is usually implemented with a traditional blur-judgment algorithm, such as an entropy function based on statistical features, an energy gradient function, a Laplacian gradient function or the gray-variance product method, or with the BRISQUE algorithm for non-specific distortion. However, these algorithms all share the same problem: their judgments differ greatly from the annotation information (ground truth) and are unstable, so some effective targets are removed together with the fuzzy targets and the recall rate drops.
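One of the classical blur metrics mentioned above — the variance of the Laplacian response — can be sketched as follows. This is a generic illustration of the technique, not the patent's implementation; low values indicate little high-frequency content, i.e. likely blur:

```python
import numpy as np

LAPLACIAN = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=float)

def laplacian_variance(gray):
    """Variance of the Laplacian response of a 2-D grayscale image."""
    h, w = gray.shape
    resp = np.empty((h - 2, w - 2))
    for i in range(h - 2):          # valid convolution with the 3x3 kernel
        for j in range(w - 2):
            resp[i, j] = (gray[i:i + 3, j:j + 3] * LAPLACIAN).sum()
    return resp.var()
```

A flat (maximally blurred) patch scores zero, while a sharp, high-contrast patch scores high — but a single fixed threshold on such scores is exactly what proves unstable against the ground truth, as described above.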
Statistics on driving-recorder data containing 18599 effective targets and 1615 fuzzy targets show that such a blur-judgment algorithm judged 5.5% of the ground-truth effective targets as fuzzy, and only 20% of the fuzzy targets as fuzzy; these proportions show that a fuzzy judgment module of this kind does not achieve the effect expected by the scheme.
From the two schemes above, recall performance must be guaranteed when training the target detection stage, and detected fuzzy targets must be removed before the fine classification stage to improve accuracy; together these reduce the workload of manual review. Designing a network framework that improves image recognition accuracy by effectively screening out fuzzy targets is therefore the problem this scheme addresses.
The technical idea of the application is to train an end-to-end image recognition model, taking fuzzy target prediction as a task branch of the model, which effectively improves the accuracy and recall of image recognition and can be applied to recognizing road-acquisition images for maps. Images identified as containing fuzzy targets can then be reviewed manually.
Fig. 1 shows a flow chart of a training method of an image recognition model according to an embodiment of the present application. The image recognition model comprises a backbone network, a region generation network, an effective target prediction branch network and a fuzzy target prediction branch network, the method comprises a plurality of iterative training stages, and each training stage comprises:
and step S110, extracting the multi-scale feature map of the training image according to the backbone network. The multi-scale feature map may refer to an image feature map at multiple scales, and of course, the image feature map may be subjected to feature fusion and the like. The targets in the image may have different sizes and different sizes, and the same target details and the whole may have various features, so that the features are extracted on different scales and are correspondingly identified, and the effect is better.
And step S120, determining candidate regions in the training image according to the region generation network based on the multi-scale feature map. Specifically, the region generation network, RPN (Region Proposal Network), can give candidate regions in the form of candidate boxes.
And S130, predicting the effective target contained in the training image according to the effective target prediction branch network based on the multi-scale feature map and the candidate region. Wherein, the effective target prediction branch network can be realized by referring to a Faster regional convolutional neural network (Faster R-CNN), which can be regarded as fine identification of the effective target.
And step S140, predicting the fuzzy targets contained in the training image according to the fuzzy target prediction branch network based on the candidate regions. It can be seen that, in the image recognition model of the present application, the effective target prediction branch network and the fuzzy target prediction branch network are parallel to each other, with no sequential dependency between them. This avoids the problem that arises when one network's output is used as another's input, where errors of the former network inevitably affect the result of the latter.
And S150, calculating a model loss value according to the labeling information of the training image and the prediction results of the effective target prediction branch network and the fuzzy target prediction branch network.
And step S160, updating the parameters of the image recognition model according to the model loss value, or ending training. For example, if the model loss value has converged, training ends; otherwise the image recognition model is optimized, for instance by gradient update, for the next round of training.
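The control flow of steps S150–S160 can be sketched as follows. This is a toy stand-in: `quadratic_loss` replaces the real combined detection losses, and all names are hypothetical illustrations rather than the patented implementation:

```python
def train(loss_and_grad, params, lr=0.1, tol=1e-6, max_iters=10000):
    """Repeat: compute the model loss (S150); if the loss has converged,
    end training, otherwise update parameters by gradient descent (S160)."""
    prev = float("inf")
    for _ in range(max_iters):
        loss, grads = loss_and_grad(params)
        if abs(prev - loss) < tol:
            break                      # loss converged -> end training
        params = [p - lr * g for p, g in zip(params, grads)]
        prev = loss
    return params, loss

def quadratic_loss(params):
    """Toy loss (x - 3)^2 standing in for the combined detection loss."""
    x, = params
    return (x - 3.0) ** 2, [2.0 * (x - 3.0)]
```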
It can be seen that, in the training method shown in fig. 1, by adding the fuzzy target prediction branch network, the image recognition model can be trained and evaluated end to end, can effectively recognize whether fuzzy targets exist in live-action images such as road-acquisition imagery, and improves the recognition accuracy and recall of effective targets. In particular, a manual review process can be connected downstream: during data production, images containing fuzzy targets are handed to manual review, while effective targets are finely recognized automatically as high-quality results, further improving the efficiency of manual review.
In an embodiment of the present application, in the training method of the image recognition model, the backbone network includes a cascaded multi-scale feature extraction network and a multi-scale feature fusion network; the method for extracting the multi-scale feature map of the training image according to the backbone network comprises the following steps: extracting an image feature map of the training image under multiple scales according to the multi-scale feature extraction network; and according to the multi-scale feature fusion network, carrying out feature fusion processing on the image feature map of the training image under multiple scales to obtain the multi-scale feature map of the training image.
The multi-scale feature extraction network can be implemented with a convolutional neural network; one specific choice is ResNeXt, used after fine-tuning. In scenarios such as traffic sign recognition, where the targets are small and highly varied, selecting a multi-scale feature fusion network for multi-scale feature fusion yields better recall, and the two networks can be cascaded.
Specifically, in an embodiment of the present application, in the training method for an image recognition model, the multi-scale feature fusion network is specifically a Feature Pyramid Network (FPN), and extracting an image feature map of the training image at multiple scales includes: extracting a plurality of image feature maps with successively decreasing scales in a bottom-up manner to obtain an image feature map pyramid of the training image. Performing feature fusion processing on the image feature maps of the training image at multiple scales according to the multi-scale feature fusion network includes: using the FPN to process the image feature map pyramid top-down.
In the image feature map pyramid, the higher the convolutional layer, the larger the receptive field of its feature map, and the weaker its localization information and resolution, but the stronger its semantic information. Fusing image feature maps of different scales therefore combines the strong semantic features of the high convolutional layers with the strong resolution of the low convolutional layers. Specifically, the topmost layer of the image feature map pyramid can be upsampled, the upsampled result fused (merged) with the same-sized feature map generated bottom-up, and a convolution applied to eliminate the aliasing effect of upsampling. This process is repeated until the bottom layer of the image feature map pyramid has been fused. A fused feature map is thus obtained for each layer of the image feature pyramid, and together they form the multi-scale fused feature map used as the multi-scale feature map.
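The top-down fusion described above can be sketched in a few lines. This is a minimal illustration assuming nearest-neighbour 2x upsampling and element-wise addition; the lateral 1 × 1 convolutions and the anti-aliasing 3 × 3 convolution are omitted for brevity, so all levels are assumed to share one channel count.

```python
import numpy as np

def upsample2x(fmap):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map.
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(pyramid):
    """Fuse a bottom-up pyramid [finest, ..., coarsest] top-down."""
    fused = [pyramid[-1]]                           # start from the top level
    for fmap in reversed(pyramid[:-1]):
        fused.append(fmap + upsample2x(fused[-1]))  # merge with upsampled result
    return list(reversed(fused))                    # finest level first again

# Toy pyramid: 8x8, 4x4, 2x2 maps with one channel each.
pyr = [np.ones((1, 8, 8)), np.ones((1, 4, 4)), np.ones((1, 2, 2))]
out = fpn_top_down(pyr)
# The finest fused map accumulates contributions from all three levels.
```

Each output level keeps the spatial size of its bottom-up counterpart, which is why the pyramid can serve as the multi-scale feature map directly.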
In an embodiment of the application, in the training method for the image recognition model, determining the candidate region in the training image according to the region generation network based on the multi-scale feature map includes: according to the number and/or proportion of the preset anchor points, generating an anchor point sample by the area generation network based on the multi-scale feature map; candidate regions are determined based on the confidence of the anchor samples.
The region generation network can generate positive and negative anchor samples with coarse coordinates and position regressions according to the preset number and proportion of anchors. Each anchor sample carries a confidence; the samples can be sorted by confidence from high to low and sent to the subsequent effective target detection branch network and fuzzy target detection branch network.
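The anchor generation and confidence sorting can be illustrated as follows; the scales, ratios, and scores are hypothetical values chosen only to show the mechanics of the "number and proportion of preset anchors".

```python
import numpy as np

def make_anchors(center, scales, ratios):
    """Generate (x1, y1, x2, y2) anchors at one feature-map location.

    `scales` are square roots of anchor areas in pixels; `ratios` are
    height/width aspect ratios.
    """
    cx, cy = center
    boxes = []
    for s in scales:
        for r in ratios:
            w, h = s / np.sqrt(r), s * np.sqrt(r)   # keep area close to s*s
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

anchors = make_anchors((64, 64), scales=[32, 64], ratios=[0.5, 1.0, 2.0])
# 2 scales x 3 ratios = 6 anchors at this location.
confidences = np.array([0.9, 0.2, 0.7, 0.4, 0.8, 0.1])  # hypothetical RPN scores
order = np.argsort(-confidences)        # sort from high to low confidence
top_candidates = anchors[order[:3]]     # keep the best-scoring anchors
```

The top-scoring anchors play the role of the candidate regions that are forwarded to the two branch networks.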
In an embodiment of the application, in the training method of the image recognition model, predicting the effective target included in the training image according to the effective target prediction branch network based on the multi-scale feature map and the candidate region includes: generating a feature map of the candidate region based on the multi-scale feature map and the candidate region; and obtaining the position regression prediction result of the effective target and the classification prediction result of the effective target by the effective target prediction branch network according to the feature map of the candidate region.
The candidate region can be mapped into the multi-scale feature map to obtain the feature map of the candidate region, after which position regression prediction and classification prediction of the effective target (that is, position refinement and class subdivision) are performed. The specific implementation can follow the target detection structure of Faster R-CNN and is not repeated here.
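The mapping from an image-space candidate box to a region of the feature map can be sketched as follows, assuming a fixed downsampling stride between the input image and the chosen feature-map level; a real Faster R-CNN-style implementation adds RoI pooling or alignment on top of this crop.

```python
import numpy as np

def roi_feature(feature_map, box, stride=16):
    """Crop the feature-map region corresponding to an image-space box.

    `stride` is the assumed downsampling factor between the input image
    and this feature-map level (16 is a common choice, not mandated here).
    """
    x1, y1, x2, y2 = [int(round(v / stride)) for v in box]
    x2, y2 = max(x2, x1 + 1), max(y2, y1 + 1)   # keep at least one cell
    return feature_map[:, y1:y2, x1:x2]

fmap = np.arange(1 * 8 * 8, dtype=float).reshape(1, 8, 8)  # (C, H, W)
roi = roi_feature(fmap, box=(32, 32, 96, 64))  # image-space candidate box
# A 64x32-pixel box at stride 16 becomes a 2x4 feature-map crop.
```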
In an embodiment of the application, in the training method of the image recognition model, predicting the fuzzy target included in the training image according to the fuzzy target prediction branch network based on the candidate region includes: and performing two-classification prediction on whether the image of the candidate region contains the fuzzy target or not by the fuzzy target prediction branch network.
Since the fuzzy target does not need a fine detection frame, the detection frame of the candidate region can be used directly; that is, the fuzzy target prediction branch network only needs to judge whether a fuzzy target exists. This can also be viewed as identifying whether an image has the blur property. In short, the task can be implemented as a binary classification problem.
Specifically, in an embodiment of the present application, in the training method of the image recognition model, the performing, by the fuzzy target prediction branch network, binary classification prediction on whether the image of the candidate region includes a fuzzy target includes: extracting a fuzzy feature map of the image in the candidate region through a plurality of serially connected residual modules in the fuzzy target prediction branch network, performing global average pooling on the extracted fuzzy feature map to obtain a feature vector with one element per channel of the fuzzy feature map, and performing binary classification prediction on whether a fuzzy target is included according to the feature vector.
The type and number of residual modules can be chosen for the actual application scenario. For example, in a map road-acquisition image recognition scenario, two concatenated ResNet blocks can be used, where each ResNet block is a three-layer residual module consisting, in order, of a convolutional layer with a 1 × 1 kernel, a convolutional layer with a 3 × 3 kernel, and a convolutional layer with a 1 × 1 kernel; after each convolutional layer, batch normalization and activation (e.g., with a ReLU function) can be applied.
Global average pooling (GAP) is mainly used to replace the fully connected layer: each channel of the fuzzy feature map is averaged over the whole map into a single feature point, and these points form the final feature vector, which requires far fewer parameters than a fully connected layer.
The binary classification of fuzzy targets and the classification of effective targets mentioned in the above embodiments are both classification problems and can therefore be implemented with a softmax classifier.
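The GAP-plus-softmax classification described above can be sketched as follows; the channel count and the classifier weights are hypothetical, chosen only to show the shapes involved.

```python
import numpy as np

def global_average_pool(fmap):
    # (C, H, W) feature map -> length-C vector: one value per channel.
    return fmap.mean(axis=(1, 2))

def softmax(logits):
    z = logits - logits.max()   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical blur feature map with 4 channels.
blur_fmap = np.random.rand(4, 7, 7)
vec = global_average_pool(blur_fmap)     # feature vector, length 4
W = np.random.randn(2, 4) * 0.1          # toy two-class classifier weights
probs = softmax(W @ vec)                 # [P(not blurred), P(blurred)]
is_blurred = bool(probs[1] > 0.5)
```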
In an embodiment of the application, in the training method of the image recognition model, calculating the model loss value according to the labeled information of the training image and the prediction results of the effective target prediction branch network and the fuzzy target prediction branch network includes: respectively calculating a position regression loss value of the effective target, a category cross entropy loss value of the effective target and a category cross entropy loss value of the fuzzy target according to the labeling information of the training image and the prediction results of the effective target prediction branch network and the fuzzy target prediction branch network; and determining a model loss value according to the position regression loss value of the effective target, the class cross entropy loss value of the effective target and the class cross entropy loss value of the fuzzy target.
That is, for the position regression of the effective target, the classification of the effective target, and the judgment of the fuzzy target, a loss function is used to calculate a loss value for each, and the model loss value of the whole image recognition model is then computed by summing or weighted summing. The gradients are finally updated separately in the different branches so that each branch learns its own semantic features, making the final image recognition model more effective.
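The per-branch losses and their combination can be sketched as follows. Smooth-L1 for box regression and cross-entropy for the two classifications are common choices consistent with the description above, but the specific functions, and the equal weights, are assumptions for illustration.

```python
import numpy as np

def smooth_l1(pred, target):
    # Smooth-L1 loss, commonly used for box regression.
    d = np.abs(pred - target)
    return np.where(d < 1, 0.5 * d ** 2, d - 0.5).mean()

def cross_entropy(probs, label):
    # Class cross-entropy for a single sample.
    return -np.log(probs[label] + 1e-12)

# Hypothetical per-branch outputs for one candidate region.
loss_reg = smooth_l1(np.array([0.1, 0.2, 0.0, 0.1]), np.zeros(4))
loss_cls = cross_entropy(np.array([0.7, 0.2, 0.1]), label=0)   # valid-target class
loss_blur = cross_entropy(np.array([0.9, 0.1]), label=0)       # blur judgement

weights = (1.0, 1.0, 1.0)   # equal weights; a weighted sum is also possible
model_loss = (weights[0] * loss_reg
              + weights[1] * loss_cls
              + weights[2] * loss_blur)
```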
Fig. 2 shows a flow diagram of an image recognition method according to an embodiment of the present application.
As shown in fig. 2, the method includes:
and step S210, acquiring an image to be identified.
And step S220, identifying the image to be identified by using the image identification model to obtain an image identification result. The image recognition result includes at least one of: the position of the fuzzy object, the position of the effective object and the category of the effective object. The image recognition model is obtained by training with a training device of the image recognition model according to any one of the above embodiments.
Fig. 3 shows a detailed flow diagram of an image recognition method according to an embodiment of the present application. As shown in fig. 3, the image recognition model includes a backbone network composed of ResNeXt and FPN, an RPN, an effective target detection branch network, and a fuzzy target detection branch network. When an image to be recognized is recognized, the image is input into the backbone network, which outputs its multi-scale feature map; the multi-scale feature map is then input into the RPN, which outputs candidate regions, specifically in the form of candidate boxes. The two branch networks then each process the candidate boxes: the effective target detection branch network maps the candidate regions into the multi-scale feature map to obtain regions of interest (ROIs) and finally outputs the effective target classification prediction result and the effective target position regression prediction result through the fully connected layer, while the fuzzy target detection branch network convolves the images of the candidate regions through two ResNet blocks to obtain a fuzzy feature map, passes it through a global average pooling (GAP) layer, and finally outputs the fuzzy target prediction result.
Fig. 4 illustrates an image recognition effect graph according to an embodiment of the present application. As can be seen, due to rain, the fine-grained category is difficult to distinguish in the blurred current frame; through the judgment of the fuzzy target branch network, it is output to manual review, where reviewers judge the category from the trajectories of the preceding and following frames, while the effective targets are automatically recognized by the effective target prediction branch network, specifically as a traffic light and a no-parking traffic sign. Because the fuzzy targets output by the fuzzy prediction branch network effectively flag targets that are hard to classify at fine granularity, the accuracy of recognizing map acquisition images is improved without reducing recall.
Specifically, the five targets shown from left to right in fig. 4 are: a traffic light with a confidence of 0.999478877; a fuzzy target with a confidence of 0.874475777; a traffic light with a confidence of 0.99915278; a fuzzy target with a confidence of 0.91761905; and a no-parking traffic sign with a confidence of 0.993221879.
In a set of controlled experiments, trained on 11142 images and tested on 7450 images, considering only targets larger than 30 x 30 pixels, the comparison data for the first scheme (V1), the second scheme (V2), and the scheme using the method of fig. 3 (V3) are shown in the following table.
Model | Recall rate | Accuracy rate
V1 | 11377/11955 = 95.17% | 11377/13991 = 81.32%
V2 | 10614/11955 = 88.78% | 10641/12660 = 84.05%
V3 | 11391/11955 = 95.28% | 11391/12985 = 87.72%
Here, the recall rate is the proportion of ground-truth samples that are predicted correctly, and the accuracy rate is the proportion of correct samples among the prediction results matched to the ground truth.
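The two metrics can be computed directly from the counts in the table; the V3 row is used as an example.

```python
def recall_precision(true_positives, ground_truths, predictions):
    """Recall = TP / all ground truths; accuracy (precision) = TP / all predictions."""
    return true_positives / ground_truths, true_positives / predictions

# V3 row of the table above: 11391 correct hits out of 11955 ground-truth
# targets and 12985 predictions.
recall, precision = recall_precision(11391, 11955, 12985)
# recall is about 95.28%, precision about 87.72%, matching the table.
```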
Fig. 5 is a schematic structural diagram illustrating an apparatus for training an image recognition model according to an embodiment of the present application. The image recognition model includes a backbone network, a region generation network, an effective target prediction branch network, and a fuzzy target prediction branch network. The training apparatus 500 of the image recognition model is configured to execute a plurality of iterative training stages and specifically includes:
A multi-scale feature map extraction unit 510, configured to extract the multi-scale feature map of the training image according to the backbone network. The multi-scale feature map here refers to image feature maps at multiple scales, possibly after processing such as feature fusion. Targets in an image can differ in size, and the details and the whole of the same target can exhibit different features, so extracting and recognizing features at different scales gives better results.
A candidate region determining unit 520, configured to determine candidate regions in the training image according to the region generation network based on the multi-scale feature map. Specifically, the Region Proposal Network (RPN) can give candidate regions in the form of candidate boxes.
And the effective target prediction unit 530 is used for predicting the effective target contained in the training image according to the effective target prediction branch network based on the multi-scale feature map and the candidate region. Wherein, the effective target prediction branch network can be realized by referring to a Faster regional convolutional neural network (Faster R-CNN), which can be regarded as fine identification of the effective target.
A fuzzy target prediction unit 540, configured to predict a fuzzy target contained in the training image according to the fuzzy target prediction branch network based on the candidate region. In the image recognition model of the present application, the effective target prediction branch network and the fuzzy target prediction branch network are thus parallel, with no sequential dependency between them. This avoids the difficulty that arises when one network's output serves as another network's input, where errors in the former network are hard for the latter to correct.
A training control unit 550, configured to calculate a loss function value according to the labeling information of the training image and prediction results of the effective target prediction branch network and the fuzzy target prediction branch network; and updating the parameters of the image recognition model according to the loss function values, or finishing the training.
For example, if the model loss value loss converges, the training is ended; otherwise, optimizing the image recognition model by using a gradient updating mode and the like so as to perform the next round of training.
As can be seen, in the training apparatus of the image recognition model shown in fig. 5, adding the fuzzy target prediction branch network allows the image recognition model to be trained and evaluated end to end; the model can effectively recognize whether a fuzzy target exists in live-action images such as road-acquisition images, improving the recognition accuracy and recall rate of effective targets. In particular, the apparatus can be connected to a manual review process: during data production, images containing fuzzy targets are handed over to manual review, while effective targets are output automatically as high-quality fine recognition results, further improving manual review efficiency.
In an embodiment of the present application, in the training apparatus for an image recognition model, the backbone network includes a cascaded multi-scale feature extraction network and a multi-scale feature fusion network; a multi-scale feature map extraction unit 510, configured to extract an image feature map of the training image in multiple scales according to a multi-scale feature extraction network; and according to the multi-scale feature fusion network, carrying out feature fusion processing on the image feature map of the training image under multiple scales to obtain the multi-scale feature map of the training image.
In an embodiment of the present application, in the training apparatus for the image recognition model, the multi-scale feature fusion network is specifically a feature pyramid network FPN; the multi-scale feature map extraction unit 510 is configured to extract, in a bottom-up manner, a plurality of image feature maps with successively decreasing scales to obtain an image feature map pyramid of the training image, and to perform feature fusion processing on the image feature maps at multiple scales according to the multi-scale feature fusion network, using the FPN to process the image feature map pyramid top-down.
In an embodiment of the present application, in the training apparatus for an image recognition model, the candidate region determining unit 520 is configured to generate, by the region generation network, anchor point samples based on the multi-scale feature map according to a preset number and/or proportion of anchor points; candidate regions are determined based on the confidence of the anchor samples.
In an embodiment of the present application, in the training apparatus for an image recognition model, the effective target prediction unit 530 is configured to generate a feature map of a candidate region based on a multi-scale feature map and the candidate region; and obtaining the position regression prediction result of the effective target and the classification prediction result of the effective target by the effective target prediction branch network according to the feature map of the candidate region.
In an embodiment of the application, in the training apparatus for the image recognition model, the fuzzy target prediction unit 540, configured to predict the fuzzy target included in the training image according to the fuzzy target prediction branch network, includes: and performing two-classification prediction on whether the image of the candidate region contains the fuzzy target or not by the fuzzy target prediction branch network.
In an embodiment of the present application, in the training apparatus for the image recognition model, the fuzzy target prediction unit 540 is configured to extract a fuzzy feature map of the image in the candidate region through a plurality of serially connected residual modules in the fuzzy target prediction branch network, perform global average pooling on the extracted fuzzy feature map to obtain a feature vector with one element per channel of the fuzzy feature map, and perform binary classification prediction on whether a fuzzy target is included according to the feature vector.
In an embodiment of the present application, in the training apparatus for the image recognition model, the training control unit 550 is configured to calculate a position regression loss value of the effective target, a class cross entropy loss value of the effective target, and a class cross entropy loss value of the fuzzy target according to the labeling information of the training image and the prediction results of the effective target prediction branch network and the fuzzy target prediction branch network; and determining a model loss value according to the position regression loss value of the effective target, the class cross entropy loss value of the effective target and the class cross entropy loss value of the fuzzy target.
Fig. 6 shows a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application. As shown in fig. 6, the image recognition apparatus 600 includes:
an image obtaining unit 610, configured to obtain an image to be identified.
An image recognition unit 620, configured to recognize the image to be recognized by using the image recognition model obtained by training with the training apparatus 500 for image recognition models according to any of the above embodiments, so as to obtain an image recognition result, where the image recognition result includes at least one of the following: the position of the fuzzy target, the position of the effective target, and the category of the effective target.
It should be noted that the training apparatus for the image recognition model and the image recognition apparatus shown in the foregoing embodiments may be used to execute, respectively, the training method for the image recognition model and the image recognition method in the foregoing embodiments, and details are not repeated here.
To sum up, the technical solution of the present application provides an image recognition model including a backbone network, a region generation network, an effective target prediction branch network, and a fuzzy target prediction branch network, together with a method for training it. Specifically, after the multi-scale feature map of a training image is extracted through the backbone network, candidate regions in the training image are determined according to the region generation network based on the multi-scale feature map, and two branches then perform different processing: the effective target contained in the training image is predicted according to the effective target prediction branch network based on the multi-scale feature map and the candidate regions, while the fuzzy target contained in the training image is predicted according to the fuzzy target prediction branch network based on the candidate regions. A model loss value is calculated according to the labeling information of the training image and the prediction results of the effective target prediction branch network and the fuzzy target prediction branch network, and the parameters of the image recognition model are updated according to the model loss value, or the training ends. This scheme trains an end-to-end image recognition model that can effectively recognize whether fuzzy targets exist in live-action images such as road-acquisition images, improving the recognition accuracy and recall rate of effective targets.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. In addition, this application is not directed to any particular programming language. It is appreciated that the subject matter described herein can be implemented in accordance with a variety of programming languages, and that any descriptions above in specific languages are provided for disclosure of enablement and best mode of the present application.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the application and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the training means and the image recognition means of the image recognition model according to embodiments of the present application. The present application may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
For example, fig. 7 shows a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 700 comprises a processor 710 and a memory 720 arranged to store computer executable instructions (computer readable program code). The memory 720 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. The memory 720 has a storage space 730 storing computer readable program code 731 for performing any of the method steps described above. For example, the storage space 730 for storing the computer readable program code may comprise respective computer readable program codes 731 for respectively implementing various steps in the above method. The computer readable program code 731 can be read from or written to one or more computer program products. These computer program products comprise a program code carrier such as a hard disk, a Compact Disc (CD), a memory card or a floppy disk. Such a computer program product is typically a computer readable storage medium such as described in fig. 8.
FIG. 8 shows a schematic diagram of a computer-readable storage medium according to an embodiment of the present application. The computer readable storage medium 800 stores computer readable program code 731 for performing the steps of the method according to the present application, which is readable by the processor 710 of the electronic device 700 and which, when executed by the electronic device 700, causes the electronic device 700 to perform the steps of the method described above, in particular the computer readable program code 731 stored by the computer readable storage medium is capable of performing the method for training an image recognition model or the method for image recognition shown in any of the embodiments described above. The computer readable program code 731 may be compressed in a suitable form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

Claims (13)

1. A training method of an image recognition model, characterized in that the image recognition model comprises a backbone network, a region generation network, an effective target prediction branch network and a fuzzy target prediction branch network, the method comprises a plurality of iterative training stages, and each training stage comprises:
extracting a multi-scale feature map of a training image according to the backbone network;
determining a candidate region in a training image according to the region generation network based on the multi-scale feature map;
predicting effective targets contained in the training images according to the effective target prediction branch network based on the multi-scale feature map and the candidate regions;
predicting a fuzzy target contained in the training image according to the fuzzy target prediction branch network based on the candidate region;
calculating a model loss value according to the labeling information of the training image and the prediction results of the effective target prediction branch network and the fuzzy target prediction branch network;
and updating the parameters of the image recognition model according to the model loss value, or finishing training.
2. The training method of the image recognition model according to claim 1, wherein the backbone network comprises a cascade of a multi-scale feature extraction network and a multi-scale feature fusion network;
the extracting of the multi-scale feature map of the training image according to the backbone network comprises: extracting an image feature map of the training image under multiple scales according to the multi-scale feature extraction network; and according to the multi-scale feature fusion network, carrying out feature fusion processing on the image feature map of the training image under multiple scales to obtain the multi-scale feature map of the training image.
3. The training method of claim 2, wherein the multi-scale feature fusion network is a feature pyramid network (FPN), and the extracting of the image feature maps of the training image at multiple scales comprises:
extracting a plurality of image feature maps of successively decreasing scale in a bottom-up manner, to obtain an image feature map pyramid of the training image;
and the performing of feature fusion on the image feature maps at the multiple scales comprises: processing the image feature map pyramid in a top-down manner using the FPN.
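The bottom-up pyramid and top-down fusion of claim 3 can be illustrated numerically. The lateral 1x1 convolutions and learned weights of a real FPN are omitted (identity stand-ins), and nearest-neighbour upsampling via `np.kron` is an assumption for brevity:

```python
import numpy as np

def fpn_fuse(pyramid):
    """Top-down fusion: upsample the coarser map 2x and add it to the next
    finer level. `pyramid` is ordered fine -> coarse (bottom-up order)."""
    fused = [pyramid[-1]]                        # start from the coarsest level
    for feat in reversed(pyramid[:-1]):
        up = np.kron(fused[0], np.ones((2, 2)))  # nearest-neighbour 2x upsample
        fused.insert(0, feat + up[: feat.shape[0], : feat.shape[1]])
    return fused

# Bottom-up pyramid with scales halving at each level, as in claim 3.
p = [np.ones((8, 8)), np.ones((4, 4)), np.ones((2, 2))]
merged = fpn_fuse(p)
```

Each fused level keeps the spatial size of its bottom-up counterpart while accumulating coarser-scale context from above.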
4. The training method of claim 1, wherein the determining of candidate regions in the training image using the region generation network, based on the multi-scale feature map, comprises:
generating, by the region generation network, anchor samples based on the multi-scale feature map according to a preset number and/or ratio of anchors;
and determining the candidate regions according to the confidence scores of the anchor samples.
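Claim 4's anchor generation and confidence-based selection might look like the sketch below. The scale/ratio values, the `top_k` cut-off and the score source are illustrative assumptions; a real region generation network would predict the confidence scores and typically apply non-maximum suppression as well:

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride, scales=(8,), ratios=(0.5, 1.0, 2.0)):
    """Anchors centred on each feature-map cell; the counts/ratios here stand
    in for the preset quantities referred to in claim 4."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s * stride * np.sqrt(r)   # width/height keep area s*stride
                    h = s * stride / np.sqrt(r)   # squared while aspect ratio is r
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

def select_candidates(anchors, scores, top_k=2):
    """Keep the top-k anchors by objectness confidence (claim 4, second step)."""
    order = np.argsort(scores)[::-1][:top_k]
    return anchors[order]

a = generate_anchors(2, 2, stride=4)               # 2x2 cells x 3 ratios = 12 anchors
cand = select_candidates(a, np.arange(len(a), dtype=float), top_k=2)
```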
5. The training method of claim 1, wherein the predicting of effective targets contained in the training image using the effective target prediction branch network, based on the multi-scale feature map and the candidate regions, comprises:
generating a feature map for each candidate region based on the multi-scale feature map and the candidate regions;
and obtaining, by the effective target prediction branch network, a position regression prediction result and a classification prediction result for the effective targets according to the feature maps of the candidate regions.
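A minimal sketch of claim 5's effective target branch, assuming a crude crop-and-average stand-in for RoIAlign and placeholder linear heads in place of learned regressor/classifier weights:

```python
import numpy as np

def roi_feature(feat, roi, out_size=2):
    """Crop the ROI from the feature map and average-pool it into a fixed
    out_size x out_size grid (a crude stand-in for RoIAlign)."""
    x0, y0, x1, y1 = [int(v) for v in roi]
    patch = feat[y0:y1, x0:x1]
    rows = np.array_split(patch, out_size, axis=0)
    return np.array([[b.mean() for b in np.array_split(r, out_size, axis=1)]
                     for r in rows])

def effective_target_head(feat, rois, n_classes=3):
    """Per-ROI box-regression offsets and class scores (claim 5); the trivial
    heads below are placeholders for learned parameters."""
    out_boxes, out_cls = [], []
    for roi in rois:
        v = roi_feature(feat, roi).ravel()
        out_boxes.append(np.ones(4) * v.mean())          # stand-in regressor
        out_cls.append(np.arange(n_classes) * v.mean())  # stand-in classifier
    return np.array(out_boxes), np.array(out_cls)

feat = np.ones((8, 8))
boxes, cls = effective_target_head(feat, [[0, 0, 4, 4], [2, 2, 6, 6]])
```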
6. The training method of claim 1, wherein the predicting of fuzzy targets contained in the training image using the fuzzy target prediction branch network, based on the candidate regions, comprises:
performing, by the fuzzy target prediction branch network, a binary classification prediction of whether the image of each candidate region contains a fuzzy target.
7. The training method of claim 6, wherein the performing, by the fuzzy target prediction branch network, of the binary classification prediction of whether the image of a candidate region contains a fuzzy target comprises:
extracting a fuzzy feature map of the image of the candidate region using a plurality of residual modules connected in series in the fuzzy target prediction branch network, performing global average pooling on the extracted fuzzy feature map to obtain a feature vector whose length equals the channel count of the fuzzy feature map, and performing the binary classification prediction of whether a fuzzy target is contained according to the feature vector.
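Claim 7's pipeline (serial residual modules, global average pooling, binary prediction) can be sketched as below; the residual transformation, the block count and the final linear layer are toy assumptions standing in for learned layers:

```python
import numpy as np

def residual_block(x, weight=0.1):
    """Identity shortcut plus a toy transformation; a real block would be a
    conv-BN-ReLU stack."""
    return x + weight * x

def fuzzy_branch(roi_feat, n_blocks=3):
    x = roi_feat                        # shape (C, H, W)
    for _ in range(n_blocks):           # residual modules in series (claim 7)
        x = residual_block(x)
    vec = x.mean(axis=(1, 2))           # global average pooling -> length-C vector
    logit = vec.sum()                   # stand-in for a learned linear layer
    prob = 1.0 / (1.0 + np.exp(-logit)) # sigmoid for the binary prediction
    return prob                         # P(region contains a fuzzy target)

p = fuzzy_branch(np.ones((4, 3, 3)))
```

Note that global average pooling collapses each channel's H x W plane to a single scalar, so the pooled vector's length equals the channel count, not the feature-map size.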
8. The training method of any one of claims 1 to 7, wherein the calculating of the model loss value from the annotation information of the training image and the prediction results of the effective target prediction branch network and the fuzzy target prediction branch network comprises:
calculating a position regression loss value for the effective targets, a class cross-entropy loss value for the effective targets, and a class cross-entropy loss value for the fuzzy targets from the annotation information of the training image and the prediction results of the two branch networks;
and determining the model loss value from the position regression loss value for the effective targets, the class cross-entropy loss value for the effective targets, and the class cross-entropy loss value for the fuzzy targets.
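Claim 8's three-term loss can be composed as follows, assuming smooth-L1 for the position regression term (a common choice the patent does not mandate) and equal weights on the three terms (also an assumption):

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 regression loss: quadratic near zero, linear beyond beta."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d * d / beta, d - 0.5 * beta).mean()

def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one sample."""
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[label]

def model_loss(box_pred, box_gt, cls_logits, cls_gt, blur_logits, blur_gt):
    """Sum of the effective-target box-regression loss, the effective-target
    class cross-entropy and the fuzzy-target class cross-entropy (claim 8)."""
    return (smooth_l1(box_pred, box_gt)
            + cross_entropy(cls_logits, cls_gt)
            + cross_entropy(blur_logits, blur_gt))

loss = model_loss(np.zeros(4), np.zeros(4),       # perfect box -> 0 regression loss
                  np.array([2.0, 0.0]), 0,        # confident correct class
                  np.array([0.0, 0.0]), 1)        # uncertain fuzzy prediction
```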
9. An image recognition method, comprising:
acquiring an image to be recognized;
and recognizing the image to be recognized using an image recognition model trained by the training method of any one of claims 1 to 8, to obtain an image recognition result, wherein the image recognition result comprises at least one of the following: the position of a fuzzy target, the position of an effective target, and the category of an effective target.
10. A training apparatus for an image recognition model, wherein the image recognition model comprises a backbone network, a region generation network, an effective target prediction branch network and a fuzzy target prediction branch network, the apparatus being configured to perform a plurality of iterative training stages, and the apparatus comprising:
a multi-scale feature map extraction unit, configured to extract a multi-scale feature map of a training image using the backbone network;
a candidate region determination unit, configured to determine candidate regions in the training image using the region generation network, based on the multi-scale feature map;
an effective target prediction unit, configured to predict effective targets contained in the training image using the effective target prediction branch network, based on the multi-scale feature map and the candidate regions;
a fuzzy target prediction unit, configured to predict fuzzy targets contained in the training image using the fuzzy target prediction branch network, based on the candidate regions;
and a training control unit, configured to calculate a loss function value from the annotation information of the training image and the prediction results of the effective target prediction branch network and the fuzzy target prediction branch network, and to update the parameters of the image recognition model according to the loss function value, or to end training.
11. An image recognition apparatus, comprising:
an image acquisition unit, configured to acquire an image to be recognized;
and an image recognition unit, configured to recognize the image to be recognized using an image recognition model trained by the training apparatus of claim 10, to obtain an image recognition result, wherein the image recognition result comprises at least one of the following: the position of a fuzzy target, the position of an effective target, and the category of an effective target.
12. An electronic device, comprising: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the method of any one of claims 1 to 8, or the method of claim 9.
13. A computer-readable storage medium storing one or more programs which, when executed by a processor, implement the method of any one of claims 1 to 8, or the method of claim 9.
CN202010165382.9A 2020-03-11 2020-03-11 Image recognition method and device and corresponding model training method and device Pending CN111428875A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010165382.9A CN111428875A (en) 2020-03-11 2020-03-11 Image recognition method and device and corresponding model training method and device


Publications (1)

Publication Number Publication Date
CN111428875A true CN111428875A (en) 2020-07-17

Family

ID=71546274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010165382.9A Pending CN111428875A (en) 2020-03-11 2020-03-11 Image recognition method and device and corresponding model training method and device

Country Status (1)

Country Link
CN (1) CN111428875A (en)


Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180157938A1 (en) * 2016-12-07 2018-06-07 Samsung Electronics Co., Ltd. Target detection method and apparatus
CN108710885A (en) * 2018-03-29 2018-10-26 百度在线网络技术(北京)有限公司 The detection method and device of target object
CN108805016A (en) * 2018-04-27 2018-11-13 新智数字科技有限公司 A kind of head and shoulder method for detecting area and device
CN109171605A (en) * 2018-08-29 2019-01-11 合肥工业大学 Intelligent edge calculations system with target positioning and hysteroscope video enhancing processing function
CN109583425A (en) * 2018-12-21 2019-04-05 西安电子科技大学 A kind of integrated recognition methods of the remote sensing images ship based on deep learning
CN109993710A (en) * 2019-03-20 2019-07-09 西北工业大学 A kind of underwater picture denoising method based on generation confrontation network
CN110097145A (en) * 2019-06-20 2019-08-06 江苏德劭信息科技有限公司 One kind being based on CNN and the pyramidal traffic contraband recognition methods of feature
CN110210362A (en) * 2019-05-27 2019-09-06 中国科学技术大学 A kind of method for traffic sign detection based on convolutional neural networks
CN110245678A (en) * 2019-05-07 2019-09-17 华中科技大学 A kind of isomery twinned region selection network and the image matching method based on the network
CN110503097A (en) * 2019-08-27 2019-11-26 腾讯科技(深圳)有限公司 Training method, device and the storage medium of image processing model
CN110569721A (en) * 2019-08-01 2019-12-13 平安科技(深圳)有限公司 Recognition model training method, image recognition method, device, equipment and medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Yao Hongge; Wang Cheng; Yu Jun; Bai Xiaojun; Li Wei: "Small-target ship recognition in complex satellite images" *
Fan Xing; Shen Chao; Xu Jiang; Lian Xinyu; Liu Zhanwen: "Traffic sign recognition method based on a cascaded end-to-end deep architecture" *
Wang Jian; Wang Kai; Liu Gang; Zhou Wenqing; Zhou Zikai: "Pin defect recognition based on generative adversarial networks and RetinaNet" *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950528A (en) * 2020-09-02 2020-11-17 北京猿力未来科技有限公司 Chart recognition model training method and device
CN111950528B (en) * 2020-09-02 2023-10-31 北京猿力未来科技有限公司 Graph recognition model training method and device
CN112232368A (en) * 2020-09-10 2021-01-15 浙江大华技术股份有限公司 Target recognition model training method, target recognition method and related device thereof
CN112232368B (en) * 2020-09-10 2023-09-01 浙江大华技术股份有限公司 Target recognition model training method, target recognition method and related devices thereof
CN112183594A (en) * 2020-09-17 2021-01-05 微民保险代理有限公司 Bill image processing method and device, storage medium and electronic equipment
CN112529003A (en) * 2020-12-09 2021-03-19 安徽工业大学 Instrument panel digital identification method based on fast-RCNN
CN112614092A (en) * 2020-12-11 2021-04-06 北京大学 Spine detection method and device
CN112700428A (en) * 2021-01-08 2021-04-23 北京网瑞达科技有限公司 Method and device for identifying backboard element of switch
CN113128553A (en) * 2021-03-08 2021-07-16 北京航空航天大学 Target detection method, device and equipment based on target architecture and storage medium
CN112949767B (en) * 2021-04-07 2023-08-11 北京百度网讯科技有限公司 Sample image increment, image detection model training and image detection method
CN112949767A (en) * 2021-04-07 2021-06-11 北京百度网讯科技有限公司 Sample image increment, image detection model training and image detection method
WO2022213718A1 (en) * 2021-04-07 2022-10-13 北京百度网讯科技有限公司 Sample image increment method, image detection model training method, and image detection method
WO2022247343A1 (en) * 2021-05-28 2022-12-01 北京百度网讯科技有限公司 Recognition model training method and apparatus, recognition method and apparatus, device, and storage medium
CN113111872B (en) * 2021-06-16 2022-04-05 智道网联科技(北京)有限公司 Training method and device of image recognition model, electronic equipment and storage medium
CN113111872A (en) * 2021-06-16 2021-07-13 智道网联科技(北京)有限公司 Training method and device of image recognition model, electronic equipment and storage medium
CN113505820A (en) * 2021-06-23 2021-10-15 北京阅视智能技术有限责任公司 Image recognition model training method, device, equipment and medium
CN113505820B (en) * 2021-06-23 2024-02-06 北京阅视智能技术有限责任公司 Image recognition model training method, device, equipment and medium
CN113516639B (en) * 2021-06-30 2023-05-12 哈尔滨工业大学(深圳) Training method and device for oral cavity abnormality detection model based on panoramic X-ray film
CN113516639A (en) * 2021-06-30 2021-10-19 哈尔滨工业大学(深圳) Panoramic X-ray film-based oral cavity anomaly detection model training method and device
CN113591771A (en) * 2021-08-10 2021-11-02 武汉中电智慧科技有限公司 Training method and device for multi-scene power distribution room object detection model
CN113591771B (en) * 2021-08-10 2024-03-08 武汉中电智慧科技有限公司 Training method and equipment for object detection model of multi-scene distribution room
CN115100630B (en) * 2022-07-04 2023-07-14 小米汽车科技有限公司 Obstacle detection method, obstacle detection device, vehicle, medium and chip
CN115100630A (en) * 2022-07-04 2022-09-23 小米汽车科技有限公司 Obstacle detection method, obstacle detection device, vehicle, medium, and chip
CN117409193A (en) * 2023-12-14 2024-01-16 南京深业智能化系统工程有限公司 Image recognition method, device and storage medium under smoke scene
CN117409193B (en) * 2023-12-14 2024-03-12 南京深业智能化系统工程有限公司 Image recognition method, device and storage medium under smoke scene

Similar Documents

Publication Publication Date Title
CN111428875A (en) Image recognition method and device and corresponding model training method and device
CN111104903B (en) Depth perception traffic scene multi-target detection method and system
CN109543691A (en) Ponding recognition methods, device and storage medium
CN113139543B (en) Training method of target object detection model, target object detection method and equipment
CN113780296A (en) Remote sensing image semantic segmentation method and system based on multi-scale information fusion
CN110363211B (en) Detection network model and target detection method
CN114092917B (en) MR-SSD-based shielded traffic sign detection method and system
CN113468967A (en) Lane line detection method, device, equipment and medium based on attention mechanism
CN104156734A (en) Fully-autonomous on-line study method based on random fern classifier
WO2020258077A1 (en) Pedestrian detection method and device
CN110781980B (en) Training method of target detection model, target detection method and device
CN112766110A (en) Training method of object defect recognition model, object defect recognition method and device
CN113343985B (en) License plate recognition method and device
CN115063786A (en) High-order distant view fuzzy license plate detection method
Li et al. Gated auxiliary edge detection task for road extraction with weight-balanced loss
CN112052907A (en) Target detection method and device based on image edge information and storage medium
CN113255580A (en) Method and device for identifying sprinkled objects and vehicle sprinkling and leaking
CN113269119A (en) Night vehicle detection method and device
CN111401359A (en) Target identification method and device, electronic equipment and storage medium
CN110222652B (en) Pedestrian detection method and device and electronic equipment
CN116843983A (en) Pavement disease recognition method, model training method, electronic equipment and medium
CN116977979A (en) Traffic sign recognition method, system, equipment and storage medium
CN111738069A (en) Face detection method and device, electronic equipment and storage medium
CN111160282A (en) Traffic light detection method based on binary Yolov3 network
CN114092818A (en) Semantic segmentation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200717