CN113052159A

CN113052159A - Image identification method, device, equipment and computer storage medium

Info

Publication number: CN113052159A
Application number: CN202110400954.1A
Authority: CN
Inventors: 林东青; 马军; 陈涛
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Shanxi Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Shanxi Co Ltd
Priority date: 2021-04-14
Filing date: 2021-04-14
Publication date: 2021-06-29
Anticipated expiration: 2041-04-14
Also published as: CN113052159B

Abstract

The embodiment of the application provides an image identification method, an image identification device, image identification equipment and a computer storage medium, relates to the field of image detection, and aims to improve the accuracy of image identification. The method comprises the following steps: acquiring an image to be identified, wherein at least one object to be identified is in the image to be identified; inputting an image to be recognized into a first network in a pre-trained image recognition model, and determining text characteristics of the image to be recognized; inputting an image to be recognized into a second network in the image recognition model, and determining a pooling characteristic image and a spatial relationship characteristic of at least one object to be recognized; performing feature fusion on text features of an image to be recognized, the pooled feature images of at least one object to be recognized and spatial relationship features, and determining a shared feature image corresponding to the image to be recognized; and inputting the shared characteristic image into a third network in the image recognition model, and determining the recognition information of the image to be recognized, wherein the recognition information comprises the category information and the position information of each object to be recognized.

Description

Image identification method, device, equipment and computer storage medium

Technical Field

The present application relates to the field of image detection, and in particular, to an image recognition method, apparatus, device, and computer storage medium.

Background

The identification of target objects in images is one of the important research directions in the field of computer vision, and plays an important role in the fields of public safety, road traffic, video monitoring and the like. In the prior art, the spatial relationship characteristics of a target object in an image can be utilized to identify the target object; the recognition accuracy of the neural network to the target object can be improved by reasonably matching the image feature weights in the neural network.

However, in the prior art, due to the complex diversity of scenes contained in the image and the uncertainty of the position of the target to be detected in the image, the method cannot adapt to more scenes, and further cannot improve the accuracy of image identification.

Disclosure of Invention

The embodiment of the application provides an image identification method, an image identification device, image identification equipment and a computer storage medium, which are used for improving the accuracy of image identification.

In a first aspect, an embodiment of the present application provides an image recognition method, where the method includes:

acquiring an image to be identified, wherein at least one object to be identified is in the image to be identified;

inputting an image to be recognized into a first network in a pre-trained image recognition model, and determining text characteristics of the image to be recognized;

inputting an image to be recognized into a second network in the image recognition model, and determining a pooling characteristic image and a spatial relationship characteristic of at least one object to be recognized;

performing feature fusion on text features of an image to be recognized, the pooled feature images of at least one object to be recognized and spatial relationship features, and determining a shared feature image corresponding to the image to be recognized;

and inputting the shared characteristic image into a third network in the image recognition model, and determining the recognition information of the image to be recognized, wherein the recognition information comprises the category information and the position information of each object to be recognized.

In a second aspect, an embodiment of the present application provides an image recognition apparatus, including:

the device comprises a first acquisition module, a second acquisition module and a recognition module, wherein the first acquisition module is used for acquiring an image to be recognized, and at least one object to be recognized is arranged in the image to be recognized;

the first determining module is used for inputting the image to be recognized into a first network in a pre-trained image recognition model and determining the text characteristics of the image to be recognized;

the second determination module is used for inputting the image to be recognized into a second network in the image recognition model and determining the pooling characteristic image and the spatial relationship characteristic of at least one object to be recognized;

the fusion module is used for performing feature fusion on the text feature of the image to be recognized, the pooling feature image of at least one object to be recognized and the spatial relationship feature to determine a shared feature image corresponding to the image to be recognized;

and the identification module is used for inputting the shared characteristic image to a third network in the image identification model and determining the identification information of the image to be identified, wherein the identification information comprises the category information and the position information of each object to be identified.

In a third aspect, an embodiment of the present application provides an image recognition apparatus, including:

a processor, and a memory storing computer program instructions; the processor reads and executes the computer program instructions to implement the image recognition method as provided by the first aspect of the embodiments of the present application.

In a fourth aspect, an embodiment of the present application provides a computer storage medium, on which computer program instructions are stored, and when executed by a processor, the computer program instructions implement the image recognition method provided in the first aspect of the embodiment of the present application.

The image recognition method provided by the embodiment of the application extracts the text features of the image to be detected, and the pooling feature image and the spatial relationship features of at least one first target object in the image to be detected, performs feature fusion on the three features, inputs the fused shared feature map into a third network in an image recognition model, and determines the recognition information of the image to be recognized, wherein the recognition information comprises the category information and the position information of each object to be recognized. Compared with the prior art, the method has the advantages that the complementation of the image information is realized through feature fusion, the defects of the image feature information on details and scenes are overcome while redundant noise is avoided, and meanwhile, the extraction of the text features can reflect the difference and the commonality of the image in different scenes, so that the method is suitable for more complex scenes and improves the accuracy of image identification.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the embodiments of the present application will be briefly described below, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic flowchart of a training method of an image recognition model according to an embodiment of the present disclosure;

FIG. 2 is a schematic structural diagram of a multi-modal feature fusion module provided in an embodiment of the present application;

fig. 3 is a schematic flowchart of an image recognition method according to an embodiment of the present application;

fig. 4 is a schematic flowchart of an image recognition apparatus according to an embodiment of the present application;

fig. 5 is a schematic structural diagram of an image recognition device according to an embodiment of the present application.

Detailed Description

Features and exemplary embodiments of various aspects of the present application will be described in detail below, and in order to make objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are intended to be illustrative only and are not intended to be limiting. It will be apparent to one skilled in the art that the present application may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present application by illustrating examples thereof.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The image recognition algorithm is one of important research directions in the field of computer vision, and plays an important role in the fields of public safety, road traffic, video monitoring and the like. In recent years, image recognition accuracy has been increasing due to the development of image recognition algorithms based on deep learning.

In the prior art, image recognition is performed in two ways:

multi-view image target detection method based on visual saliency

And calculating saliency maps of a plurality of view angle images aiming at a scene with a foreground target not being shielded, projecting the saliency maps of two view angles to a middle target view angle by utilizing a spatial relation between the view angles, and fusing the projected saliency maps and the saliency maps of the middle view angle to obtain a fused saliency map. The area shielded by the foreground object cannot be really mapped to a target view angle during projection, a projection hole is generated around the foreground object in the projection saliency map, and the projection hole area is regarded as a background area in the fusion saliency map. And dividing image areas by using the multi-view projection holes, wherein the areas between the projection holes and the image edges and the areas between the projection holes of different foreground objects are both taken as background areas. In the fusion significance map, the significance value of the background area obtained above is set to be zero, and the target object with clear edge and no background interference can be obtained after binarization.

Second, small target detection algorithm under complex background

By taking the thought of the feature pyramid algorithm as a reference, the features of the Conv4-3 layer are fused with the features of the Conv7 and the Conv3-3 layer, and the number of default boxes corresponding to each position of the fused feature map is increased. A clipping-weight distribution network (SEnet) is added in the network structure, and weight distribution is carried out on the feature channels of each layer, so that useful feature weights are improved, and invalid feature weights are suppressed. Meanwhile, in order to enhance the generalization ability of the network, a series of enhancement processing is carried out on the training data set.

The two algorithms are common technologies for detecting and identifying the target object in the image, however, the conventional target detection method has poor robustness in different application scenes due to the complex diversity of the scenes contained in the image and the uncertainty of the position of the target to be detected in the image. According to the multi-view image target detection method based on visual saliency, only the spatial relation characteristic of the target to be detected in the image is considered, but the information supplement is not performed by utilizing various characteristic information in the image so as to improve the accuracy of final image recognition. The small target detection algorithm under the complex background does not consider the context information in the complex background and the spatial relationship of the target to be detected, has a narrow application range, mainly improves the detection and identification accuracy of the small target in the image, and omits the application of the algorithm in more complex scenes.

Based on this, the embodiment of the application provides an image identification method, which realizes the complementation of image information through feature fusion, overcomes the defects of image feature information on details and scenes while avoiding redundant noise, and simultaneously, the extraction of text features can reflect the difference and the commonality of images in different scenes, thereby being applicable to more complex scenes and improving the accuracy of image identification.

In the image recognition method according to the embodiment of the present application, since it is necessary to recognize an image using an image recognition model trained in advance, it is necessary to train the image recognition model before performing image recognition using the image recognition model. Therefore, a specific implementation of the training method for an image recognition model provided in the embodiments of the present application is first described below with reference to the drawings.

As shown in fig. 1, an embodiment of the present application provides a training method for an image recognition model, which includes obtaining a sample image, fusing information such as a pooling feature map, text features, and spatial relationship features extracted from the sample image to form a shared feature map with richer information, and performing iterative training on a preset image recognition model through a classification and regression detection algorithm until a training stop condition is satisfied. The method can be realized by the following steps:

firstly, acquiring a plurality of images to be marked.

In some embodiments, a plurality of images to be annotated can be acquired by a vehicle-mounted camera or a plurality of images to be annotated can be obtained by performing frame extraction processing on an acquired video.

And secondly, manually labeling the images to be labeled, wherein the content to be labeled is label identification information of the target object, the label identification information comprises classification information and position information of the target identification object, and the position information is a coordinate value surrounding a boundary frame of the target object.

In some embodiments, the image captured by the vehicle-mounted camera mainly takes road traffic as a main scene, so that the labeled object of the image to be labeled can include target objects such as pedestrians, riders, bicycles, motorcycles, automobiles, trucks, buses, trains, traffic signs, traffic lights and the like, and the labeling result is the category of the target object and the coordinate value of a boundary frame surrounding the target object; and simultaneously, performing text annotation on each image to be annotated from three angles of time, place and weather.

Specifically, for each image to be annotated, from a temporal perspective, selectable values include day, dusk/dawn, night; annotating from a location perspective, optional values including highways, city streets, residences, parking lots, gas stations, tunnels; from a weather perspective, the selectable values include snow, cloudy, sunny, cloudy, rain, fog.

And thirdly, integrating the artificially labeled images and the labeling information corresponding to each image into a training sample set, wherein the training sample set comprises a plurality of sample image groups.

It should be noted that, because the image recognition model needs to perform iterative training for multiple times to adjust the loss function value until the loss function value meets the training stop condition, the trained image recognition model is obtained, and in each iterative training, if only one sample image is input, the sample amount is too small to be beneficial to the training adjustment of the image recognition model, so that the training sample set is divided into multiple sample image groups, each sample image group contains multiple sample images, and the iterative training of the image recognition model is performed by using the multiple sample image groups in the training sample set.

And fourthly, training the image recognition model by utilizing the sample image group in the training sample set until the training stopping condition is met, and obtaining the trained image recognition model. The method specifically comprises the following steps:

and 4.1, extracting a sample pooling feature map and sample spatial relationship features of the identifiable objects in the sample image by using a second network in the preset image identification model.

In some embodiments, the second network in the preset image recognition model may be a fast regional convolutional neural network, fast RCNN network, which is not limited in this application.

Specifically, obtaining a sample pooling feature map and a sample spatial relationship feature of an identifiable object in a sample image can be implemented by the following steps:

and 4.1.1, uniformly adjusting the sample images in the training set to a fixed size of 1000 × 600 pixels to obtain the sample images after size adjustment.

And 4.1.2, inputting the sample image group after size adjustment into a depth residual error network ResNet, a region generation network RPN and a fast region convolution neural network to extract image features, and obtaining a pooling feature map.

1) Firstly, inputting a sample image after being adjusted in size into a convolution layer conv1 of 7 multiplied by 64, and then extracting an original feature map of the sample image through convolution layer conv2_ x, conv3_ x, conv4_ x, conv5_ x and a full connection layer fc in sequence;

2) inputting an original feature map output by conv4_ x in a ResNet network structure into an area extraction network RPN, and picking out the top 300 anchor frames (anchors) with the highest score and candidate frames corresponding to the anchor frames from a prediction result;

3) and inputting the position mapping chart of the 300 candidate frames into a region-of-interest Pooling layer ROI Pooling in the fast regional convolutional neural network by comparing with the original feature chart output by conv4_ x to obtain a fixed-size Pooling feature chart of the identifiable object.

4.1.3, calculating an Intersection over Intersection (IOU) between the candidate frames by using the coordinates of the 300 anchors and the candidate frames corresponding to the anchors, calculating the spatial relationship characteristic between the identifiable objects by the following formula 1,

F_r＝f(w,h,area,d_x,d_yIOU) equation 1

Wherein the content of the first and second substances,_wand h represents the width and height of the candidate box,_areaindicates the area of the candidate frame, d_xAnd d_yIs the horizontal and vertical distance between the geometric centers of the two candidate frames, IOU is the intersection ratio between the candidate frames, F (-) represents the activation function, F_rRepresenting predicted spatial relationship features between identifiable objects.

4.2, inputting the sample image into a first network in a preset image recognition model, determining at least one text vector according to the context information of the sample image, splicing the at least one text vector, and determining sample text characteristics F corresponding to the sample image_t。

It should be noted that the first network in the image recognition model may be a pre-training model such as Word2vec, Glove, or BERT; the text vector determined according to the context information of the sample image may be a word vector that converts text annotation information describing time, place, and weather of the sample image, which is not limited in the present application.

And 4.3, as shown in fig. 2, constructing a multi-modal feature fusion module, and complementarily fusing the sample text features extracted according to the context information of the sample image, the sample spatial relationship features determined by the second network based on the image recognition model and the sample pooling feature map to obtain a sample sharing feature image. The fusion calculation method can be realized by formula 2 and formula 3:

F_v＝ReLu(F_roi,F_r) Equation 2

F_out＝F_v*F_tEquation 3

Wherein, F_roiRepresenting the fixed-size feature map output after Pooling through the Pooling layer ROI Pooling, F_vRepresenting the original feature map, F_outAnd the shared characteristic image of the sample obtained after the text characteristic of the sample, the spatial relation characteristic of the sample and the pooling characteristic of the sample are fused is represented.

And 4.4, inputting the sample sharing characteristic image into a third network in a preset image recognition model, and determining reference recognition information of each recognizable object, wherein the reference recognition information comprises classification information and reference position information of the recognizable object.

And 4.5, carrying out non-maximum value suppression processing on the reference position information of each identifiable object, filtering the reference position information which does not meet the preset requirement, and determining the prediction identification information of each sample image, wherein the prediction identification information comprises the classification information and the prediction position information of all identifiable objects.

In some embodiments, a Non Maximum Suppression (NMS) process is performed on the reference location information of each type of identifiable object, the NMS obtains the prediction lists arranged according to the scores and iterates the sorted prediction lists, the prediction results with IOU values larger than a predefined threshold are discarded, where the threshold is set to 0.7, candidate frames with a large overlap degree are filtered, and the suppressed location information is determined as the predicted location information.

4.6, calculating a loss value between the predicted identification information and the labeled identification information, optimizing the image identification model according to a target loss function shown in a formula 4, reversely updating network parameters by using a gradient descent algorithm to obtain an updated image identification model, stopping optimization training until a loss function value is smaller than a preset value, and determining the trained image identification model.

Where i denotes the index of the anchor, p_iRepresenting the probability that the ith anchor is predicted to be the target,

a probability of whether the ith anchor is a true sample label of the sample, λ is a parameter representing a weight,

representing the log loss of both classes (target and non-target),

denotes classification loss, t ═ t_x,t_y,t_w,t_hDenotes the offset predicted by the anchor during the RPN training phase (rois during Fast RCNN),

representing the actual offset of the anchor relative to the true tag during the RPN training phase (rois during Fast RCNN),

represents the regression loss.

It should be noted that, in order to improve the accuracy of the image recognition model, the image recognition model may also be continuously trained with new training samples in practical applications, so as to continuously update the image recognition model, improve the accuracy of the image recognition model, and further improve the accuracy of image recognition.

As described above for the specific implementation of the training method of the image recognition model provided in the embodiment of the present application, the image recognition model obtained through the training can be applied to the image recognition method provided in the following embodiment.

The following describes a specific implementation of the image recognition method provided in the present application in detail with reference to fig. 3.

As shown in fig. 3, an embodiment of the present application provides an image recognition method, where the method includes:

s301, an image to be recognized is obtained, and at least one object to be recognized is in the image to be recognized.

In some embodiments, the object to be recognized may be acquired by a vehicle-mounted camera, or a pre-acquired video may be subjected to frame extraction to determine an image to be recognized.

Taking a road traffic scene as an example, the object to be recognized in the image to be recognized may be a pedestrian, a rider, a bicycle, a motorcycle, an automobile, a truck, a bus, a train, a traffic sign, a traffic light, and the like.

S302, inputting the image to be recognized into a first network in a pre-trained image recognition model, and determining the text characteristics of the image to be recognized.

In some embodiments, the image to be recognized is input to a first network in a pre-trained image recognition model, and at least one text vector is determined according to context information of the image to be recognized; and splicing the at least one text vector to determine the text characteristics of the image to be recognized.

It should be noted that the text vector is a word vector determined by converting text annotation information describing time, location, and weather of the sample image according to context information of the image to be recognized based on the first network, so that the environmental information of the image to be recognized can be represented by splicing text features determined by a plurality of text vectors, and differences and commonalities of the image to be recognized in different scenes can be further reflected, so as to enhance the recognition degree of the object to be recognized.

S303, inputting the image to be recognized into a second network in the image recognition model, and determining the pooling characteristic image and the spatial relationship characteristic of at least one object to be recognized.

It should be noted that, when an object to be recognized is recognized, since there is a large amount of redundant information in an image to be recognized, it is necessary to perform convolution processing on the image, and after the image features are determined through the convolution processing, the image recognition model can be trained by using the extracted image features, but this has a high calculation cost, so it is necessary to perform pooling processing on the image to reduce the dimension of the image features, reduce the amount of calculation and the number of parameters, prevent overfitting, and improve the fault tolerance of the model.

On the other hand, the spatial relationship refers to relative spatial position and relative direction relationship between a plurality of target objects segmented from an image, and these relationships can also be classified into a connection relationship, an overlapping/overlapping relationship, and an inclusion/containment relationship. Therefore, the extraction of the spatial relationship features can enhance the distinguishing capability of the image contents.

In some embodiments, determining the pooled feature image and the spatial relationship feature of at least one of the objects to be identified may be performed by:

1. and adjusting the resolution of each sample image in the sample image group to be a preset resolution, and determining the adjusted sample image group.

In this step, the sample images in the training set can be uniformly adjusted to a fixed size of 1000 × 600 pixels.

2. And inputting the adjusted sample image group into a depth residual error network, and determining an original image set, wherein the images in the original image set correspond to the images in the adjusted sample image group one by one.

Specifically, the resized sample image may be input to the 7 × 7 × 64 convolutional layer conv1, and then the original feature map of the sample image may be extracted sequentially through the convolutional layer conv2_ x, conv3_ x, conv4_ x, conv5_ x, and one full connection layer fc.

3. Inputting an original image set into a regional extraction network, and determining N anchor frames and position coordinates corresponding to each anchor frame, wherein the anchor frames are boundary frames which surround recognizable objects and are predicted by the regional extraction network, and N is an integer larger than 1; and extracting M anchor frames with the confidence degrees larger than a preset confidence degree threshold value from the N anchor frames based on the confidence degrees of the N anchor frames, wherein M is a positive integer smaller than N.

As an example, the original feature map output by conv4_ x in the ResNet network structure may be input into the region extraction network RPN, a plurality of anchor frames and candidate frames corresponding to the anchor frames are determined, and based on the confidence of each anchor frame, 300 anchor frames with higher confidence and candidate frames corresponding to the anchor frames are selected.

4. Inputting the mapping region images of the M anchor frames into an interested region pooling layer of the regional convolutional neural network, adjusting the resolution of the mapping region images of the M anchor frames, and determining M sample pooling feature maps with the same resolution, wherein each identifiable object corresponds to at least one anchor frame.

In this step, the position map of the 300 candidate box may be input to the region-of-interest pooling layer in the fast area convolutional neural network according to the original feature map output by conv4_ x, so as to obtain a pooled feature map of a fixed size capable of identifying the object.

S304, performing feature fusion on the text feature of the image to be recognized, the pooled feature image of the at least one object to be recognized and the spatial relationship feature, and determining a shared feature image corresponding to the image to be recognized.

In the above-mentioned S202 and S203, the text feature of the image to be recognized, the pooled feature image of at least one object to be recognized, and the spatial relationship feature are extracted, although spatial relationship features are more sensitive to recognition of rotation, inversion, size changes of an image or a target object in an image and pooling of feature maps can reduce the amount of computation in image recognition, in practical application, however, it is not enough to use only the spatial relationship features and/or the pooling features, and the scene information cannot be effectively and accurately expressed, therefore, the text feature of the image to be recognized, the pooling feature image of at least one object to be recognized and the spatial relationship feature need to be subjected to feature fusion, and a plurality of feature information in the image is fully utilized for information supplement so as to reflect the difference and commonality of the image in different scenes, the method avoids redundant noise and makes up the defects of image characteristic information in details and scenes.

S305, inputting the shared characteristic image into a third network in the image recognition model, and determining the recognition information of the image to be recognized, wherein the recognition information comprises the category information and the position information of each object to be recognized.

According to the image recognition method provided by the embodiment of the application, the text feature of the image to be recognized, the pooling feature image of at least one object to be recognized and the spatial relationship feature are determined through the image recognition model. The multiple characteristic information is complementarily fused, and the identification degree of the object to be identified in the image is enhanced, so that the final image identification performance is optimized, the method is suitable for more complex scenes, and the accuracy of image identification is improved.

In order to verify that the image recognition method provided in the above embodiment can improve the accuracy of image recognition compared with the image recognition method in the prior art, the embodiment of the present application further provides a test method of image recognition, which tests an image recognition model applied in the image recognition method of the present application. Specifically, the following steps may be included:

firstly, inputting a sample image into a trained image recognition model for testing.

Specifically, the average detection precision of all the category target objects is calculated according to the formulas 5 and 6, and the classification and prediction precision of each prediction frame is output:

n represents the number of the categories of the target to be detected, AP represents the average precision, and mAP represents the average precision mean value of all the categories.

Secondly, obtaining a detection result according to the AP and mAP calculation formula, comparing the advantages and disadvantages of an image recognition algorithm using a fast RCNN network algorithm and an image recognition model provided by the embodiment of the application in the prior art, and obtaining a conclusion:

the image recognition method provided by the embodiment of the application is used in a classical image recognition network, the image recognition effect is remarkably improved, even under the condition that the background difference of the image is large, the recognition accuracy of the target object in the image is still maintained at a stable level, and the image recognition method has a better recognition effect compared with the original algorithm.

Specifically, an embodiment of the present invention specifically describes a method for testing an image recognition model provided in the embodiment of the present application through the following simulation experiment.

The prior art adopted in the simulation experiment provided by the application is a Faster regional convolutional neural network fast RCNN; the image recognition model selects a ResNet101 structure to extract image features, an initial learning rate is set to be 0.005, a learning rate attenuation coefficient is set to be 0.1, epoch is set to be 15, and an SGD is selected by a default optimizer.

1. Simulation conditions are as follows: the simulated hardware environment of the application is as follows: intel Core i7-7700@3.60GHz,8G memory; software environment: ubuntu 16.04, python3.7, pycharm 2019.

2. Simulation content and result analysis:

firstly, taking a sample image set as input, introducing text feature extraction, spatial relationship feature extraction and pooling feature map acquisition based on context information on the basis of the traditional fast RCNN algorithm, then fusing the three features into the basic idea of the detection method, training an image recognition model by the aid of the method, inputting a test sample set into the trained improved model, and evaluating the average precision of each category and the average precision of all categories by the AP index.

The BDD100k public driving data set is used for carrying out experiments, simulation experiment results are shown in table 1, and comparison results of a classical fast RCNN algorithm and a multi-modal feature fusion detection method based on context information on the same data set are shown in the table.

TABLE 1 Performance comparison of image recognition methods

As can be seen from the experimental results in table 1, compared with the detection accuracy of the classical fast RCNN algorithm on the test data set, the image identification method provided by the embodiment of the present application improves the average detection accuracy of five types of targets by approximately 4.3% in the tasks of different scenes. Multiple experiments prove that: the multi-modal feature fusion technology utilizes the complementarity among information to enhance the representation of input features, can effectively improve the performance of a target detection algorithm, and obviously improves the average precision in a large number of categories in different image recognition scenes. In a real-life scene, image/video data are high in acquisition difficulty and often lack, so that the traditional target detection method based on images and videos is not applicable, and the image identification method provided by the embodiment of the application can enhance the complementarity among information and has important significance on detection tasks in different scenes.

Based on the same inventive concept of the image identification method, the embodiment of the application also provides an image identification device.

As shown in fig. 4, an embodiment of the present application provides an image recognition apparatus, which may include:

a first obtaining module 401, configured to obtain an image to be identified, where the image to be identified includes at least one object to be identified;

a first determining module 402, configured to input an image to be recognized to a first network in a pre-trained image recognition model, and determine a text feature of the image to be recognized;

a second determining module 403, configured to input the image to be recognized into a second network in the image recognition model, and determine a pooled feature image and a spatial relationship feature of at least one object to be recognized;

the fusion module 404 is configured to perform feature fusion on a text feature of an image to be recognized, a pooled feature image of at least one object to be recognized, and a spatial relationship feature, and determine a shared feature image corresponding to the image to be recognized;

and the identification module 405 is configured to input the shared feature image to a third network in the image identification model, and determine identification information of the image to be identified, where the identification information includes category information and location information of each object to be identified.

In some embodiments, the apparatus may further comprise:

the second acquisition module is used for acquiring a training sample set, the training sample set comprises a plurality of sample image groups, each sample image group comprises a sample image and a corresponding label image, label identification information of a target identification object and scene information of the sample image are marked in the label image, and the label identification information comprises category information and position information of the target identification object;

and the training module is used for training a preset image recognition model by utilizing the sample image group in the training sample set until a training stopping condition is met, so as to obtain the trained image recognition model.

In some embodiments, the training module may be specifically configured to:

for each sample image group, the following steps are respectively executed:

inputting the sample image group into a first network in a preset image recognition model, and determining sample text characteristics corresponding to each sample image;

inputting the sample image group into a second network in a preset image recognition model, and determining a sample pooling feature map and a sample spatial relationship feature of each recognizable object;

performing feature fusion on each sample image according to the sample text features corresponding to each sample image, the sample pooling feature map of each identifiable object and the sample spatial relationship features, and determining a sample sharing feature image corresponding to each sample image;

inputting the sample sharing characteristic image into a third network in a preset image recognition model, and determining reference recognition information of each recognizable object, wherein the reference recognition information comprises classification information and reference position information of the recognizable object;

carrying out non-maximum suppression processing on the reference position information of each identifiable object, filtering the reference position information which does not meet the preset requirement, and determining the prediction identification information of each sample image, wherein the prediction identification information comprises the classification information and the prediction position information of all identifiable objects;

determining a loss function value of a preset image recognition model according to the predicted recognition information of the target sample image and the label recognition information of all target recognition objects on the target sample image, wherein the target sample image is any one of the sample image groups;

and under the condition that the loss function value does not meet the training stopping condition, adjusting the model parameters of the image recognition model, and training the image recognition model after parameter adjustment by using the sample image group until the loss function value meets the training stopping condition to obtain the trained image recognition model.

In some embodiments, the training module may be specifically configured to:

for each sample image, the following steps are respectively executed:

inputting a sample image into a first network in a preset image recognition model, and determining at least one text vector according to context information of the sample image;

and splicing at least one text vector, and determining sample text characteristics corresponding to the sample image.

5. The method according to claim 3, the second network in the pre-set image recognition model comprising at least a depth residual network, a region extraction network and a region convolution neural network,

inputting the sample image group into a second network in a preset image recognition model, and determining a sample pooling feature map and a sample spatial relationship feature of each recognizable object, wherein the method comprises the following steps:

adjusting the resolution of each sample image in the sample image group to be a preset resolution, and determining the adjusted sample image group;

inputting the adjusted sample image group into a depth residual error network, and determining an original image set, wherein images in the original image set correspond to images in the adjusted sample image group one to one;

inputting an original image set into a regional extraction network, and determining N anchor frames and position coordinates corresponding to each anchor frame, wherein the anchor frames are boundary frames which surround recognizable objects and are predicted by the regional extraction network, and N is an integer larger than 1;

extracting M anchor frames with confidence degrees larger than a preset confidence degree threshold value from the N anchor frames based on the confidence degrees of the N anchor frames, wherein M is a positive integer smaller than N;

inputting the mapping region images of the M anchor frames into an interested region pooling layer of a regional convolutional neural network, adjusting the resolution of the mapping region images of the M anchor frames, and determining M sample pooling feature maps with the same resolution, wherein each identifiable object corresponds to at least one anchor frame;

and determining the sample spatial relationship characteristic of each identifiable object according to the intersection ratio and the relative position between at least one anchor frame corresponding to each identifiable object.

In some embodiments, the training module may be specifically configured to:

dividing all identifiable objects into a plurality of groups based on the classification information of each identifiable object, and determining reference position information of the plurality of groups of identifiable objects of different classes;

filtering the reference position information of each type of identifiable objects;

and determining the predicted identification information of each sample image according to the reference position information of the identifiable objects after filtering and the classification information of the identifiable objects after filtering.

In some embodiments, the training module may be specifically configured to:

calculating intersection and comparison between the target frame and other reference frames in sequence, wherein the target frame is any one of a plurality of reference frames, and the reference frame is a boundary frame surrounding the identifiable object determined in the reference position information;

filtering the reference frames with the intersection ratio larger than a preset intersection ratio threshold value until the intersection ratio between any two reference frames is smaller than the preset intersection ratio threshold value;

the reference frame after the filtering is determined as the predicted position information of the recognizable object.

Other details of the image recognition apparatus provided according to the embodiment of the present application are similar to the image recognition method according to the embodiment of the present application described above with reference to fig. 1, and are not repeated herein.

Fig. 5 shows a hardware structure diagram of image recognition provided by an embodiment of the present application.

The image recognition method and the image recognition device provided according to the embodiment of the application, which are described in conjunction with fig. 1 and 4, can be implemented by an image recognition device. Fig. 5 is a diagram illustrating a hardware structure 500 of an image recognition apparatus according to an embodiment of the present invention.

A processor 501 and a memory 502 storing computer program instructions may be included in the image recognition device.

Specifically, the processor 501 may include a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement the embodiments of the present Application.

Memory 502 may include mass storage for data or instructions. By way of example, and not limitation, memory 502 may include a Hard Disk Drive (HDD), a floppy Disk Drive, flash memory, an optical Disk, a magneto-optical Disk, tape, or a Universal Serial Bus (USB) Drive or a combination of two or more of these. In one example, memory 502 can include removable or non-removable (or fixed) media, or memory 502 is non-volatile solid-state memory. The memory 502 may be internal or external to the integrated gateway disaster recovery device.

In one example, the Memory 502 may be a Read Only Memory (ROM). In one example, the ROM may be mask programmed ROM, programmable ROM (prom), erasable prom (eprom), electrically erasable prom (eeprom), electrically rewritable ROM (earom), or flash memory, or a combination of two or more of these.

The processor 501 reads and executes the computer program instructions stored in the memory 502 to implement the methods/steps S301 to S305 in the embodiment shown in fig. 3, and achieve the corresponding technical effects achieved by the method/steps executed by the example shown in fig. 3, which are not described herein again for brevity.

In one example, the image recognition device can also include a communication interface 503 and a bus 510. As shown in fig. 5, the processor 501, the memory 502, and the communication interface 503 are connected via a bus 510 to complete communication therebetween.

The communication interface 503 is mainly used for implementing communication between modules, apparatuses, units and/or devices in the embodiments of the present application.

Bus 510 comprises hardware, software, or both to couple the components of the online data traffic billing device to each other. By way of example, and not limitation, a Bus may include an Accelerated Graphics Port (AGP) or other Graphics Bus, an Enhanced Industry Standard Architecture (EISA) Bus, a Front-Side Bus (Front Side Bus, FSB), a Hyper Transport (HT) interconnect, an Industry Standard Architecture (ISA) Bus, an infiniband interconnect, a Low Pin Count (LPC) Bus, a memory Bus, a Micro Channel Architecture (MCA) Bus, a Peripheral Component Interconnect (PCI) Bus, a PCI-Express (PCI-X) Bus, a Serial Advanced Technology Attachment (SATA) Bus, a video electronics standards association local (VLB) Bus, or other suitable Bus or a combination of two or more of these. Bus 510 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.

The image recognition equipment provided by the embodiment of the application realizes the complementation of image information through feature fusion, overcomes the defects of image feature information on details and scenes while avoiding redundant noise, makes full use of various feature information in images to supplement information, and simultaneously extracts text features, can reflect the difference and the commonality of the images in different scenes, and further can be suitable for more complex scenes and improve the accuracy of image recognition.

In addition, in combination with the image recognition method in the foregoing embodiments, the embodiments of the present application may be implemented by providing a computer storage medium. The computer storage medium having computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement any of the image recognition methods in the above embodiments.

It is to be understood that the present application is not limited to the particular arrangements and instrumentality described above and shown in the attached drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present application are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications, and additions or change the order between the steps after comprehending the spirit of the present application.

The functional blocks shown in the above-described structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic Circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the present application are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.

It should also be noted that the exemplary embodiments mentioned in this application describe some methods or systems based on a series of steps or devices. However, the present application is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.

Aspects of the present application are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions/acts specified in the flowchart and/or block diagram block or blocks. Such a processor may be, but is not limited to, a general purpose processor, a special purpose processor, an application specific processor, or a field programmable logic circuit. It will also be understood that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware for performing the specified functions or acts, or combinations of special purpose hardware and computer instructions.

As described above, only the specific embodiments of the present application are provided, and it can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the module and the unit described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. It should be understood that the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered within the scope of the present application.

Claims

1. An image recognition method, characterized in that the method comprises:

acquiring an image to be recognized, wherein the image to be recognized is provided with at least one object to be recognized;

inputting the image to be recognized into a first network in a pre-trained image recognition model, and determining text characteristics of the image to be recognized;

inputting the image to be recognized into a second network in the image recognition model, and determining a pooling characteristic image and a spatial relationship characteristic of the at least one object to be recognized;

performing feature fusion on the text feature of the image to be recognized, the pooling feature image of the at least one object to be recognized and the spatial relationship feature, and determining a shared feature image corresponding to the image to be recognized;

inputting the shared characteristic image into a third network in the image recognition model, and determining the recognition information of the image to be recognized, wherein the recognition information comprises the category information and the position information of each object to be recognized.

2. The method of claim 1, wherein prior to said acquiring an image to be identified, the method further comprises:

acquiring a training sample set, wherein the training sample set comprises a plurality of sample image groups, each sample image group comprises a sample image and a corresponding label image, label identification information of a target identification object and scene information of the sample image are marked in the label image, and the label identification information comprises category information and position information of the target identification object;

and training a preset image recognition model by using the sample image group in the training sample set until a training stopping condition is met, and obtaining the trained image recognition model.

3. The method according to claim 2, wherein the training of the image recognition model using the sample image group in the training sample set until a training stop condition is satisfied to obtain the trained image recognition model specifically comprises:

for each sample image group, respectively executing the following steps:

performing feature fusion on each sample image according to the sample text feature corresponding to each sample image, the sample pooling feature map of each identifiable object and the sample spatial relationship feature, and determining a sample sharing feature image corresponding to each sample image;

determining a loss function value of the preset image recognition model according to the predicted recognition information of a target sample image and the label recognition information of all target recognition objects on the target sample image, wherein the target sample image is any one of the sample image groups;

4. The method of claim 3, wherein inputting the set of sample images to a first network in a predetermined image recognition model, determining a sample text feature corresponding to each of the sample images, comprises:

for each sample image, respectively executing the following steps:

inputting the sample image into a first network in the preset image recognition model, and determining at least one text vector according to the context information of the sample image;

and splicing the at least one text vector, and determining sample text features corresponding to the sample image.

5. The method of claim 3, wherein the second network in the pre-set image recognition model comprises at least a depth residual network, a region extraction network, and a region convolution neural network,

inputting the adjusted sample image group into the depth residual error network, and determining an original image set, wherein images in the original image set correspond to images in the adjusted sample image group one to one;

inputting the original image set into the regional extraction network, and determining N anchor frames and position coordinates corresponding to each anchor frame, wherein the anchor frames are boundary frames which are predicted by the regional extraction network and surround the identifiable object, and N is an integer greater than 1;

inputting the mapping region images of the M anchor frames into a region-of-interest pooling layer of the regional convolutional neural network, adjusting the resolution of the mapping region images of the M anchor frames, and determining M sample pooling feature maps with the same resolution, wherein each identifiable object corresponds to at least one anchor frame;

6. The method of claim 3, wherein the performing non-maximum suppression processing on the reference position information of each identifiable object, filtering the reference position information which does not meet preset requirements, and determining the predicted identification information of each sample image comprises:

7. The method according to claim 6, wherein the filtering the reference location information of each identifiable object type comprises:

sequentially calculating the intersection ratio between a target frame and other reference frames, wherein the target frame is any one of a plurality of reference frames, and the reference frame is a boundary frame surrounding the identifiable object determined in the reference position information;

8. An image recognition apparatus, characterized in that the apparatus comprises:

the device comprises a first acquisition module, a second acquisition module and a recognition module, wherein the first acquisition module is used for acquiring an image to be recognized, and the image to be recognized is provided with at least one object to be recognized;

the second determination module is used for inputting the image to be recognized into a second network in the image recognition model and determining the pooling characteristic image and the spatial relation characteristic of the at least one object to be recognized;

the fusion module is used for performing feature fusion on the text feature of the image to be recognized, the pooling feature image of the at least one object to be recognized and the spatial relationship feature, and determining a shared feature image corresponding to the image to be recognized;

and the identification module is used for inputting the shared characteristic image to a third network in the image identification model and determining identification information of the image to be identified, wherein the identification information comprises category information and position information of each object to be identified.

9. An image recognition apparatus, characterized in that the apparatus comprises: a processor, and a memory storing computer program instructions; the processor reads and executes the computer program instructions to implement the image recognition method of any one of claims 1-7.

10. A computer storage medium having computer program instructions stored thereon which, when executed by a processor, implement the image recognition method of any one of claims 1-7.