CN115331150A - Image recognition method, image recognition device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115331150A
CN115331150A (application CN202211041337.8A)
Authority
CN
China
Prior art keywords
image
text
trained
sample
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211041337.8A
Other languages
Chinese (zh)
Inventor
张恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202211041337.8A priority Critical patent/CN115331150A/en
Publication of CN115331150A publication Critical patent/CN115331150A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/47 - Detecting features for summarising video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 - Proximity, similarity or dissimilarity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to an image recognition method, apparatus, electronic device, storage medium and computer program product. The method includes: in response to an image recognition request carrying an object description text, extracting target text features corresponding to the object description text; acquiring a feature map corresponding to an image to be recognized, and identifying, from the feature map, region position information and region image features corresponding to the image region where at least one object is located; if a target region image feature matching the target text features exists among the region image features, taking the object corresponding to the target region image feature as the target object described by the object description text; and generating an object recognition result of the image to be recognized according to the region position information corresponding to the target object and the text identification information obtained based on the object description text. With this method, image recognition can be performed on objects in different fields, object elements in video data are effectively extracted, and image recognition efficiency is improved.

Description

Image recognition method, image recognition device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to an image recognition method, an image recognition apparatus, an electronic device, a storage medium, and a computer program product.
Background
With the rapid development of the Internet, a large amount of video resources are generated, and in order to process videos in more detail, content understanding needs to be performed on video data. In a fine-grained content understanding process for video data, object elements in a video need to be identified and extracted. In the traditional method, a video label system is usually constructed for a specific field, and training data are manually annotated so that the corresponding video elements can be extracted from the original video data. This is time-consuming and labor-intensive, and the recognition capability is limited to that field and does not generalize.
Therefore, the related art has the problem that the efficiency of identifying and extracting the object elements in the video is low.
Disclosure of Invention
The present disclosure provides an image recognition method, an image recognition apparatus, an electronic device, a storage medium, and a computer program product, so as to at least solve the problem in the related art that the efficiency of recognizing and extracting object elements in a video is low. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided an image recognition method, including:
responding to an image identification request carrying an object description text, and extracting a target text characteristic corresponding to the object description text;
acquiring a feature map corresponding to an image to be identified, and identifying area position information and area image features corresponding to an image area where at least one object is located from the feature map; the regional image features are image features of a foreground region;
if each regional image feature has a target regional image feature matched with the target text feature, taking an object corresponding to the target regional image feature as a target object described by the object description text;
and generating an object recognition result of the image to be recognized according to the area position information corresponding to the target object and the text identification information obtained based on the object description text.
In a possible implementation manner, the extracting, in response to an image recognition request carrying an object description text, a target text feature corresponding to the object description text includes:
responding to an image recognition request, and acquiring an object description text from the image recognition request;
inputting the object description text into a pre-trained text coding model to obtain the target text characteristics; the pre-trained text coding model is obtained by training based on the matched sample text and sample image in combination with the text coding model to be trained and the image coding model to be trained.
In a possible implementation manner, the obtaining a feature map corresponding to an image to be recognized, and recognizing, from the feature map, area position information and area image features corresponding to an image area where at least one object is located, includes:
acquiring an image to be recognized, and inputting the image to be recognized into a pre-trained feature map generation model to obtain the feature map;
inputting the characteristic diagram into a pre-trained object region identification model to obtain region position information and region image characteristics corresponding to an image region where the at least one object is located;
the pre-trained feature map generation model is obtained by training in combination with the pre-trained object region identification model and the pre-trained image coding model; the pre-trained image coding model is used for outputting first image characteristics in a training stage of a characteristic diagram generation model so as to adjust second image characteristics processed by the characteristic diagram generation model to be trained and the pre-trained object region identification model; the pre-trained image coding model is obtained by training based on the matched sample text and sample image in combination with the text coding model to be trained and the image coding model to be trained.
In a possible implementation manner, if each of the area image features has a target area image feature that matches the target text feature, taking an object corresponding to the target area image feature as a target object described by the object description text, includes:
acquiring a mapping relation between an image feature space and a text feature space; the mapping relation is obtained according to a pre-trained text coding model and a pre-trained image coding model;
and determining the regional image features matched with the target text features as the target regional image features according to the mapping relation, and taking the objects corresponding to the target regional image features as the target objects.
In one possible implementation, the pre-trained image coding model and the pre-trained text coding model are trained by:
acquiring first training data; the first training data comprise positive samples formed by matched sample texts and sample images and negative samples formed by unpaired sample texts and sample images, and the matched sample texts and sample images comprise texts and images of objects in different fields;
inputting the first training data into an image coding model to be trained for coding to obtain a sample image feature queue and sample image features corresponding to the positive samples;
inputting the first training data into a text coding model to be trained for coding to obtain a sample text feature queue and sample text features corresponding to the positive samples;
and performing model training on the image coding model to be trained and the text coding model to be trained based on the sample text features corresponding to the positive sample and the sample image feature queue, and the sample image features corresponding to the positive sample and the sample text feature queue to obtain the pre-trained image coding model and the pre-trained text coding model.
In one possible implementation manner, the performing model training on the image coding model to be trained and the text coding model to be trained based on the sample text feature and the sample image feature queue corresponding to the positive sample and the sample image feature and the sample text feature queue corresponding to the positive sample to obtain the pre-trained image coding model and the pre-trained text coding model includes:
determining a first similarity according to the sample text features corresponding to the positive sample and the sample image feature queue, and determining a second similarity according to the sample image features corresponding to the positive sample and the sample text feature queue;
determining a target loss value according to the first similarity and the second similarity;
adjusting model parameters in the image coding model to be trained and model parameters in the text coding model to be trained according to the target loss value until a first training end condition is met to obtain the pre-trained image coding model and the pre-trained text coding model; the first training end condition includes maximizing a first comparison result for the same paired sample text and sample image, and minimizing a second comparison result for an unrelated feature in the unpaired sample text and sample image.
In one possible implementation, the pre-trained feature map generation model is obtained by training:
acquiring second training data; the second training data comprises sample image data carrying labeling box information;
inputting the sample image data into a feature map generation model to be trained to obtain a sample feature map, and inputting the sample feature map into the pre-trained object region identification model to obtain the second image feature;
obtaining an annotated region image obtained by cutting the sample image data according to the annotated frame information, and inputting the annotated region image into the pre-trained image coding model to obtain the first image characteristic;
determining an alignment loss value according to the alignment result of the first image characteristic and the second image characteristic;
adjusting model parameters in the feature map generation model to be trained according to the alignment loss value until a second training end condition is met, to obtain the pre-trained feature map generation model; the second training end condition includes that the first image feature and the second image feature are aligned in a feature space.
In a possible implementation manner, before the step of obtaining a feature map corresponding to an image to be recognized, and recognizing, from the feature map, area position information and area image features corresponding to an image area where at least one object is located, the method further includes:
acquiring a video to be processed, and determining a target video frame set from the video to be processed;
taking each target video frame in the target video frame set as the image to be identified;
after the step of generating an object recognition result of the image to be recognized according to the region position information corresponding to the target object and the text identification information obtained based on the object description text, the method further includes:
and obtaining an object identification result of the video to be processed according to the object identification results obtained from the target video frames.
According to a second aspect of the embodiments of the present disclosure, there is provided an image recognition apparatus including:
the text feature extraction unit is configured to execute the steps of responding to an image identification request carrying an object description text and extracting a target text feature corresponding to the object description text;
the image feature recognition unit is configured to acquire a feature map corresponding to an image to be recognized, and recognize area position information and area image features corresponding to an image area where at least one object is located from the feature map; the regional image features are image features of a foreground region;
the feature matching unit is configured to execute, if each of the area image features has a target area image feature matched with the target text feature, taking an object corresponding to the target area image feature as a target object described by the object description text;
and the object recognition result generation unit is configured to execute the generation of the object recognition result of the image to be recognized according to the area position information corresponding to the target object and the text identification information obtained based on the object description text.
According to a third aspect of an embodiment of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the image recognition method as described in any one of the above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform the image recognition method according to any one of the above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the image recognition method according to any one of the above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
according to the scheme, the target text characteristics corresponding to the object description texts are extracted in response to the image identification requests carrying the object description texts, then the characteristic graphs corresponding to the images to be identified are obtained, the area position information and the area image characteristics corresponding to the image areas where at least one object is located are identified from the characteristic graphs, the area image characteristics are the image characteristics of the foreground areas, if the image characteristics of the areas have the target area image characteristics matched with the target text characteristics, the object corresponding to the image characteristics of the target areas is used as the target object described by the object description texts, and then the object identification results of the images to be identified are generated according to the area position information corresponding to the target object and the text identification information obtained based on the object description texts. Therefore, the target text features can be extracted based on the received object description text, and then the matched image features are identified according to the target text features so as to generate an object identification result of the image to be identified, the image identification can be performed on objects in different fields, the object elements in the video data are effectively extracted, manual marking on the video data in each field is not needed for training, and the image identification efficiency is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow diagram illustrating an image recognition method according to an exemplary embodiment.
FIG. 2 is a schematic diagram illustrating an image recognition process according to an exemplary embodiment.
FIG. 3a is a schematic diagram illustrating a multi-modal model training in accordance with an exemplary embodiment.
FIG. 3b is a schematic diagram illustrating a combined model training in accordance with an exemplary embodiment.
Fig. 4 is a schematic diagram illustrating a video data recognition process in accordance with an exemplary embodiment.
FIG. 5 is a flow chart illustrating another method of image recognition according to an exemplary embodiment.
Fig. 6 is a block diagram illustrating an image recognition apparatus according to an exemplary embodiment.
FIG. 7 is a block diagram illustrating an electronic device according to an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in other sequences than those illustrated or described herein. The implementations described in the exemplary embodiments below do not represent all implementations consistent with the present disclosure.
It should be further noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for presentation, analyzed data, etc.) referred to in the present disclosure are information and data authorized by the user or sufficiently authorized by each party.
Fig. 1 is a flow chart illustrating an image recognition method according to an exemplary embodiment. As shown in fig. 1, the method may be used in a server and includes the following steps. In this embodiment, the terminal may interact with the server through the network: the terminal may send the image to be recognized to the server, and the server executes the image recognition method provided by the present disclosure to process the received image. Of course, the image recognition method in the present disclosure may also be applied to a terminal, that is, the terminal may execute the image recognition method in the present disclosure to process an image stored in the terminal.
As an example, the terminal may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, Internet of Things devices, and portable wearable devices, where the portable wearable devices may be smart watches, smart bands, head-mounted devices, and the like. The server may be an independent physical server, a server cluster composed of a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud computing, cloud databases, cloud storage, and CDN services.
In step S110, in response to the image identification request carrying the object description text, extracting a target text feature corresponding to the object description text;
the image recognition request may be a request for recognizing an object shown in an image, and the image recognition request may carry an object description text, where the object description text may be used to represent an object to be recognized and extracted, and if the object description text is the name of the object a, the image in which the object a exists may be recognized and the object element of the object a in the image may be extracted in response to the image recognition request.
In practical application, in response to an image recognition request, an object description text carried in the image recognition request can be acquired, feature extraction can be performed according to the object description text, text features corresponding to the object description text are acquired and serve as target text features, image recognition is further performed based on the target text features, and image object elements corresponding to the target text features are extracted.
Specifically, as shown in fig. 2, an input text may be obtained from the object description text, and the input text may then be encoded by a pre-trained text coding model to obtain the text features corresponding to the object description text. The object description text may include one or more object names, so text features (i.e., target text features) corresponding to the one or more object names can be obtained. For example, an input text obtained from an object description text containing the name of object A and the name of object B is input into the pre-trained text coding model, and text features for the name of object A and text features for the name of object B are output, i.e., B1…Bn and N1…Nk in fig. 2.
In step S120, a feature map corresponding to the image to be recognized is obtained, and area position information and area image features corresponding to the image area where the at least one object is located are recognized from the feature map; the regional image features are image features of the foreground region;
as an example, the image to be recognized may be a single-frame image obtained by extracting frames from a video to be processed, so as to extract object elements included in the video to be processed by performing image recognition on a plurality of single-frame images of the video to be processed, where the image to be recognized may include object objects in different fields.
In specific implementation, a feature map corresponding to an image to be recognized may be obtained for the image to be recognized, feature extraction may be performed according to the feature map, and region position information and region image features corresponding to an image region where at least one object is located may be recognized, for example, the image region where the at least one object is located may be used as a foreground region, and position frame position information and foreground image features may be obtained based on the foreground region.
For example, as shown in fig. 2, a pre-trained feature map generation model and a pre-trained object region identification model may be used to process the image to be recognized to obtain the image features corresponding to the image region where each recognized object is located. When a plurality of objects are recognized, the foreground image features (i.e., region image features) of the image regions where the plurality of objects are located can be obtained. For example, an image to be recognized containing object A and object B is input into the pre-trained feature map generation model and the pre-trained object region identification model, and image feature resizing and convolution processing are performed on the output image features to obtain the foreground image features of the image region where object A is located and the foreground image features of the image region where object B is located, i.e., R1 and R2 in fig. 2.
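For readers who want a concrete picture of this step, a rough sketch follows, assuming a PyTorch-style implementation; the model objects, the roi_align output size, and the region head are illustrative stand-ins and are not specified by this disclosure.
```python
import torch
import torchvision.ops as ops

def extract_region_features(image, feature_map_model, region_model, region_head, roi_size=7):
    # Feature map corresponding to the image to be recognized.
    feat_map = feature_map_model(image.unsqueeze(0))           # (1, C, H, W)
    # Foreground boxes (region position information); the region_model API is hypothetical.
    boxes = region_model(feat_map)                              # (K, 4) in (x1, y1, x2, y2)
    # Map each box onto the feature map and resize it to a fixed spatial size.
    rois = ops.roi_align(feat_map, [boxes], output_size=roi_size)
    # Convolve / project the resized features into per-region image features.
    region_feats = region_head(rois).flatten(1)                 # (K, feature_dim)
    return boxes, region_feats
```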
In step S130, if there is a target area image feature matching the target text feature in each area image feature, taking an object corresponding to the target area image feature as a target object described in the object description text;
after the regional image features corresponding to the image region where the at least one object is located are obtained, feature matching can be performed on the regional image features according to the target text features, further, the regional image features matched with the target text features can be determined as the target regional image features, and the object corresponding to the target regional image features can be used as the target object described by the object description text.
Specifically, similarity calculation may be performed between the target text features and each foreground image feature (i.e., region image feature) obtained after the image to be recognized has been processed by the pre-trained feature map generation model and the pre-trained object region identification model; the foreground image features matching the target text features can then be determined according to the similarity. As shown in fig. 2, feature matching results N1R1 and NkR2 can be obtained by matching image and text features based on similarity. Based on the objects corresponding to the target region image features (e.g., R1, R2) matched by the target text features (e.g., N1, Nk), the target object described by the object description text can be determined, and the text identification information corresponding to the target object can be obtained from the object description text (e.g., text identification information 1 corresponding to R1N1, and text identification information 2 corresponding to R2Nk).
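As an illustration of the matching step just described, the sketch below computes cosine similarities between region image features and target text features and keeps the best match above a threshold; the threshold value and the result layout are assumptions, not requirements of this disclosure.
```python
import torch
import torch.nn.functional as F

def match_regions_to_texts(region_feats, text_feats, boxes, text_ids, threshold=0.3):
    # Rows: region image features (R1, R2, ...); columns: target text features (N1, ..., Nk).
    sim = F.normalize(region_feats, dim=-1) @ F.normalize(text_feats, dim=-1).T
    results = []
    for k in range(sim.size(0)):
        n = int(sim[k].argmax())
        if float(sim[k, n]) >= threshold:
            # Region position information plus text identification information for the target object.
            results.append({"box": boxes[k].tolist(),
                            "text_id": text_ids[n],
                            "score": float(sim[k, n])})
    return results
```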
In step S140, an object recognition result of the image to be recognized is generated according to the region position information corresponding to the target object and the text identification information obtained based on the object description text.
As an example, the text identification information corresponding to the target object may be obtained based on an object description text describing the target object, which may be used to identify the description text corresponding to the target object.
In practical application, an object recognition result of an image to be recognized can be generated according to the area position information and the text identification information corresponding to the recognized target object, so that the image of the target object described by the object description text can be recognized based on the object description text carried in the image recognition request, the object element of the target object in the image is extracted, and the effect of recognizing the object element of a single-frame image is achieved.
In one example, a traditional method that constructs a video tag system for a specific field has a recognition capability limited to that field and does not generalize. In contrast, the technical scheme of the present disclosure can process images to be recognized that contain objects from different fields, and recognize and extract the object elements in them, so it is applicable to automatically extracting object elements from video data in different fields, has strong generalization capability, and improves image recognition efficiency.
In the above image recognition method, target text features corresponding to the object description text are extracted in response to an image recognition request carrying the object description text. A feature map corresponding to the image to be recognized is obtained, and the region position information and region image features corresponding to the image region where at least one object is located are identified from the feature map. If a target region image feature matching the target text features exists among the region image features, the object corresponding to the target region image feature is taken as the target object described by the object description text, and an object recognition result of the image to be recognized is then generated according to the region position information corresponding to the target object and the text identification information obtained based on the object description text. In this way, target text features can be extracted from the received object description text and matching image features can be recognized according to them to generate the object recognition result; image recognition can be performed on objects in different fields, object elements in video data are effectively extracted, no manual annotation of video data in each field is needed for training, and image recognition efficiency is improved.
In an exemplary embodiment, in response to an image recognition request carrying an object description text, extracting a target text feature corresponding to the object description text includes: responding to the image recognition request, and acquiring an object description text from the image recognition request; and inputting the object description text into a pre-trained text coding model to obtain target text characteristics.
The pre-trained text coding model may be obtained by training a text coding model to be trained and an image coding model to be trained based on paired sample texts and sample images, and the training process may be as shown in fig. 3 a.
In specific implementation, an object description text carried in an image recognition request can be acquired in response to the image recognition request, and then the object description text can be input into a pre-trained text coding model, and a target text feature can be obtained by text coding the object description text.
In an example, a template text may be preset, a combined text may be obtained from the preset template text and the object description text, and the combined text may then be input into the pre-trained text coding model. The template text may be set in the following manner:
"this is a picture of {XXX}"
where XXX may be replaced with the object description text, such as words or an object name describing the target object element. As shown in fig. 2, the combined text obtained from the preset template text and the object description text may serve as the input text; by inputting the input text into the pre-trained text coding model, a text coding result for the object description text may be output, which may be used to represent the text features of the object description text, thereby obtaining the target text features corresponding to the object description text.
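A minimal sketch of this template-plus-encoding step is given below, assuming a PyTorch text encoder with a tokenizer; both objects are hypothetical placeholders, since the disclosure does not fix a concrete text coding model.
```python
import torch
import torch.nn.functional as F

TEMPLATE = "this is a picture of {}"

def encode_object_descriptions(object_names, tokenizer, text_encoder):
    # Combine the preset template text with each object description text.
    prompts = [TEMPLATE.format(name) for name in object_names]
    tokens = tokenizer(prompts)                 # hypothetical tokenizer call
    with torch.no_grad():
        feats = text_encoder(tokens)            # (num_names, dim) text coding results
    # Normalised target text features, one per object name.
    return F.normalize(feats, dim=-1)
```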
According to the technical scheme, the object description text is obtained from the image recognition request in response to the image recognition request, then the object description text is input to the pre-trained text coding model, the target text characteristic is obtained, the target text characteristic can be determined according to the description text input by the user, and accurate text characteristic indication is provided for subsequent image recognition.
In an exemplary embodiment, acquiring a feature map corresponding to an image to be recognized, and recognizing area position information and area image features corresponding to an image area where at least one object is located from the feature map, includes: acquiring an image to be recognized, and inputting the image to be recognized into a pre-trained feature map generation model to obtain a feature map; and inputting the characteristic diagram into a pre-trained object region recognition model to obtain region position information and region image characteristics corresponding to the image region where at least one object is located.
The pre-trained feature map generation model may be obtained by training in combination with the pre-trained object region identification model and the pre-trained image coding model, and the training process may be as shown in fig. 3 b.
As an example, the pre-trained image coding model may be obtained by training, based on the paired sample text and sample image, the text coding model to be trained and the image coding model to be trained, and the training process may be as shown in fig. 3a, and may be used to output a first image feature in a training stage of the feature map generation model, so as to adjust a second image feature processed by the feature map generation model to be trained and the pre-trained object region recognition model.
In practical application, an image to be recognized can be obtained, the image to be recognized is input into a pre-trained feature map generation model, a feature map corresponding to the image to be recognized is obtained, then the feature map is input into a pre-trained object region recognition model, an image region where at least one object is located can be recognized based on the feature map and serves as a foreground region, and then position frame position information and foreground image features, namely region position information and region image features, can be obtained according to the foreground region.
In an optional embodiment, as shown in fig. 2, for the result obtained after the image to be recognized has been processed by the pre-trained feature map generation model and the pre-trained object region identification model, that is, the output image features, image feature resizing may be performed; for example, a feature image of a specified size may be generated by mapping from the position frame of each identified foreground region. Convolution may then be performed on each resized feature image to obtain the foreground image features corresponding to each foreground region, for example R1 and R2 in fig. 2.
According to the technical scheme, the image to be recognized is acquired, the image to be recognized is input into the pre-trained feature map generation model to obtain the feature map, the feature map is input into the pre-trained object region recognition model to obtain the region position information and the region image features corresponding to the image region where the at least one object is located, the image features can be recognized and extracted aiming at the images of the object objects in different fields, and effective data to be matched are provided for subsequent image recognition.
In an exemplary embodiment, if each of the area image features has a target area image feature matching the target text feature, taking an object corresponding to the target area image feature as a target object described by the object description text, includes: acquiring a mapping relation between an image feature space and a text feature space; and determining the regional image features matched with the target text features as the target regional image features according to the mapping relation, and taking the objects corresponding to the target regional image features as the target objects.
The mapping relationship can be obtained according to a pre-trained text coding model and a pre-trained image coding model.
As an example, the image feature space may be a feature space formed by image features obtained based on object objects in the image, and the text feature space may be a feature space formed by text features obtained based on description information corresponding to the object objects in the image.
In practical application, a mapping relationship between an image feature space and a text feature space can be obtained according to a pre-trained text coding model and a pre-trained image coding model, the mapping relationship can be determined based on a similarity value between a matched image feature and a text feature, further, a similarity calculation can be performed on each regional image feature and a target text feature, and a regional image feature matched with the target text feature is determined according to the mapping relationship, for example, when the target text feature contains a description text feature for one or more objects, an object described by each description text feature can be determined according to an image feature corresponding to each matched description text feature.
According to the technical scheme of the embodiment, the mapping relation between the image feature space and the text feature space is obtained, the target text features are used as the text features to be matched in the text feature space, the region image features matched with the target text features are determined according to the mapping relation and are used as the target region image features, the objects corresponding to the target region image features are used as the target objects, the corresponding object objects can be accurately extracted based on the target text features to be matched, and the method and the device can be suitable for object elements in different fields.
In an exemplary embodiment, the pre-trained image coding model and the pre-trained text coding model are trained by: acquiring first training data; inputting the first training data into an image coding model to be trained for coding to obtain a sample image feature queue and sample image features corresponding to the positive sample; inputting the first training data into a text coding model to be trained for coding to obtain a sample text feature queue and sample text features corresponding to the positive samples; and performing model training on the image coding model to be trained and the text coding model to be trained on the basis of the sample text features and the sample image feature queue corresponding to the positive sample and the sample image features and the sample text feature queue corresponding to the positive sample to obtain a pre-trained image coding model and a pre-trained text coding model.
The first training data may include a positive sample composed of paired sample texts and sample images, and a negative sample composed of unpaired sample texts and sample images, where the paired sample texts and sample images may include texts and images of different domain objects.
As an example, when the first training data is obtained, a paired sample text and sample image may be crawled from a network resource provided by a network forum or a social platform, or an image resource or a video resource uploaded to the social platform by a user, for example, the sample image may be obtained and a title corresponding to the sample image is used as the paired sample text, or the sample image may be obtained from the video resource, and then a title or a title corresponding to the video resource is used as the paired sample text, or the paired sample text may be extracted from a document corresponding to the sample image, where an extraction manner of the paired data is not specifically limited in this embodiment.
In an example, the text feature may be a feature vector corresponding to object description information in the sample text, and the text coding model to be trained may be a text encoder constructed based on a CNN (Convolutional Neural Networks) or an RNN (Recurrent Neural Networks); the image features can be feature vectors corresponding to the sample images, and the image coding model to be trained can be an image coder constructed based on a CNN network; a multimodal model can be constructed in conjunction with a text encoder and an image encoder.
In a specific implementation, as shown in fig. 3a, a preset number (batch) of first training data may be input, the preset number may be adjusted in a model training process, and by regarding sample texts and sample images that are paired with each other in the first training data as positive samples and sample images that are not paired with each other as negative samples, corresponding text features and image features may be obtained by using the first training data through a text encoder and an image encoder; the sample text feature queue and the sample image feature queue can be preset, and can be used for respectively storing text features obtained by a text encoder of a sample text in a negative sample and image features obtained by an image encoder of a sample image in a negative sample, simultaneously placing text features obtained by the text encoder of the sample text in a positive sample into the sample text feature queue, and placing image features obtained by the image encoder of the sample image in the positive sample into the sample image feature queue, wherein the queue length can be determined by computing resources.
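The queue mechanism described above can be pictured as a fixed-length FIFO buffer of encoded features; the sketch below is one possible realisation in PyTorch and is not taken from the disclosure (the queue length, in particular, would be chosen according to available computing resources).
```python
import torch
import torch.nn.functional as F

class FeatureQueue:
    """Fixed-length FIFO queue of sample features used as negatives in contrastive training."""
    def __init__(self, dim, length=4096):
        self.feats = F.normalize(torch.randn(length, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, new_feats):
        # Overwrite the oldest entries with the newly encoded positive-sample features.
        idx = (self.ptr + torch.arange(new_feats.size(0))) % self.feats.size(0)
        self.feats[idx] = F.normalize(new_feats, dim=-1)
        self.ptr = int((self.ptr + new_feats.size(0)) % self.feats.size(0))

# One queue per modality, e.g. text_queue = FeatureQueue(512); image_queue = FeatureQueue(512)
```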
In an example, when the first training data is obtained, a plurality of paired sample texts and sample images may be obtained, and each paired sample text and sample image may then be subjected to data cleaning. For example, the data may be cleaned in units of paired image-text pairs: if the image quality is lower than a preset quality requirement (for example, the image size or image definition is too low), if the text is too long or too short (that is, the number of characters is greater or less than a preset threshold), or if the text contains special characters or sensitive keywords, the image-text pair may be filtered out as a whole, and the first training data is obtained from the image-text pairs that satisfy the data specification after filtering. In this way, data cleaning can improve the reliability of the image-text pairs in the first training data and ensure the processing accuracy of the text encoder and the image encoder.
In another example, in order to improve the model expression capability of the image coding model to be trained and the text coding model to be trained, data enhancement may be performed on the paired image text pairs obtained after data cleaning, so that the data training model is expanded based on a data enhancement mode, the scale of the training data is increased, and the multi-modal model effect can be further improved.
For example, when a sample image is obtained from a video, such as a video cover, the same video may be subjected to frame extraction processing, and an image similar to the obtained sample image may be obtained from the video by using at least one frame extraction strategy, such as uniform frame extraction, frame extraction at intervals of a fixed number of frames, and frame extraction according to an inter-frame difference, so that a plurality of video images in the video cover are obtained by performing frame extraction processing on the video, and the plurality of video images may correspond to the same video title, that is, a plurality of paired image text pairs may be obtained.
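As an example of the fixed-interval frame extraction strategy mentioned above, the following sketch uses OpenCV (an assumption; the disclosure does not name a library) to pull one frame every N frames, each of which can then be paired with the same video title.
```python
import cv2

def extract_frames(video_path, every_n=30):
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:      # fixed-interval frame extraction
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```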
For another example, when the sample image is obtained by directly crawling data, at least one transformation operation may be performed on the sample image to obtain a transformed image, such as a rotation operation, a flipping operation, a scaling transformation, a translation transformation, a scale transformation, a noise disturbance, a color transformation, and an occlusion operation, so that a plurality of images with the same image content or higher similarity may be obtained by performing different transformation operations on the sample image.
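The image-side transformations listed above can be composed into a single augmentation pipeline; the torchvision-based sketch below is illustrative, and the specific operations and parameters are assumptions rather than requirements of the disclosure.
```python
from torchvision import transforms

# Each pass through this pipeline yields a new image that is paired with the same sample text.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),            # rotation operation
    transforms.RandomHorizontalFlip(),                # flipping operation
    transforms.RandomResizedCrop(224),                # scaling / translation-style transformation
    transforms.ColorJitter(0.4, 0.4, 0.4),            # color transformation
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),                  # occlusion operation
])
```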
For another example, for a text in a paired image text, data enhancement of the text may be implemented through at least one operation, such as replacing corresponding content in the text with a synonym, randomly replacing adjacent words, replacing a chinese equivalent word, translating into each other, and transforming into a sentence pattern (e.g., transforming into a flip sentence), so that through the above operations, a plurality of texts with the same semantics but different actual expression modes may be obtained.
For each paired image text, after data enhancement processing is performed, a plurality of images and texts obtained after data enhancement are paired to obtain a plurality of training samples, and then the sample texts and the sample images after data cleaning and data enhancement processing can be used as first training data. Therefore, a large number of training samples can be quickly obtained through the method based on the plurality of paired image texts, manpower and material resources are saved, the quality of training data can be improved while the scale of the training data is enlarged, and the object recognition effect obtained in the process of subsequent image recognition is improved.
According to the technical scheme, the image coding model to be trained and the text coding model to be trained are jointly trained on the basis of the sample text features and the sample image feature queue corresponding to the positive sample and the sample image feature and sample text feature queue corresponding to the positive sample, so that the pre-trained image coding model and the pre-trained text coding model are obtained, a queue mechanism can be introduced into multi-modal training, the dependence on computing resources can be reduced while the contrast learning effect can be enhanced, multi-modal training is performed by adopting contrast learning, sufficient interaction between modes can be realized, and the training effect is improved.
In an exemplary embodiment, based on a sample text feature and sample image feature queue corresponding to a positive sample and a sample image feature and sample text feature queue corresponding to the positive sample, model training is performed on an image coding model to be trained and a text coding model to be trained to obtain a pre-trained image coding model and a pre-trained text coding model, including: determining a first similarity according to the sample text features corresponding to the positive sample and the sample image feature queue, and determining a second similarity according to the sample image features corresponding to the positive sample and the sample text feature queue; determining a target loss value according to the first similarity and the second similarity; and adjusting model parameters in the image coding model to be trained and model parameters in the text coding model to be trained according to the target loss value until a first training end condition is met, so as to obtain the pre-trained image coding model and the pre-trained text coding model.
Because the same object can be described and represented through the information of multiple modalities, for example, when a certain object is described, the object can be expressed through characters, voice or images, through similarity calculation, the similarity can represent the similarity between the image features corresponding to the image modality and the text features corresponding to the text modality, and thus the matched text features and image features can be determined for the same described object.
As an example, the first training end condition may include maximizing a first comparison result, such as an inner product, of the same paired sample text and sample image, and minimizing a second comparison result, such as an inner product, of an unpaired sample text and an irrelevant feature in the sample image.
In practical application, the similarity of the sample text feature corresponding to the positive sample and the feature in the sample image feature queue may be respectively calculated to obtain a first similarity, and the similarity of the sample image feature corresponding to the positive sample and the feature in the sample text feature queue may be respectively calculated to obtain a second similarity.
For example, a corresponding matrix may be generated by calculating an inner product corresponding to each text feature and image feature based on a sample text feature and sample image feature queue corresponding to a positive sample and a sample image feature and sample text feature queue corresponding to a positive sample, and then an objective function may be used to maximize the similarity of positive samples and minimize the similarity of negative samples during model training to determine a target loss value.
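One common way to realise "maximize the similarity of positive samples and minimize the similarity of negative samples" is an InfoNCE-style cross-entropy over the inner-product matrix; the sketch below follows that choice under the assumption that features are L2-normalised and that the paired feature for row i sits at column i, and the temperature value is illustrative.
```python
import torch
import torch.nn.functional as F

def target_loss(text_feats, image_feats, image_queue, text_queue, temperature=0.07):
    # First similarity: positive-sample text features against the sample image feature queue.
    logits_t2i = text_feats @ torch.cat([image_feats, image_queue]).T / temperature
    # Second similarity: positive-sample image features against the sample text feature queue.
    logits_i2t = image_feats @ torch.cat([text_feats, text_queue]).T / temperature
    # For row i the paired (positive) feature is at column i; all queue entries are negatives.
    targets = torch.arange(text_feats.size(0))
    return 0.5 * (F.cross_entropy(logits_t2i, targets) + F.cross_entropy(logits_i2t, targets))
```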
In an example, after the target loss value is determined, the model parameters in the image coding model to be trained and the model parameters in the text coding model to be trained may be adjusted according to the target loss value to reduce the target loss value corresponding to the next training process, and then the model training process may be repeated until a training end condition is satisfied, for example, the current target loss value tends to be stable, and the fluctuation of the target loss value is smaller than a threshold value. Therefore, the image coding model to be trained and the text coding model to be trained can be subjected to self-supervision model training by adopting the pairing relation training sample and the unpaired relation training sample based on a comparison learning mode, supervision information can be automatically constructed according to a large amount of matched and unpaired training data, the text mode and the image mode are aligned, so that the modes are fully interacted, the effect is better, and the reliable pre-trained image coding model and the pre-trained text coding model are obtained.
In an optional embodiment, aiming at a text encoder and an image encoder in a multi-modal contrast learning framework, other networks and training methods can be adopted to achieve the alignment effect of a text mode and an image mode; the loss values for contrast learning and model training may also have a variety of forms or strategies; the form of calculation of the similarity in recognition may not be fixed, and other forms may be adopted.
According to the technical scheme of the embodiment, the multi-mode contrast learning method for determining the target loss value is adopted to integrate the information of the text mode into the information of the image mode to obtain the pre-trained image coding model and the pre-trained text coding model which are aligned with the text mode and the image mode by calculating the similarity according to the sample text feature and the sample image feature queue corresponding to the positive sample and the sample image feature and the sample text feature queue corresponding to the positive sample, so that a basis is provided for the subsequent image recognition processing.
In an exemplary embodiment, the pre-trained feature map generation model is obtained by training the following method: acquiring second training data; inputting sample image data into a feature map generation model to be trained to obtain a sample feature map, and inputting the sample feature map into a pre-trained object region identification model to obtain a second image feature; obtaining an annotated region image obtained by cutting sample image data according to the annotated frame information, and inputting the annotated region image into a pre-trained image coding model to obtain a first image characteristic; determining an alignment loss value according to the alignment result of the first image characteristic and the second image characteristic; and adjusting the model parameters in the feature map generation model to be trained according to the alignment loss value until a second training end condition is met to obtain the pre-trained feature map generation model.
The second training data may include sample image data carrying annotation box information. For example, an open-source object detection data set may be acquired, and the sample images, the classes of the sample images, and the corresponding target box information may be obtained according to the classes and target box information in the object detection data set and used as the sample image data. The sample image data may include a sample image, the class of the sample image (which may be used as a label), and target box information (x, y, w, h), where x, y may be the coordinates of one corner A among the four corners of the target box, and w, h may be the width and height of the target box measured from corner A; the image area inside the target box may be the area where an object is located.
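Cropping the annotated region image from the (x, y, w, h) target box can be done as in the brief sketch below (assuming PIL-style images and the corner/width/height convention described above).
```python
from PIL import Image

def crop_annotated_region(image: Image.Image, box):
    x, y, w, h = box                          # corner coordinates plus width and height
    return image.crop((x, y, x + w, y + h))   # annotated region image inside the target box
```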
As an example, the second training end condition may include the first image feature and the second image feature being aligned in a feature space.
In an example, the second training data may be used to perform model training on an object region identification model to be trained, where the object region identification model to be trained may be a multi-layer CNN convolutional network constructed as an RPN (Region Proposal Network, as used in the Faster R-CNN network), and the pre-trained object region identification model may be obtained by training in the following manner:
A preset number (a batch) of second training data is input, and the sample image data carrying annotation frame information can be processed by a multi-layer image convolution network to obtain a training feature map. The training feature map can then be input into the object region identification model to be trained, which determines from the feature map whether each pixel location belongs to a foreground area (such as a target frame area) or a background area of the image and records the relevant coordinate points, such as the position coordinates corresponding to the foreground area. Based on the target frame information, it can be judged whether the foreground and background areas output by the model are correct and whether the obtained foreground position coordinates are accurate; the corresponding deviation between the target frame information and the predicted foreground coordinates can be computed and used as the model loss to guide the training update until the model converges, yielding the pre-trained object region identification model. The RPN may also be replaced by other methods of extracting foreground/background regions.
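A minimal sketch of the kind of loss described here is given below, assuming the per-location outputs and ground-truth encodings noted in the docstring; the exact loss terms and their weighting are not specified in this embodiment, so this is only an RPN-style approximation:

```python
import torch
import torch.nn.functional as F

def rpn_style_loss(cls_logits, box_deltas, gt_labels, gt_deltas):
    """Hypothetical RPN-style loss: per-location foreground/background classification
    plus a coordinate-deviation term for locations labeled as foreground.

    cls_logits: (N, 2)  foreground/background scores per location
    box_deltas: (N, 4)  predicted offsets toward the annotated target frame
    gt_labels:  (N,)    1 = foreground (inside a target frame), 0 = background
    gt_deltas:  (N, 4)  offsets derived from the annotated target frame information
    """
    cls_loss = F.cross_entropy(cls_logits, gt_labels)

    fg = gt_labels == 1
    reg_loss = (F.smooth_l1_loss(box_deltas[fg], gt_deltas[fg])
                if fg.any() else box_deltas.sum() * 0.0)   # no foreground in this batch

    return cls_loss + reg_loss   # the "deviation information" that guides the model update
```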
In practical application, the multi-modal model (i.e., the pre-trained image coding model and the pre-trained text coding model) and the RPN (i.e., the pre-trained object region recognition model) may be combined, and the second training data used to perform model training on the feature map generation model to be trained, where the feature map generation model to be trained may be built from a multi-layer CNN; as shown in fig. 3b, the pre-trained feature map generation model may be obtained by training in the following manner:
By inputting the second training data, the sample image data carrying the annotation frame information can be processed by the feature map generation model to be trained to obtain a sample feature map. The sample feature map can then be input to the pre-trained object region identification model to obtain the output foreground box information and the image features corresponding to the foreground boxes. The image features corresponding to the foreground boxes can be resized, for example a feature image of a specified size can be generated by mapping based on the foreground box information, and convolution processing can be performed on each resized feature image to obtain the second image features, such as R1 and R2 in fig. 3b.
At the same time, the sample image data can be cropped according to the one or more pieces of annotation frame information it carries to obtain the images inside the annotation frames (i.e., the annotated region images), such as annotated region image 1 and annotated region image 2 in fig. 3b, and the cropped images can be input to the pre-trained image coding model to obtain the image features corresponding to the images in the annotation frames (i.e., the first image features), such as I1 and I2 in fig. 3b. Knowledge distillation processing can then be performed on the first image features and the second image features: the distillation loss is used for image feature alignment, the label information (such as the class of the sample image in the sample image data) is used to compute the prediction loss of the feature map generation model to be trained, and the alignment loss value obtained by distillation together with the prediction loss is used as the training loss to guide the model update until the model converges, yielding the pre-trained feature map generation model. The pre-trained feature map generation model thus has two capabilities: the image features it extracts are aligned in the feature space with the image features extracted by the multi-modal image encoder, and the obtained image features allow the region where an object is located in the image to be identified through the RPN.
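The following PyTorch-style sketch illustrates one possible training step combining the alignment (distillation) loss and the prediction loss described above. The model interfaces (an `rpn_model` that returns pooled region features, a frozen `image_encoder`, a separate `class_logits_head`) and the cosine-based alignment term are assumptions made for illustration; the optimizer is assumed to hold only the parameters of the feature map generation model and the classification head, so the pre-trained RPN and image encoder stay fixed:

```python
import torch
import torch.nn.functional as F

def feature_map_training_step(featmap_model, rpn_model, image_encoder,
                              images, region_crops, class_logits_head,
                              class_labels, optimizer):
    """One hypothetical step: align RPN-derived region features (R_i) with the
    frozen multi-modal image encoder's features (I_i), plus a prediction loss
    on the annotated class labels."""
    feature_map = featmap_model(images)            # sample feature map
    second_feat = rpn_model(feature_map)           # (R, D) region features R_i
    with torch.no_grad():                          # image encoder provides fixed targets
        first_feat = image_encoder(region_crops)   # (R, D) region features I_i

    # Alignment (distillation) loss: pull each R_i toward its matching I_i
    align_loss = 1.0 - F.cosine_similarity(
        F.normalize(second_feat, dim=-1),
        F.normalize(first_feat, dim=-1), dim=-1).mean()

    # Prediction loss on the class labels carried by the sample image data
    pred_loss = F.cross_entropy(class_logits_head(second_feat), class_labels)

    loss = align_loss + pred_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```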
According to the technical scheme of this embodiment, model training is performed on the feature map generation model to be trained using the sample image data in combination with the pre-trained image coding model and the pre-trained object region recognition model, so that, on the basis of combining multi-modal pre-training with an RPN, object elements can be recognized automatically across domains without excessive human intervention.
In an exemplary embodiment, before the step of obtaining a feature map corresponding to an image to be recognized, and recognizing area position information and area image features corresponding to an image area where at least one object is located from the feature map, the method further includes: acquiring a video to be processed, and determining a target video frame set from the video to be processed; taking each target video frame in the target video frame set as an image to be identified; after the step of generating the object recognition result of the image to be recognized according to the region position information corresponding to the target object and the text identification information obtained based on the object description text, the method further comprises the following steps: and obtaining an object identification result of the video to be processed according to the object identification result obtained from each target video frame.
In a specific implementation, as shown in fig. 4, a real-time frame extraction process may be performed on a video to be processed, a single frame image (i.e., a target video frame) is obtained from the video to be processed and is used as an image to be recognized, and then, by performing image recognition on a plurality of single frame images of the video to be processed, object elements included in the video to be processed are extracted by combining object recognition results (such as region position information and text identification information) corresponding to the single frame images, so as to obtain object recognition results of the video to be processed, such as all object elements included in all the single frame images.
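A simple sketch of this frame-extraction-and-aggregation flow is shown below, using OpenCV for frame reading; the fixed `frame_stride` sampling and the `recognize_image` callback are assumptions, since the embodiment does not fix how the target video frames are selected:

```python
import cv2

def recognize_video(video_path, recognize_image, frame_stride=30):
    """Sample target video frames from the video, run single-image recognition on
    each, and merge the per-frame results into one result for the whole video."""
    capture = cv2.VideoCapture(video_path)
    video_result, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % frame_stride == 0:                    # take one frame every `frame_stride`
            video_result.extend(recognize_image(frame))  # region positions + text identifiers
        index += 1
    capture.release()
    return video_result
```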
According to the technical scheme, the target video frame set is determined from the video to be processed by obtaining the video to be processed, each target video frame in the target video frame set is used as the image to be recognized, the object recognition result of the video to be processed is obtained according to each object recognition result obtained by each target video frame, the object elements in the video can be automatically extracted according to the video data in different fields, the generalization capability is high, and the image recognition efficiency is improved.
FIG. 5 is a flow chart illustrating another image recognition method according to an exemplary embodiment. As shown in FIG. 5, the method is used in a computer device, such as a server, and includes the following steps.
In step S510, first training data is obtained, and a pre-trained image coding model and a pre-trained text coding model are obtained by training according to the first training data.

In step S520, second training data is obtained, and a pre-trained feature map generation model is obtained by training according to the second training data in combination with the pre-trained object region identification model and the pre-trained image coding model.

In step S530, in response to an image recognition request, an object description text is obtained from the image recognition request, and the object description text is input to the pre-trained text coding model to obtain a target text feature.

In step S540, an image to be recognized is obtained, the image to be recognized is input to the pre-trained feature map generation model to obtain a feature map, and the feature map is input to the pre-trained object region recognition model to obtain region position information and region image features corresponding to the image region where at least one object is located.

In step S550, a mapping relationship between the image feature space and the text feature space is obtained, the mapping relationship being obtained according to the pre-trained text coding model and the pre-trained image coding model.

In step S560, the target text feature is taken as the text feature to be matched in the text feature space and each region image feature is taken as an image feature to be matched in the image feature space; the region image feature matched with the target text feature is determined according to the mapping relationship and taken as the target region image feature, and the object corresponding to the target region image feature is taken as the target object.

In step S570, an object recognition result of the image to be recognized is generated according to the region position information corresponding to the target object and the text identification information obtained based on the object description text.

It should be noted that, for the specific limitations of the above steps, reference may be made to the specific limitations of the image recognition method described above, and details are not repeated here.
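For illustration, the sketch below strings the inference steps S530 to S570 together as a single function, matching the encoded object description text against region features by cosine similarity; the model interfaces and the `sim_threshold` cutoff are assumptions made for this sketch rather than the claimed matching rule:

```python
import torch
import torch.nn.functional as F

def recognize(image, description, text_encoder, featmap_model, rpn_model,
              sim_threshold=0.3):
    """End-to-end inference sketch: encode the object description text, extract
    region features from the image, and pick the region whose feature is most
    similar to the text feature in the shared feature space."""
    text_feat = F.normalize(text_encoder([description]), dim=-1)     # (1, D) target text feature

    feature_map = featmap_model(image.unsqueeze(0))                  # feature map of the image
    boxes, region_feats = rpn_model(feature_map)                     # (R, 4) positions, (R, D) features
    region_feats = F.normalize(region_feats, dim=-1)

    sims = (region_feats @ text_feat.t()).squeeze(-1)                # (R,) similarities
    best = torch.argmax(sims)
    if sims[best] < sim_threshold:                                   # no region matches the description
        return None
    return {"box": boxes[best].tolist(), "label": description}      # region position + text identifier
```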
It should be understood that, although the steps in the flowcharts of the above embodiments are displayed sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
It is understood that the same/similar parts between the embodiments of the method described above in this specification can be referred to each other, and each embodiment focuses on the differences from the other embodiments, and it is sufficient that the relevant points are referred to the descriptions of the other method embodiments.
Based on the same inventive concept, the embodiment of the present disclosure further provides an image recognition apparatus for implementing the image recognition method.
Fig. 6 is a block diagram illustrating an image recognition apparatus according to an exemplary embodiment. Referring to fig. 6, the apparatus includes:
a text feature extraction unit 601 configured to perform, in response to an image recognition request carrying an object description text, extracting a target text feature corresponding to the object description text;
an image feature recognition unit 602 configured to perform acquiring a feature map corresponding to an image to be recognized, and recognizing, from the feature map, area position information and area image features corresponding to an image area where at least one object is located; the regional image features are image features of a foreground region;
a feature matching unit 603 configured to perform, if each of the area image features has a target area image feature that matches the target text feature, taking an object corresponding to the target area image feature as a target object described in the object description text;
an object recognition result generating unit 604 configured to perform generating an object recognition result of the image to be recognized according to the region position information corresponding to the target object and text identification information obtained based on the object description text.
In a possible implementation manner, the text feature extraction unit 601 is specifically configured to perform, in response to an image recognition request, obtaining an object description text from the image recognition request; inputting the object description text into a pre-trained text coding model to obtain the target text characteristics; the pre-trained text coding model is obtained by training based on the matched sample text and sample image in combination with the text coding model to be trained and the image coding model to be trained.
In a possible implementation manner, the image feature recognition unit 602 is specifically configured to perform acquiring an image to be recognized, and input the image to be recognized to a pre-trained feature map generation model to obtain the feature map; inputting the characteristic diagram into a pre-trained object region identification model to obtain region position information and region image characteristics corresponding to an image region where the at least one object is located; the pre-trained feature map generation model is obtained by training in combination with the pre-trained object region recognition model and the pre-trained image coding model; the pre-trained image coding model is used for outputting first image features in a training stage of a feature map generation model so as to adjust second image features processed by the feature map generation model to be trained and the pre-trained object region recognition model; the pre-trained image coding model is obtained by training based on the matched sample text and sample image in combination with the text coding model to be trained and the image coding model to be trained.
In a possible implementation manner, the feature matching unit 603 is specifically configured to perform obtaining a mapping relationship between an image feature space and a text feature space; the mapping relation is obtained according to a pre-trained text coding model and a pre-trained image coding model; and taking the target text features as text features to be matched in the text feature space, taking each regional image feature as an image feature to be matched in the image feature space, determining regional image features matched with the target text features as the target regional image features according to the mapping relation, and taking objects corresponding to the target regional image features as the target objects.
In one possible implementation, the apparatus further includes:
a first training data acquisition unit specifically configured to perform acquisition of first training data; the first training data comprise positive samples formed by matched sample texts and sample images and negative samples formed by unpaired sample texts and sample images, and the matched sample texts and sample images comprise texts and images of objects in different fields;
the image coding unit is specifically configured to input the first training data into an image coding model to be trained for coding, so as to obtain a sample image feature queue and sample image features corresponding to the positive samples;
the text coding unit is specifically configured to input the first training data into a text coding model to be trained for coding, so as to obtain a sample text feature queue and a sample text feature corresponding to the positive sample;
and the first model training unit is specifically configured to perform model training on the image coding model to be trained and the text coding model to be trained based on the sample text features and the sample image feature queue corresponding to the positive sample and the sample image features and the sample text feature queue corresponding to the positive sample, so as to obtain the pre-trained image coding model and the pre-trained text coding model.
In a possible implementation manner, the first model training unit is specifically configured to determine a first similarity according to a sample text feature corresponding to the positive sample and the sample image feature queue, and determine a second similarity according to a sample image feature corresponding to the positive sample and the sample text feature queue; determining a target loss value according to the first similarity and the second similarity; adjusting model parameters in the image coding model to be trained and model parameters in the text coding model to be trained according to the target loss value until a first training end condition is met to obtain the pre-trained image coding model and the pre-trained text coding model; the first training end condition includes maximizing a first comparison result for the same paired sample text and sample image, and minimizing a second comparison result for an unrelated feature in the unpaired sample text and sample image.
In one possible implementation, the apparatus further includes:
a second training data acquisition unit specifically configured to perform acquisition of second training data; the second training data comprises sample image data carrying labeling box information;
a second image feature obtaining unit, configured to perform input of the sample image data to a feature map generation model to be trained to obtain a sample feature map, and input of the sample feature map to the pre-trained object region identification model to obtain the second image feature;
a first image feature obtaining unit, configured to perform obtaining of an image of an annotation region obtained by clipping the sample image data according to the annotation frame information, and input the image of the annotation region to the pre-trained image coding model to obtain the first image feature;
an alignment unit, specifically configured to perform determining an alignment loss value according to an alignment result of the first image feature and the second image feature;
the second model training unit is specifically configured to adjust model parameters in the feature map generation model to be trained according to the alignment loss value until a second training end condition is met, so as to obtain the pre-trained feature map generation model; the second training end condition includes that the first image feature and the second image feature are aligned in a feature space.
In one possible implementation, the apparatus further includes:
the target video frame set acquisition unit is specifically configured to execute acquisition of a video to be processed and determine a target video frame set from the video to be processed;
an image to be recognized determining unit, specifically configured to perform, as the image to be recognized, each target video frame in the target video frame set;
the device further comprises:
and the video identification result obtaining unit is specifically configured to execute obtaining of an object identification result of the video to be processed according to each object identification result obtained from each target video frame.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The modules in the image recognition device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
Fig. 7 is a block diagram illustrating an electronic device 700 for implementing an image recognition method according to an example embodiment. For example, the electronic device 700 may be a server. Referring to fig. 7, electronic device 700 includes a processing component 720 that further includes one or more processors, and memory resources, represented by memory 722, for storing instructions, such as applications, that are executable by processing component 720. The application programs stored in memory 722 may include one or more modules that each correspond to a set of instructions. Further, the processing component 720 is configured to execute instructions to perform the above-described methods.
The electronic device 700 may further include: a power component 724 configured to perform power management for the electronic device 700, a wired or wireless network interface 726 configured to connect the electronic device 700 to a network, and an input/output (I/O) interface 728. The electronic device 700 may operate based on an operating system stored in the memory 722, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, or the like.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as the memory 722 comprising instructions, executable by a processor of the electronic device 700 to perform the above-described method is also provided. The storage medium may be a computer-readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, which includes instructions executable by a processor of the electronic device 700 to perform the above-described method.
It should be noted that the descriptions of the above-mentioned apparatus, the electronic device, the computer-readable storage medium, the computer program product, and the like according to the method embodiments may also include other embodiments, and specific implementations may refer to the descriptions of the related method embodiments, which are not described in detail herein.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. An image recognition method, characterized in that the method comprises:
responding to an image identification request carrying an object description text, and extracting a target text characteristic corresponding to the object description text;
acquiring a feature map corresponding to an image to be identified, and identifying area position information and area image features corresponding to an image area where at least one object is located from the feature map; the regional image features are image features of a foreground region;
if each regional image feature has a target regional image feature matched with the target text feature, taking an object corresponding to the target regional image feature as a target object described by the object description text;
and generating an object recognition result of the image to be recognized according to the area position information corresponding to the target object and text identification information obtained based on the object description text.
2. The method according to claim 1, wherein the extracting, in response to the image recognition request carrying the object description text, the target text feature corresponding to the object description text comprises:
responding to an image recognition request, and acquiring an object description text from the image recognition request;
inputting the object description text into a pre-trained text coding model to obtain the target text characteristics; the pre-trained text coding model is obtained by training based on the matched sample text and sample image in combination with the text coding model to be trained and the image coding model to be trained.
3. The method according to claim 1, wherein the obtaining a feature map corresponding to an image to be recognized, and recognizing area position information and area image features corresponding to an image area where at least one object is located from the feature map comprises:
acquiring an image to be recognized, and inputting the image to be recognized into a pre-trained feature map generation model to obtain a feature map;
inputting the characteristic diagram into a pre-trained object region recognition model to obtain region position information and region image characteristics corresponding to an image region where the at least one object is located;
the pre-trained feature map generation model is obtained by training in combination with the pre-trained object region identification model and the pre-trained image coding model; the pre-trained image coding model is used for outputting first image characteristics in a training stage of a characteristic diagram generation model so as to adjust second image characteristics processed by the characteristic diagram generation model to be trained and the pre-trained object region identification model; the pre-trained image coding model is obtained by training based on the matched sample text and sample image in combination with the text coding model to be trained and the image coding model to be trained.
4. The method according to claim 1, wherein if each of the area image features has a target area image feature that matches the target text feature, taking an object corresponding to the target area image feature as a target object described in the object description text includes:
acquiring a mapping relation between an image feature space and a text feature space; the mapping relation is obtained according to a pre-trained text coding model and a pre-trained image coding model;
and determining the regional image features matched with the target text features as the target regional image features according to the mapping relation, and taking the objects corresponding to the target regional image features as the target objects.
5. A method according to claim 2 or 3, wherein the pre-trained image coding model and the pre-trained text coding model are trained by:
acquiring first training data; the first training data comprises a positive sample consisting of matched sample texts and sample images and a negative sample consisting of unpaired sample texts and sample images, wherein the matched sample texts and sample images comprise texts and images of objects in different fields;
inputting the first training data into an image coding model to be trained for coding to obtain a sample image feature queue and sample image features corresponding to the positive samples;
inputting the first training data into a text coding model to be trained for coding to obtain a sample text feature queue and sample text features corresponding to the positive samples;
and performing model training on the image coding model to be trained and the text coding model to be trained based on the sample text features and the sample image feature queue corresponding to the positive sample and the sample image features and the sample text feature queue corresponding to the positive sample to obtain the pre-trained image coding model and the pre-trained text coding model.
6. The method according to claim 5, wherein the performing model training on the image coding model to be trained and the text coding model to be trained based on the sample text features corresponding to the positive sample and the sample image feature queue, and the sample image features corresponding to the positive sample and the sample text feature queue to obtain the pre-trained image coding model and the pre-trained text coding model comprises:
determining a first similarity according to the sample text features corresponding to the positive sample and the sample image feature queue, and determining a second similarity according to the sample image features corresponding to the positive sample and the sample text feature queue;
determining a target loss value according to the first similarity and the second similarity; adjusting model parameters in the image coding model to be trained and model parameters in the text coding model to be trained according to the target loss value until a first training end condition is met to obtain the pre-trained image coding model and the pre-trained text coding model; the first training end condition includes maximizing a first comparison result for the same paired sample text and sample image, and minimizing a second comparison result for an unrelated feature in the unpaired sample text and sample image.
7. The method of claim 3, wherein the pre-trained feature map generation model is trained by:
acquiring second training data; the second training data comprises sample image data carrying labeling box information;
inputting the sample image data into a feature map generation model to be trained to obtain a sample feature map, and inputting the sample feature map into the pre-trained object region identification model to obtain the second image feature;
obtaining an annotated region image obtained by cutting the sample image data according to the annotated frame information, and inputting the annotated region image into the pre-trained image coding model to obtain the first image characteristic;
determining an alignment loss value according to the alignment result of the first image characteristic and the second image characteristic;
adjusting model parameters in the feature diagram generation model to be trained according to the alignment loss value until a second training end condition is met to obtain the pre-trained feature diagram generation model; the second training end condition includes that the first image feature and the second image feature are aligned in a feature space.
8. The method according to claim 1, wherein before the step of obtaining a feature map corresponding to an image to be recognized and recognizing area position information and area image features corresponding to an image area where at least one object is located from the feature map, the method further comprises:
acquiring a video to be processed, and determining a target video frame set from the video to be processed;
taking each target video frame in the target video frame set as the image to be identified;
after the step of generating an object recognition result of the image to be recognized according to the region position information corresponding to the target object and the text identification information obtained based on the object description text, the method further includes:
and obtaining an object identification result of the video to be processed according to the object identification results obtained from the target video frames.
9. An image recognition apparatus, characterized in that the apparatus comprises:
the text feature extraction unit is configured to execute the steps of responding to an image identification request carrying an object description text and extracting a target text feature corresponding to the object description text;
the image feature recognition unit is configured to acquire a feature map corresponding to an image to be recognized, and recognize area position information and area image features corresponding to an image area where at least one object is located from the feature map; the regional image features are image features of a foreground region;
the feature matching unit is configured to execute, if each of the area image features has a target area image feature matched with the target text feature, taking an object corresponding to the target area image feature as a target object described by the object description text;
and the object recognition result generation unit is configured to execute the generation of the object recognition result of the image to be recognized according to the area position information corresponding to the target object and the text identification information obtained based on the object description text.
10. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the image recognition method of any one of claims 1 to 8.
11. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the image recognition method of any of claims 1-8.
12. A computer program product comprising instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the image recognition method of any one of claims 1 to 8.
CN202211041337.8A 2022-08-29 2022-08-29 Image recognition method, image recognition device, electronic equipment and storage medium Pending CN115331150A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211041337.8A CN115331150A (en) 2022-08-29 2022-08-29 Image recognition method, image recognition device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211041337.8A CN115331150A (en) 2022-08-29 2022-08-29 Image recognition method, image recognition device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115331150A true CN115331150A (en) 2022-11-11

Family

ID=83927938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211041337.8A Pending CN115331150A (en) 2022-08-29 2022-08-29 Image recognition method, image recognition device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115331150A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116467607A (en) * 2023-03-28 2023-07-21 阿里巴巴(中国)有限公司 Information matching method and storage medium
CN116467607B (en) * 2023-03-28 2024-03-01 阿里巴巴(中国)有限公司 Information matching method and storage medium
CN116383428A (en) * 2023-03-31 2023-07-04 北京百度网讯科技有限公司 Graphic encoder training method, graphic matching method and device
CN116383428B (en) * 2023-03-31 2024-04-05 北京百度网讯科技有限公司 Graphic encoder training method, graphic matching method and device
CN116778011A (en) * 2023-05-22 2023-09-19 阿里巴巴(中国)有限公司 Image generating method
CN116778011B (en) * 2023-05-22 2024-05-24 阿里巴巴(中国)有限公司 Image generating method
CN117292384A (en) * 2023-08-30 2023-12-26 北京瑞莱智慧科技有限公司 Character recognition method, related device and storage medium

Similar Documents

Publication Publication Date Title
CN108304835B (en) character detection method and device
CN115331150A (en) Image recognition method, image recognition device, electronic equipment and storage medium
WO2020228446A1 (en) Model training method and apparatus, and terminal and storage medium
WO2018019126A1 (en) Video category identification method and device, data processing device and electronic apparatus
US20190205618A1 (en) Method and apparatus for generating facial feature
US11625433B2 (en) Method and apparatus for searching video segment, device, and medium
CN110348362B (en) Label generation method, video processing method, device, electronic equipment and storage medium
CN111814620A (en) Face image quality evaluation model establishing method, optimization method, medium and device
US9633272B2 (en) Real time object scanning using a mobile phone and cloud-based visual search engine
WO2023040506A1 (en) Model-based data processing method and apparatus, electronic device, computer-readable storage medium, and computer program product
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN112364204A (en) Video searching method and device, computer equipment and storage medium
CN112699297A (en) Service recommendation method, device and equipment based on user portrait and storage medium
CN113204659B (en) Label classification method and device for multimedia resources, electronic equipment and storage medium
WO2021169642A1 (en) Video-based eyeball turning determination method and system
CN114092759A (en) Training method and device of image recognition model, electronic equipment and storage medium
CN113706481A (en) Sperm quality detection method, sperm quality detection device, computer equipment and storage medium
CN112766284A (en) Image recognition method and device, storage medium and electronic equipment
CN110321778B (en) Face image processing method and device and storage medium
CN113076905B (en) Emotion recognition method based on context interaction relation
CN113610034A (en) Method, device, storage medium and electronic equipment for identifying person entity in video
WO2018120575A1 (en) Method and device for identifying main picture in web page
CN116484224A (en) Training method, device, medium and equipment for multi-mode pre-training model
CN115909357A (en) Target identification method based on artificial intelligence, model training method and device
CN115599953A (en) Training method and retrieval method of video text retrieval model and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination