CN114880514A - Image retrieval method, image retrieval device and storage medium - Google Patents

Image retrieval method, image retrieval device and storage medium

Info

Publication number
CN114880514A
Authority
CN
China
Prior art keywords
image
information
retrieval
text
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210781107.9A
Other languages
Chinese (zh)
Other versions
CN114880514B (en)
Inventor
游强
王坚
李兵
余昊楠
胡卫明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Renmin Zhongke Beijing Intelligent Technology Co ltd
Original Assignee
Renmin Zhongke Beijing Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Renmin Zhongke Beijing Intelligent Technology Co ltd filed Critical Renmin Zhongke Beijing Intelligent Technology Co ltd
Priority to CN202210781107.9A priority Critical patent/CN114880514B/en
Publication of CN114880514A publication Critical patent/CN114880514A/en
Application granted granted Critical
Publication of CN114880514B publication Critical patent/CN114880514B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval of still image data
    • G06F16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583: Retrieval using metadata automatically derived from the content
    • G06F16/5846: Retrieval using metadata automatically derived from the content, using extracted text
    • G06F16/5866: Retrieval using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements using pattern recognition or machine learning
    • G06V10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75: Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757: Matching configurations of points or features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Information Retrieval, DB Structures And FS Structures Therefor (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses an image retrieval method, an image retrieval device and a storage medium. The image retrieval method comprises the following steps: receiving retrieval information; determining text information and image information associated with the retrieval information; performing feature representation based on an attention mechanism on the text information and the image information to generate image retrieval features corresponding to the retrieval information; and retrieving the image matched with the retrieval information according to the image retrieval features.

Description

Image retrieval method, image retrieval device and storage medium
Technical Field
The present application relates to the field of retrieval technologies, and in particular, to an image retrieval method, an image retrieval apparatus, and a storage medium.
Background
Image retrieval is a classic problem in computer application technology: the study of how to accurately and efficiently retrieve, from an image library, a set of images that conform to a textual semantic description, or to the appearance or semantics of a query image.
The published invention patent CN114048282A discloses a cross-modal image-text retrieval method and system based on local matching over a text tree. The method comprises: acquiring a data set, then preprocessing and partitioning it to obtain a training set; inputting the pictures and texts in the training set into their respective networks for feature extraction, to obtain picture features and text features; generating a text tree from the text features; computing image-text similarity from the text tree and the picture features and training the network by back propagation to obtain a cross-modal retrieval model; and inputting the data to be retrieved into the cross-modal retrieval model to obtain a retrieval result.
The published patent application CN114003753A discloses a picture retrieval method and device. The method comprises: extracting features of the picture to be retrieved to obtain a first feature vector; determining, from a feature library, the second feature vectors that meet a first similarity requirement with the first feature vector; clustering the second feature vectors and taking the resulting cluster centers as third feature vectors to serve as retrieval samples; for each third feature vector, determining, from the feature library, the fourth feature vectors that meet a second similarity requirement with that third feature vector; and determining the retrieval result corresponding to the picture to be retrieved from the fourth feature vectors. This method covers only the case of retrieval through an input picture.
Further, image retrieval is divided into two types according to the form of the input retrieval information:
1. searching images by text: the input retrieval information is a text keyword or sentence, and the output is a candidate image set;
2. searching images by image: the input retrieval information is an image, and the output is likewise a candidate image set.
Because different retrieval forms take retrieval information of different modalities as input, there is a large semantic gap between them, and text retrieval information and image retrieval information are difficult to fuse and learn in the same semantic space. Retrieval is therefore difficult to perform under a single retrieval framework and can only be carried out in separate frameworks.
In addition, when retrieving images based on their semantic features, the retrieval may be disturbed by the apparent features of the image. The apparent features of an image include the color, texture, style and other characteristics it exhibits. When the semantic subject of an image is relatively definite and occupies a large area in the image, the apparent information of the image can often sufficiently represent the semantic subject. However, when the image area occupied by the semantic subject is small, the difference between the appearance and the image semantics is large, so that the apparent features of the image interfere with retrieval based on the image's semantic features.
For the technical problems in the prior art that text retrieval information and image retrieval information are difficult to retrieve under the same retrieval framework, and that when the image area occupied by the semantic subject is small the apparent features of an image easily interfere with retrieval based on semantic features, no effective solution has yet been proposed.
Disclosure of Invention
Embodiments of the present disclosure provide an image retrieval method, an image retrieval device, and a storage medium, to at least solve the technical problems in the prior art that text retrieval information and image retrieval information are difficult to retrieve under the same retrieval framework, and that when the image area occupied by the semantic subject is small, the apparent features of an image are likely to interfere with retrieval based on semantic features.
According to an aspect of an embodiment of the present disclosure, there is provided an image retrieval method including: receiving retrieval information; determining text information and image information associated with the retrieval information; performing feature representation based on an attention mechanism on the text information and the image information to generate image retrieval features corresponding to the retrieval information; and retrieving the image matched with the retrieval information according to the image retrieval features.
According to another aspect of the embodiments of the present disclosure, there is also provided a storage medium including a stored program, wherein the method described above is performed by a processor when the program is executed.
According to another aspect of the embodiments of the present disclosure, there is also provided an image retrieval apparatus including: a retrieval information receiving module for receiving retrieval information; an information determining module for determining text information and image information associated with the retrieval information; an image retrieval feature generation module for performing feature representation based on an attention mechanism on the text information and the image information and generating image retrieval features corresponding to the retrieval information; and an image retrieval module for retrieving the image matched with the retrieval information according to the image retrieval features.
According to another aspect of the embodiments of the present disclosure, there is also provided an image retrieval apparatus including: a processor; and a memory coupled to the processor for providing the processor with instructions for the following processing steps: receiving retrieval information; determining text information and image information associated with the retrieval information; performing feature representation based on an attention mechanism on the text information and the image information to generate image retrieval features corresponding to the retrieval information; and retrieving the image matched with the retrieval information according to the image retrieval features.
In the embodiment of the present disclosure, after the retrieval information is received, it is not directly input to the feature representation model; instead, text information and image information associated with the retrieval information are generated from it, and the feature representation model then generates image features associated with the retrieval information from that text information and image information. In this way, regardless of whether the user inputs keywords and sentences or an image, the technical scheme of the disclosure generates text information and image information associated with the retrieval information input by the user, and then generates image features associated with the retrieval information from them. Thus, the semantic features of the text and the feature representation of the image can be fused through an attention mechanism, establishing the semantic relation between text and image, reducing the semantic gap between text and image in cross-modal retrieval, and enabling cross-modal retrieval within the same retrieval framework. Moreover, because a feature representation model based on an attention mechanism is adopted, the weights of different image parts can be allocated according to the semantic features, enhancing the expressive capacity of the image features. In this way, the interference of the apparent features of the image with retrieval based on the image's semantic features is reduced. This solves the technical problems in the prior art that text retrieval information and image retrieval information are difficult to retrieve under the same retrieval framework, and that when the image area occupied by the semantic subject is small, the apparent features of an image easily interfere with retrieval based on semantic features.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure. In the drawings:
fig. 1 is a hardware block diagram of a computing device for implementing the method according to embodiment 1 of the present disclosure;
fig. 2A is a schematic diagram of an image content retrieval system according to embodiment 1 of the present disclosure;
FIG. 2B is a schematic diagram of a feature representation model according to embodiment 1 of the present disclosure;
fig. 3 is a schematic flow chart of an image retrieval method according to a first aspect of embodiment 1 of the present disclosure;
fig. 4 is a schematic flow chart of the image detection and object image-tag library construction according to embodiment 1 of the present disclosure;
fig. 5 is a schematic diagram of an image retrieval apparatus according to embodiment 2 of the present disclosure; and
fig. 6 is a schematic diagram of an image retrieval apparatus according to embodiment 3 of the present disclosure.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. It is to be understood that the described embodiments are merely some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person skilled in the art from the embodiments disclosed herein without creative effort shall fall within the protection scope of the present disclosure.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some of the nouns or terms appearing in the description of the embodiments of the present disclosure are explained as follows:
Retrieval text: in text-based image retrieval, the input text information, such as keywords or sentences, used for retrieving images.
Retrieval image: in image-based image retrieval, the input image used for retrieving images.
Object image: an image in which the object occupies the main body portion.
Example 1
According to the present embodiment, there is provided a method embodiment of an image retrieval method. It is noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, such as one executing a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different from the one here.
The method embodiments provided by the present embodiment may be executed in a mobile terminal, a computer terminal, a server or a similar computing device. Fig. 1 shows a hardware configuration block diagram of a computing device for implementing the image retrieval method. As shown in fig. 1, the computing device may include one or more processors (which may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory for storing data, and a transmission device for communication functions. In addition, it may also include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and does not limit the structure of the electronic device. For example, the computing device may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
It should be noted that the one or more processors and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single, stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computing device. As referred to in the disclosed embodiments, the data processing circuit acts as a processor control (e.g., selection of a variable resistance termination path connected to the interface).
The memory may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the image retrieval method in the embodiments of the present disclosure, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, so as to implement the image retrieval method of the application program. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory located remotely from the processor, which may be connected to the computing device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device is used for receiving or transmitting data via a network. Specific examples of such networks may include wireless networks provided by communication providers of the computing devices. In one example, the transmission device includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computing device.
It should be noted here that, in some alternative embodiments, the computing device shown in fig. 1 above may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both. It should be noted that fig. 1 is only one specific example and is intended to illustrate the types of components that may be present in the computing device described above.
Fig. 2A is a schematic diagram of an image content retrieval system based on an attention mechanism according to the present embodiment. Fig. 2B further illustrates a schematic diagram of the attention-based image feature representation model shown in fig. 2A. The image content retrieval system shown in fig. 2A runs on the computing device shown in fig. 1.
Referring to fig. 2A, the image content retrieval system based on the attention mechanism includes: the system comprises an image blocking module, an object detection module, a label extraction module, an object image-label library, a feature representation model based on an attention mechanism and a feature retrieval library.
Wherein the image blocking module is configured to divide the received search image into a plurality of image blocks (patch).
The object detection module determines the regions of the objects contained in the retrieval image and the tags corresponding to those objects using a preset object detection model. In addition, "object" as referred to in the present disclosure means an object of interest for object detection, such as an item like a computer or a cup, or a target such as a plant, an animal, or a person.
The label extraction module is used for determining the object labels contained in the retrieval text and searching corresponding object images from the object image-label library according to the object labels. Here, the "object image" refers to an image in which the object occupies a main body portion. For example, the object image whose tag information is "dog" is an image in which the dog occupies a main body portion in the image.
The feature representation model is used for generating image retrieval features corresponding to the received retrieval information according to the input text information and the image information. For example, in the present embodiment, the image retrieval features generated by the feature representation model may be features in the form of vectors, that is, feature vectors.
The feature retrieval library stores the image features of a plurality of images, so that images matching the retrieval information can be retrieved by feature matching.
In addition, the left part of fig. 2B (i.e., the left side of the vertical dashed line) further shows a schematic diagram of the feature representation model shown in fig. 2A, and the right part of fig. 2B shows a schematic diagram of training the feature representation model by real text description. As for the feature representation model, detailed description will be made below.
In the operating environment described above, according to the first aspect of the present embodiment, there is provided an image retrieval method implemented by the computing apparatus shown in fig. 1. Fig. 3 shows a flow diagram of the method, which, with reference to fig. 3, comprises:
s102: receiving retrieval information;
s104: determining text information and image information associated with the retrieval information;
s106: performing feature representation based on an attention mechanism on the text information and the image information to generate image retrieval features corresponding to the retrieval information; and
s108: and searching the image matched with the search information according to the image search characteristic.
Specifically, referring to fig. 2A, the image content retrieval system of the computing device may receive retrieval information from a user, which may be, for example, a retrieval image (i.e., searching by image), a retrieval text (e.g., searching by keywords or sentences), or a combination of a retrieval image and a retrieval text (S102).
The image content retrieval system then determines text information and image information associated with the retrieval information (S104). For example, when the retrieval information is a retrieval image, the image content retrieval system may take, as the associated image information, the image blocks obtained by dividing the retrieval image with the image blocking module and the object regions detected in the retrieval image by the object detection module, and take the object tags that the object detection module determines for those object regions as the associated text information. When the retrieval information is a retrieval text, the image content retrieval system may take the object tags extracted by the tag extraction module, together with the retrieval text itself, as the associated text information, and take the object images obtained by matching the object tags against the object image-tag library as the image information.
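As a minimal illustrative sketch of this routing step (S104), the following Python function can be considered; the helper names detect_objects, split_into_patches, extract_tags and lookup_object_images are hypothetical stand-ins for the object detection module, image blocking module, tag extraction module and object image-tag library of fig. 2A, and are not prescribed by the present disclosure:

    def determine_associated_info(retrieval_image=None, retrieval_text=None):
        # Route the retrieval input to its associated text and image information.
        # The four helpers below are hypothetical module interfaces.
        text_info, image_info = [], []
        if retrieval_image is not None:
            # Object detection yields object regions plus their object tags.
            regions, tags = detect_objects(retrieval_image)
            # Image blocking yields non-overlapping image blocks (patches).
            patches = split_into_patches(retrieval_image)
            image_info.extend(patches + list(regions))
            text_info.extend(tags)
        if retrieval_text is not None:
            tags = extract_tags(retrieval_text)  # object tags found in the text
            text_info.extend([retrieval_text] + tags)
            if retrieval_image is None:
                # Match the tags against the object image-tag library.
                image_info.extend(lookup_object_images(tags))
        return text_info, image_info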
Then, referring to fig. 2A, the image content retrieval system generates the image retrieval features corresponding to the retrieval information from the determined text information and image information, using the attention-based feature representation model (S106). As shown in fig. 2B, the feature representation model includes, for example, a multi-layer Transformer encoder, so that the image retrieval features corresponding to the retrieval information are generated by the Transformer encoder.
Then, the image content retrieval system performs matching retrieval in the feature retrieval library based on the image retrieval features generated by the feature representation model, thereby retrieving an image matching the retrieval information (S108). For example, the image content retrieval system may retrieve image features matching the image retrieval features of the retrieval information in the feature retrieval library, and regard the image associated with the matching image features as the image matching the retrieval information.
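A brute-force sketch of this matching step (S108), assuming the feature retrieval library is held as a NumPy matrix of feature vectors and cosine similarity is the matching measure; the patent does not specify an index structure, and a production system would more likely use an approximate nearest-neighbor index:

    import numpy as np

    def retrieve_top_k(query_feature, library_features, k=10):
        # Cosine similarity between the query feature and every library feature.
        q = query_feature / np.linalg.norm(query_feature)
        lib = library_features / np.linalg.norm(library_features, axis=1,
                                                keepdims=True)
        scores = lib @ q
        # Indices of the k best-matching library images.
        return np.argsort(-scores)[:k]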
As described in the background, since different retrieval forms input retrieval information of different modalities with a large semantic gap, text retrieval information and image retrieval information are difficult to fuse and learn in the same semantic space; retrieval is therefore difficult to perform in the same retrieval framework and can only be performed in separate frameworks. In addition, when retrieving images based on their semantic features, the retrieval may be disturbed by the apparent features of the image. When the image area occupied by the semantic subject is small, the difference between the apparent features and the image semantics is large, so that the apparent features of the image interfere with retrieval based on the image's semantic features.
In view of the above, according to the technical solution of the present disclosure, after the retrieval information is received, it is not directly input to the feature representation model; instead, text information and image information associated with the retrieval information are generated from it, and the feature representation model then generates image features associated with the retrieval information from that text information and image information. In this way, regardless of whether the user inputs keywords and sentences or an image, the technical scheme of the disclosure generates text information and image information associated with the retrieval information input by the user, and then generates image features associated with the retrieval information from them. Thus, the semantic features of the text and the feature representation of the image can be fused through an attention mechanism, establishing the semantic relation between text and image, reducing the semantic gap between text and image in cross-modal retrieval, and enabling cross-modal retrieval within the same retrieval framework. Moreover, because a feature representation model based on an attention mechanism is adopted, the weights of different image parts can be allocated according to the semantic features, enhancing the expressive capacity of the image features. For example, when retrieving an image of a dog, the representation of the image should focus on the dog regardless of whether the dog is in a room, in a forest, or on a beach, so that the dog is the subject of the image. With an attention-based feature representation model, the features in the image relevant to the dog can be given higher weight. For example, when the retrieval information is "the dog is in the room", the attention-based feature representation model may give a weight of 0.8 to "dog" and a weight of 0.2 to "room", ensuring that the feature representation of "dog" dominates the finally generated image features. In this way, the interference of the apparent features of the image with retrieval based on the image's semantic features is reduced. This solves the technical problems in the prior art that text retrieval information and image retrieval information are difficult to retrieve under the same retrieval framework, and that when the image area occupied by the semantic subject is small, the apparent features of an image easily interfere with retrieval based on semantic features.
Optionally, in a case that the search information is a search image, the operation of determining text information and image information associated with the search information includes: determining an object region contained in the retrieval image and label information associated with the object region through a preset object detection model; dividing a retrieval image into a plurality of image blocks; and regarding the tag information as text information associated with the retrieval image, and regarding the object area and the image block as image information associated with the retrieval image.
Referring to fig. 2A, when a user inputs a search image for searching by using a method of searching images, an object detection module of the image content search system determines an object region and a tag corresponding to an object included in the search image by using a preset object detection model. For example, when a user inputs an image showing a dog on a beach, the object detection module detects an object area of the dog in the image and determines that the object tag associated with the object area is "dog".
In addition, the image blocking module of the image content retrieval system divides the retrieval image into a plurality of image blocks. For example, the technical scheme of the disclosure can directly divide the image into non-overlapping image blocks (Patch) of equal width and height, so that the image blocks are input into the feature representation model from left to right and from top to bottom, like a word sequence (a sketch of this splitting is given below). In addition, the technical scheme of the disclosure can also add a layer of window abstraction on top of the divided image blocks and control the association between image blocks through the movement of the windows, making the hierarchical association semantics of the image more flexible.
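A sketch of the direct, non-overlapping splitting just described, assuming a NumPy H x W x C image whose sides are divisible by the block size; the 56x56 default follows the ViT-style input size mentioned later in this embodiment:

    import numpy as np

    def split_into_patches(image, patch_size=56):
        # Split an H x W x C image into equal, non-overlapping square blocks,
        # ordered left-to-right and top-to-bottom like a word sequence.
        h, w, c = image.shape
        patches = (image
                   .reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(-1, patch_size, patch_size, c))
        return list(patches)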
Then, as shown with reference to fig. 2B, the image content retrieval system takes the tag information "dog" as text information associated with the retrieval image, and takes the object area of the dog detected in the retrieval image and the image block obtained by dividing the retrieval image as image information. The text information is thus input to the feature representation model together with the image information. For example, the image blocks shown in fig. 2B correspond to image blocks obtained by dividing the search image; the object region shown in fig. 2B corresponds to the object region of the dog detected in the search image; and the object tag shown in fig. 2B corresponds to tag information "dog" associated with the object region of the dog.
Therefore, in the technical scheme of the present disclosure, even when image retrieval is performed by the user inputting an image (i.e., searching by image), the associated tag information is extracted from the retrieval image, and this tag information is input to the feature representation model together with the image information, as the text information associated with the retrieval image, to generate the image retrieval features associated with the retrieval image. In this way, the image retrieval features that the feature representation model generates for the retrieval image through the attention mechanism can establish a semantic relation with the semantic features of the text information associated with the retrieval image and be fused in one feature space. Therefore, a retrieval image input as retrieval information and a retrieval text input as retrieval information can share one feature representation model to generate image retrieval features, and retrieval can be carried out under the same retrieval framework. In addition, by detecting the object regions in the retrieval image and inputting them to the feature representation model as image information, the technical scheme of the disclosure enables the feature representation model to assign appropriate weights to the object regions based on the attention mechanism, so that the generated image retrieval features remain consistent with the semantic features of the real text description of the retrieval image, avoiding interference from the image's apparent features. Furthermore, by dividing the retrieval image into a plurality of image blocks and inputting them to the feature representation model as image information, the technical scheme of the disclosure enables the feature representation model to assign different weights to each image block based on the attention mechanism, so that the generated image retrieval features reflect the semantic features expressed by the retrieval image more accurately. The accuracy of image retrieval is thus further improved.
In addition, in the technical solution of the present disclosure, before the input to the feature representation model, the image content retrieval system may select, from the generated image blocks, those whose IOU (intersection over union) with an object region exceeds a certain threshold and replace them with a black image (Masked Image). The technical scheme of the disclosure thus masks, during training, the image blocks that contain more of the object areas, and observes whether the trained model can attend to the masked areas. These masked areas contain the main regions of the image and represent richer high-level semantic information, such as specific objects. If the masked image blocks containing parts of the object area can be associated with the other image blocks, the finally obtained feature representation can represent the semantic information of the image's main body. The actually detected object region thus serves as an anchor point for the image's main body region.
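A sketch of this masking step, assuming boxes are (x1, y1, x2, y2) tuples; the 0.3 threshold is illustrative, since the patent only requires the IOU to exceed "a certain threshold":

    import numpy as np

    def iou(box_a, box_b):
        # Intersection over union of two (x1, y1, x2, y2) boxes.
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    def mask_object_patches(patches, patch_boxes, object_boxes, threshold=0.3):
        # Replace image blocks that overlap an object region with a black
        # image (Masked Image), as described above.
        masked = []
        for patch, pbox in zip(patches, patch_boxes):
            if any(iou(pbox, obox) > threshold for obox in object_boxes):
                masked.append(np.zeros_like(patch))
            else:
                masked.append(patch)
        return masked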
In addition, the image content retrieval system scales all image blocks and object regions to a uniform input size (the image block size follows ViT, with a 56x56 input).
In addition, the object detection model in this embodiment is a detection model that identifies an object in an image through a detection box (Bounding Box, BBox), a class label, and a confidence. Object detection is generally divided into general object detection and self-owned special object detection according to the usage scenario and the output categories: general object detection can directly use an open-source data set and a trained model as the object detection model, while self-owned special object detection requires constructing a private data set to complete the training of the object detection model. Existing object detection models include the single-stage YOLO series and the two-stage Faster R-CNN.
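As an illustration of such a detection model producing boxes, class labels and confidences, a sketch using torchvision's pretrained Faster R-CNN follows; the patent does not prescribe this particular library or model:

    import torch
    import torchvision

    # Open-source two-stage detector with weights pretrained on a public dataset.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    model.eval()

    def detect_objects(image_tensor, score_threshold=0.5):
        # image_tensor: float tensor of shape (3, H, W), values in [0, 1].
        with torch.no_grad():
            output = model([image_tensor])[0]
        keep = output["scores"] > score_threshold
        # Detection boxes (BBox), class labels and confidences.
        return output["boxes"][keep], output["labels"][keep], output["scores"][keep]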
Fig. 4 shows a specific flow of image detection and construction of the object image-tag library shown in fig. 2A.
The first step: preparation of a general image detection dataset and a pre-trained detection model
An existing open-source general object image-tag library and a general object detection model with trained weights are configured, so that a newly input image can directly obtain general object detection boxes and public object tags through the general object detection model. When the public tags cannot meet the requirements of a specific occasion, a special object detection model needs to be constructed based on the pre-trained general object detection model. For example, when the general object detection model performs detection, weights obtained by training on public data sets are loaded into the network model, which can then be used directly for general object detection, such as the detection of general objects like buildings, flowers and people. These weights can also serve as the initial weights for the subsequent training of the special object detection model. The object class of a generic object is, for instance, "building", while the object class of a special object may be an airport building, or even a proprietary class such as Daxing Airport. In addition, the weights of the general object detection model and the special object detection model described in the present disclosure refer to the parameter values of the parameters in their networks. A network model generally includes two parts: the network structure, and the weights given to the parameters in that structure. The network structure is relatively stable and does not change once the problem is determined, while the weights change with the training set and the training rounds. In the technical scheme of the disclosure, from the general object detection model to the special object detection model, the training process only concerns the change of the training set; the corresponding change is a change of weights, and the network structure is unchanged.
The second step: labeling of the special object image dataset and training of the special object detection model
Prepare an image set containing the special objects, use an image labeling tool (such as LabelImg) to label the special objects to be detected in the images, use the weights of the general object detection model pre-trained in the first step as the initial weights of the special object detection model, and perform fine-tune training based on the labeled special object data set to obtain the special object detection model.
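A hedged sketch of this fine-tune training, again using torchvision's Faster R-CNN as the assumed detector: the pretrained general-detection weights serve as the initial weights, the classification head is replaced for the special label set, and training runs over the labeled special object dataset. Here special_loader is an assumed DataLoader yielding images and targets in torchvision's detection format, and the class count is illustrative:

    import torch
    import torchvision
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

    # Start from the general object detection model's pretrained weights.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    num_special_classes = 5  # illustrative: special labels + background
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features,
                                                      num_special_classes)

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    model.train()
    for images, targets in special_loader:  # assumed labeled special dataset
        loss_dict = model(images, targets)  # dict of detection losses
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()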
The third step: object image-label library construction and update
From the open-source detection image dataset, the image areas framed by the object detection boxes are extracted to form generic object images, while their labels form public tags. Special object images and private tags are obtained in the same way from the labeled image set containing the special objects. For a newly received image, new object images and new tags are obtained through the general object detection model and the special object detection model respectively, screened through a threshold, and added into the object image-tag library to update it, providing a cross-modal semantic alignment source for subsequent unified feature representation training and feature extraction.
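A sketch of this update step, assuming the library is kept as a simple label-to-crops mapping and detections arrive in the (boxes, labels, scores) form produced by the detector above; the 0.8 screening threshold is illustrative:

    def update_object_image_tag_library(library, image, boxes, labels, scores,
                                        score_threshold=0.8):
        # Screen new detections through the threshold, then file each crop
        # under its tag; the object occupies the main body of each crop.
        for box, label, score in zip(boxes, labels, scores):
            if score < score_threshold:
                continue
            x1, y1, x2, y2 = (int(v) for v in box)
            crop = image[y1:y2, x1:x2]
            library.setdefault(label, []).append(crop)
        return library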
Through the above steps, two detection models (a general object detection model and a special object detection model) and an object image-tag library that can be updated with new images are obtained.
Optionally, the operation of performing feature representation based on an attention mechanism on the text information and the image information to generate an image retrieval feature corresponding to the retrieval information includes: carrying out linear mapping on the image block and the object area to generate an image block feature vector corresponding to the image block and an object area feature vector corresponding to the object area; performing word embedding processing on the tag information to generate a first semantic feature vector corresponding to the tag information; generating a second semantic feature vector corresponding to the label information according to the label information by using a semantic feature extraction model based on an attention mechanism; and generating image retrieval features corresponding to the retrieved image according to the image block feature vector, the object region feature vector, the first semantic feature vector and the second semantic feature vector by using a preset encoder.
Specifically, referring to fig. 2B, the feature representation model is provided with a linear mapping unit, a word embedding unit, a pre-trained BERT model (i.e., the attention-based semantic feature extraction model), an embedding unit, and a multi-layer Transformer encoder (where each Transformer layer is an attention-based feature extraction layer).
The image blocks obtained by dividing the retrieval image are input to the linear mapping unit and mapped into fixed-length feature vectors corresponding to the image blocks (i.e., image block feature vectors). The object regions detected in the retrieval image are likewise input to the linear mapping unit, which maps each object region to a fixed-length feature vector (i.e., an object region feature vector).
Further, the tag information associated with the object regions is input to the word embedding unit and converted into fixed-length shallow semantic feature vectors corresponding to the tag information (i.e., the first semantic feature vectors). The tag information is also input to the BERT model, yielding fixed-length deep semantic feature vectors (i.e., the second semantic feature vectors).
Then, the feature vectors corresponding to the image blocks and object regions output by the linear mapping unit, together with the semantic feature vectors output by the word embedding unit and the BERT model, are input to the multi-layer Transformer encoder, which generates the image retrieval features corresponding to the retrieval image from the input feature vectors.
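A compact PyTorch sketch of this mapping-and-encoding step; the model dimension, head count, layer count and mean pooling are all illustrative assumptions, as the patent fixes none of them:

    import torch
    import torch.nn as nn

    d_model, patch_dim = 768, 56 * 56 * 3  # flattened 56x56 RGB blocks

    linear_map = nn.Linear(patch_dim, d_model)  # block / object region mapping
    encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=12,
                                               batch_first=True)
    encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

    def encode(image_blocks, object_regions, text_vectors):
        # image_blocks: (j, patch_dim), object_regions: (k, patch_dim);
        # text_vectors: (l + m, d_model) from the word embedding unit and BERT.
        img = linear_map(torch.cat([image_blocks, object_regions], dim=0))
        sequence = torch.cat([img, text_vectors], dim=0).unsqueeze(0)
        # Pool the encoder output into one image retrieval feature vector.
        return encoder(sequence).mean(dim=1)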
For example, in the technical solution of the present embodiment, image block feature vectors Vib_1~Vib_j are generated from the divided image blocks; object region feature vectors Vob_1~Vob_k are generated from the object regions; shallow semantic feature vectors (i.e., first semantic feature vectors) VT1_1~VT1_l are generated from the tag information; and deep semantic feature vectors (i.e., second semantic feature vectors) VT2_1~VT2_m are generated from the tag information.
Here, j, k, l and m are the numbers of image block feature vectors, object region feature vectors, first semantic feature vectors and second semantic feature vectors, respectively.
Then, the feature representation model arranges the feature vectors Vib_1~Vib_j, Vob_1~Vob_k, VT1_1~VT1_l and VT2_1~VT2_m in sequence and inputs them to the multi-layer Transformer encoder, which generates the image retrieval features corresponding to the retrieval image from the input feature vectors.
Further, before they are input to the multi-layer Transformer encoder, the feature representation model generates, through the embedding unit, the embedding information corresponding to each feature vector Vib_1~Vib_j, Vob_1~Vob_k, VT1_1~VT1_l and VT2_1~VT2_m, such as token embedding information, position embedding information and segment embedding information. The token embedding information can, for example, be generated directly from the respective feature vectors. Furthermore, the segment embedding information corresponding to Vib_1~Vib_j and Vob_1~Vob_k can be marked as "0" to indicate image feature vectors, and that corresponding to VT1_1~VT1_l and VT2_1~VT2_m as "1" to indicate text feature vectors, so that image segments and text segments are distinguished by the segment embedding information. Also, the position embedding information corresponding to Vib_1~Vib_j and Vob_1~Vob_k can be marked as "1", and the position embedding information corresponding to VT1_1~VT1_l and VT2_1~VT2_m can be marked sequentially and incrementally.
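A sketch of the embedding unit as just described, with illustrative dimensions: the segment embedding marks image feature vectors as "0" and text feature vectors as "1", and the position embedding marks the image vectors as "1" with the text positions incrementing after them:

    import torch
    import torch.nn as nn

    d_model, num_img, num_txt = 768, 10, 8  # e.g. j + k image, l + m text vectors
    seg_table = nn.Embedding(2, d_model)    # segment ids: 0 = image, 1 = text
    pos_table = nn.Embedding(64, d_model)   # supports sequences up to length 64

    segment_ids = torch.tensor([0] * num_img + [1] * num_txt)
    position_ids = torch.tensor([1] * num_img + list(range(2, num_txt + 2)))

    def embed(feature_vectors):
        # feature_vectors: (num_img + num_txt, d_model); the token embedding
        # is taken directly from the feature vectors themselves.
        return feature_vectors + seg_table(segment_ids) + pos_table(position_ids)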
Therefore, in the above manner, the image features and semantic features of the retrieval image can be extracted more fully, so that the semantic features of the images matched by the finally generated image retrieval features remain consistent with the semantic features of the retrieval image.
Optionally, in a case where the retrieval information is a retrieval text, the operation of determining text information and image information associated with the retrieval information includes: extracting the tag information contained in the retrieval text; determining, in a preset object image-tag library, an object image matched with the tag information, wherein object images labeled with tag information are stored in the object image-tag library; and taking the retrieval text and the tag information as text information associated with the retrieval text, and the object image as image information associated with the retrieval text.
Referring to fig. 2A, when the search information input by the user is a search text (i.e., searching in a text search manner), the image content search system extracts the tag information in the search text through the tag extraction module. For example, when the text input by the user is "dog on beach", the tag extraction module extracts the tag information "dog".
The image content retrieval system then inputs the tag information "dog" into the object image-tag library, thereby acquiring an object image (i.e., an object image of the dog) matching the tag information "dog".
Then, referring to fig. 2B, the image content retrieval system inputs the tag information "dog" and the retrieval text "dog on beach" as text information for retrieval to the feature representation model, and also inputs the acquired object image of the dog as image information to the feature representation model, thereby performing image retrieval by combining the text information "dog" and "dog on beach" with the object image of the dog. The "object region" shown in fig. 2B corresponds to the object image of the dog; the object tag shown in fig. 2B corresponds to the tag information "dog"; and the retrieval text shown in fig. 2B corresponds to the retrieval text "dog on beach". The image blocks shown in fig. 2B are left empty (i.e., when the user inputs a retrieval text, no image blocks are input to the feature representation model).
Therefore, in the technical scheme of the disclosure, even when image retrieval is performed by the user inputting text, the associated tag information is still extracted from the text input by the user, and the object images matched with the tag information are obtained. These are input to the feature representation model together, as the text information and image information associated with the retrieval text, to generate the image retrieval features associated with the retrieval text, so that the image features of the retrieval text and those of a retrieval image can be fused in one feature space, and a retrieval text and a retrieval image input by the user can share one feature representation model to generate image retrieval features. In addition, by acquiring the object images matched with the tag information in the retrieval text and inputting them to the feature representation model as the image information associated with the retrieval text, the technical scheme of the disclosure enables the feature representation model to assign appropriate weights to the object image regions associated with the tag information based on the attention mechanism, so that the generated image retrieval features remain consistent with the semantic features of the retrieval text.
Further, when the number of object images matched with the tag information is excessive, the object images input to the feature representation model can be randomly selected from the matched object images.
Further, the operation of performing feature representation based on an attention mechanism on text information and image information to generate an image retrieval feature corresponding to the retrieval information includes: carrying out linear mapping on the object image to generate an object image characteristic vector corresponding to the object image; performing word embedding processing on the tag information to generate a first semantic feature vector corresponding to the tag information; generating a second semantic feature vector corresponding to the label information according to the label information by using a text feature extraction model based on an attention mechanism; performing word embedding processing on the retrieval text to generate a third semantic feature vector corresponding to the retrieval text; generating a fourth semantic feature vector corresponding to the retrieval text according to the retrieval text by using a text feature extraction model based on an attention mechanism; and generating image retrieval features corresponding to the retrieval text according to the object image feature vector, the first semantic feature vector, the second semantic feature vector, the third semantic feature vector and the fourth semantic feature vector by utilizing a preset encoder, wherein the encoder comprises a plurality of feature extraction layers based on an attention mechanism.
Specifically, referring to fig. 2B, the object images matching the tag information of the retrieval text (corresponding to the object regions in fig. 2B) are input to the linear mapping unit, which maps each object image to a fixed-length feature vector (i.e., an object region feature vector).
Further, the tag information extracted from the retrieval text is input to the word embedding unit and converted into fixed-length shallow semantic feature vectors corresponding to the tag information (i.e., the first semantic feature vectors). The tag information is also input to the BERT model, yielding fixed-length deep semantic feature vectors (i.e., the second semantic feature vectors).
The retrieval text is likewise input to the word embedding unit and converted into fixed-length shallow semantic feature vectors corresponding to the retrieval text (i.e., the third semantic feature vectors). The retrieval text is also input to the BERT model, yielding fixed-length deep semantic feature vectors (i.e., the fourth semantic feature vectors).
Then, the feature vectors corresponding to the object images output by the linear mapping unit, together with the semantic feature vectors output by the word embedding unit and the BERT model, are input to the multi-layer Transformer encoder, which generates the image retrieval features corresponding to the retrieval text from the input feature vectors.
For example, in the technical solution of the present embodiment, object region feature vectors Vob_1~Vob_k are generated from the object images; shallow semantic feature vectors (i.e., first semantic feature vectors) VT1_1~VT1_l and deep semantic feature vectors (i.e., second semantic feature vectors) VT2_1~VT2_m are generated from the tag information; and shallow semantic feature vectors (i.e., third semantic feature vectors) VT3_1~VT3_n and deep semantic feature vectors (i.e., fourth semantic feature vectors) VT4_1~VT4_o are generated from the retrieval text.
Here, k, l, m, n and o are the numbers of object region feature vectors, first semantic feature vectors, second semantic feature vectors, third semantic feature vectors and fourth semantic feature vectors, respectively.
Then, the feature representation model arranges the feature vectors Vob_1~Vob_k, VT1_1~VT1_l, VT2_1~VT2_m, VT3_1~VT3_n and VT4_1~VT4_o in sequence and inputs them to the multi-layer Transformer encoder, which generates the image retrieval features corresponding to the retrieval text from the input feature vectors.
Further, before they are input to the multi-layer Transformer encoder, the feature representation model generates, through the embedding unit, the embedding information corresponding to each feature vector Vob_1~Vob_k, VT1_1~VT1_l, VT2_1~VT2_m, VT3_1~VT3_n and VT4_1~VT4_o, such as token embedding information, position embedding information and segment embedding information. The token embedding information can, for example, be generated directly from the respective feature vectors. Furthermore, the segment embedding information corresponding to Vob_1~Vob_k can be marked as "0" to indicate image feature vectors, and that corresponding to VT1_1~VT1_l, VT2_1~VT2_m, VT3_1~VT3_n and VT4_1~VT4_o as "1" to indicate text feature vectors, so that image segments and text segments are distinguished by the segment embedding information. Also, the position embedding information corresponding to Vob_1~Vob_k can be marked as "1", and the position embedding information corresponding to VT1_1~VT1_l, VT2_1~VT2_m, VT3_1~VT3_n and VT4_1~VT4_o can be marked sequentially and incrementally.
In this way, the image features and the semantic features in the search text can be extracted more fully, so that the semantic features of the images matched by the finally generated image retrieval feature remain consistent with the semantic features of the search text.
Alternatively, in a case where the retrieval information includes a retrieval text and a retrieval image, the operation of determining text information and image information associated with the retrieval information includes: determining an object region contained in the retrieval image and label information associated with the object region through a preset object detection model; dividing a retrieval image into a plurality of image blocks; and regarding the tag information and the search text as text information associated with the search information, and regarding the object area and the image block as image information associated with the search information.
Referring to fig. 2A, the user may also combine image and text, inputting a retrieval image and a retrieval text at the same time for image retrieval. For example, the retrieval image input by the user shows a dog on the beach, and the retrieval text input by the user is "dog on the beach".
An object detection module of the image content retrieval system determines an area of an object included in a retrieval image and a tag corresponding to the object using a preset object detection model. For example, the object detection module detects an object region of a dog in the image and determines that the object tag associated with the object region is "dog".
In addition, the image partitioning module of the image content retrieval system divides the retrieval image into a plurality of image blocks. For example, the technical scheme of the disclosure can directly divide the image into non-overlapping image blocks (Patch) of equal width and height, so that the image blocks are input to the feature representation model in word order, from left to right and from top to bottom; a sketch of this division follows. In addition, the technical scheme of the disclosure can also add a layer of window abstraction on top of the divided image blocks and control the association between image blocks through window shifting, making the hierarchical association semantics of the image more flexible.
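A minimal sketch of the non-overlapping, equal-size patch division described above (the 16-pixel patch size and the raster ordering are illustrative assumptions; the shifted-window variant is not shown):

```python
import torch

def divide_into_patches(image, patch_size=16):
    """Split a (C, H, W) image into non-overlapping patches of equal width and height,
    ordered left-to-right, top-to-bottom like words in a sentence.
    Assumes H and W are divisible by patch_size."""
    c, h, w = image.shape
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    # (C, H/p, W/p, p, p) -> (num_patches, C, p, p) in raster order
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c, patch_size, patch_size)
    return patches

img = torch.randn(3, 224, 224)
print(divide_into_patches(img).shape)    # torch.Size([196, 3, 16, 16])
```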
Then, as shown in fig. 2B, the image content retrieval system takes the tag information "dog" and the retrieval text "dog on beach" as text information associated with the retrieval information, and takes the object area of the dog detected in the retrieval image and the image block obtained by dividing the retrieval image as image information. The text information is thus input to the feature representation model together with the image information. For example, the image blocks shown in fig. 2B correspond to image blocks obtained by dividing the search image; the object region shown in fig. 2B corresponds to the object region of the dog detected in the search image; the object tag shown in fig. 2B corresponds to tag information "dog" associated with an object area of the dog; and the search text shown in fig. 2B corresponds to the search text "dog on beach" input by the user.
Therefore, in the technical scheme of the disclosure, the user can input a retrieval text and a retrieval image at the same time to perform image retrieval, and the feature representation model can fuse the semantic features of the retrieval text and the retrieval image into one feature space through an attention mechanism, so that text and image can be combined for image retrieval under the same retrieval framework. In addition, by detecting the object regions in the retrieval image and inputting the detected object regions to the feature representation model as image information, the feature representation model can assign appropriate weights to the object regions based on the attention mechanism, so that the generated image retrieval feature stays consistent with the semantic features of the real text description of the retrieval image, avoiding interference from the apparent features of the image. Furthermore, by dividing the retrieval image into a plurality of image blocks and inputting the divided image blocks to the feature representation model as image information, the feature representation model can assign different weights to each image block based on the attention mechanism, so that the generated image retrieval feature more accurately reflects the semantic features expressed by the retrieval image. The accuracy of image retrieval is thus further improved.
Further, the operation of performing feature representation based on an attention mechanism on text information and image information to generate an image retrieval feature corresponding to the retrieval information includes: carrying out linear mapping on the image block and the object area to generate an image block feature vector corresponding to the image block and an object area feature vector corresponding to the object area; performing word embedding processing on the tag information to generate a first semantic feature vector corresponding to the tag information; generating a second semantic feature vector corresponding to the label information according to the label information by using a text feature extraction model based on an attention mechanism; performing word embedding processing on the retrieval text to generate a third semantic feature vector corresponding to the retrieval text; generating a fourth semantic feature vector corresponding to the retrieval text according to the retrieval text by using a text feature extraction model based on an attention mechanism; and generating image retrieval features corresponding to the retrieved image according to the image block feature vector, the object region feature vector, the first semantic feature vector, the second semantic feature vector, the third semantic feature vector and the fourth semantic feature vector by using a preset encoder, wherein the encoder comprises a plurality of feature extraction layers based on an attention mechanism.
Referring to fig. 2B, the image blocks into which the retrieval image is divided are input to the linear mapping unit, which maps each image block to a fixed-length feature vector (i.e., an image block feature vector). The object regions detected in the retrieval image are likewise input to the linear mapping unit, which maps each object region to a fixed-length feature vector (i.e., an object region feature vector); a sketch of this mapping follows.
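The linear mapping step might look like the following sketch, modeled on ViT-style patch embedding: each patch (or object region resized to the same resolution) is flattened and projected by a single linear layer. The dimensions and the shared projection are assumptions.

```python
import torch
import torch.nn as nn

patch_size, d = 16, 768
linear_mapping = nn.Linear(3 * patch_size * patch_size, d)   # one shared projection

def map_patches(patches):
    # patches: (num, 3, p, p) -> fixed-length feature vectors (num, d), i.e. Vib_1..Vib_j
    return linear_mapping(patches.flatten(start_dim=1))

# An object region can be resized to the same patch resolution and mapped the same way,
# yielding the object region feature vectors Vob_1..Vob_k.
patches = torch.randn(196, 3, patch_size, patch_size)
print(map_patches(patches).shape)        # torch.Size([196, 768])
```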
Further, the tag information associated with the object regions is input to the word embedding unit and converted into a fixed-length shallow semantic feature vector (i.e., the first semantic feature vector). The tag information is also input to the BERT model, yielding a fixed-length deep semantic feature vector (i.e., the second semantic feature vector).
The search text is likewise input to the word embedding unit and converted into a fixed-length shallow semantic feature vector (i.e., the third semantic feature vector); it is also input to the BERT model, yielding a fixed-length deep semantic feature vector (i.e., the fourth semantic feature vector).
Then, the feature vectors corresponding to the image block and the object region output by the linear mapping unit and the semantic feature vectors output by the word embedding unit and the BERT model are input to a multi-layer Transformer encoder, and an image retrieval feature corresponding to a retrieval image is generated from the input feature vectors by the encoder.
For example, in the technical solution of the present embodiment, image block feature vectors Vib_1~Vib_j are generated from the divided image blocks; object region feature vectors Vob_1~Vob_k are generated from the object regions; shallow semantic feature vectors (i.e., first semantic feature vectors) VT1_1~VT1_l of the tag information are generated; deep semantic feature vectors (i.e., second semantic feature vectors) VT2_1~VT2_m of the tag information are generated; shallow semantic feature vectors (i.e., third semantic feature vectors) VT3_1~VT3_n of the search text are generated; and deep semantic feature vectors (i.e., fourth semantic feature vectors) VT4_1~VT4_o of the search text are generated.
Here, j, k, l, m, n, and o are respectively the numbers of image block feature vectors, object region feature vectors, first semantic feature vectors, second semantic feature vectors, third semantic feature vectors, and fourth semantic feature vectors.
Then, the feature representation model arranges the feature vectors Vib_1~Vib_j, Vob_1~Vob_k, VT1_1~VT1_l, VT2_1~VT2_m, VT3_1~VT3_n, and VT4_1~VT4_o in sequence and inputs them to a multi-layer Transformer encoder, which generates the image retrieval feature corresponding to the retrieval image from the input feature vectors.
Further, before the feature vectors are input to the multi-layer Transformer encoder, the feature representation model generates, through the embedding unit, the embedding information corresponding to each of the feature vectors Vib_1~Vib_j, Vob_1~Vob_k, VT1_1~VT1_l, VT2_1~VT2_m, VT3_1~VT3_n, and VT4_1~VT4_o, such as token embedding information (token embedding), position embedding information (position embedding), and segment embedding information (segment embedding). For example, the token embedding information can be generated directly from the respective feature vectors. Furthermore, for example, the segment embedding information corresponding to the feature vectors Vib_1~Vib_j and Vob_1~Vob_k may be marked as "0" to indicate image feature vectors, and the segment embedding information corresponding to VT1_1~VT1_l, VT2_1~VT2_m, VT3_1~VT3_n, and VT4_1~VT4_o may be marked as "1" to indicate text feature vectors, so that the image segment and the text segment are distinguished by the segment embedding information. Furthermore, for example, the position embedding information corresponding to the feature vectors Vib_1~Vib_j and Vob_1~Vob_k may be marked as "1", and the position embedding information corresponding to VT1_1~VT1_l, VT2_1~VT2_m, VT3_1~VT3_n, and VT4_1~VT4_o may be marked in sequentially increasing order.
In this way, the image features and semantic features of the retrieval image and the retrieval text in the retrieval information can be extracted more fully, so that the semantic features of the images matched by the finally generated image retrieval feature remain consistent with the semantic features of the retrieval text.
Further, referring to what is shown on the right side of fig. 2B, the feature representation model may be trained as follows:
before training is started, an image-description data set for training is collected, part of the data set is directly from a network (visual description data set), and part of the data is manually marked. When the scale of the labeled data set is small, the weights of the pre-trained BERT can be fixed, and the weights of other parts of the model are only changed in the training process, so that the problem of under-fitting of the model can be solved.
During training, the original image is input as one branch. To enable the model to better represent the semantic information of the main subject region of the image, partial regions of the image are occluded based on the detected object regions. One possible method, following self-supervised learning, is to randomly mask (Mask) regions of the image that contain object blocks (patches), and then judge from the output whether the masked region can be matched with the object region extracted by the object detection model (i.e., whether their extracted feature representations have a high similarity); the feature representations of the original image and of the detected image regions are output through the multi-layer Transformer encoder.
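One possible reading of this masking-and-matching step is sketched below; zero-filling the masked region and testing the match with a cosine-similarity threshold are assumptions, since the text does not fix these details.

```python
import random
import torch
import torch.nn.functional as F

def mask_object_patch(image, box):
    """Zero out a randomly chosen detected object region (box = (x1, y1, x2, y2))."""
    x1, y1, x2, y2 = box
    masked = image.clone()
    masked[:, y1:y2, x1:x2] = 0.0
    return masked

def region_matches(masked_region_feat, detected_region_feat, threshold=0.5):
    # The masked region "matches" the detector's object region if the encoder's
    # feature for it stays close to the detected region's feature.
    sim = F.cosine_similarity(masked_region_feat, detected_region_feat, dim=-1)
    return sim > threshold

image = torch.randn(3, 224, 224)
boxes = [(30, 40, 120, 160), (10, 10, 60, 60)]   # detected object regions (toy values)
masked_image = mask_object_patch(image, random.choice(boxes))
```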
Referring to the right side of fig. 2B, during training, the real text description of the image is word-embedded and input into the pre-trained BERT model to obtain the feature representation of the real text. The feature representation model is then trained according to the loss function, using the image features generated by the feature representation model together with the feature representation of the real text.
The loss function used in training has three parts: first, the contrastive loss between the whole image and the detected object regions in self-supervised learning; second, the masked token loss, both of which are borrowed from Oscar; and finally, the metric loss between the extracted feature vectors.
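The text does not give the exact formulations of the three terms, so the sketch below assumes common choices: an InfoNCE-style contrastive term, a cross-entropy masked token term, and a cosine-based metric term, combined with equal weights.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_feat, region_feat, temperature=0.07):
    # InfoNCE over a batch: matching image/region pairs sit on the diagonal.
    logits = image_feat @ region_feat.t() / temperature
    targets = torch.arange(logits.size(0))
    return F.cross_entropy(logits, targets)

def masked_token_loss(token_logits, token_targets):
    # Predict the original tokens at the masked positions (as in Oscar/BERT).
    return F.cross_entropy(token_logits, token_targets)

def metric_loss(query_feat, text_feat):
    # Pull the generated image retrieval feature toward the real-text feature representation.
    return (1.0 - F.cosine_similarity(query_feat, text_feat, dim=-1)).mean()

def total_loss(image_feat, region_feat, token_logits, token_targets, query_feat, text_feat):
    return (contrastive_loss(image_feat, region_feat)
            + masked_token_loss(token_logits, token_targets)
            + metric_loss(query_feat, text_feat))
```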
Training then proceeds in the same way as for other vision-language models (VLMs), until the training loss curve converges, yielding the trained feature representation model.
Further optionally, as described above, the operation of retrieving the image matching the retrieval information according to the image retrieval feature includes: carrying out feature matching on the image retrieval features and image features which are stored in a feature retrieval library and are associated with different images; and determining an image matched with the retrieval information according to the result of the feature matching.
According to the technical scheme of the disclosure, after the feature representation model generates the image retrieval features associated with the retrieval information, the image content retrieval system matches the image retrieval features with the image features associated with each image stored in the feature retrieval library.
Specifically, the image content retrieval system may compute the similarity between the image retrieval feature associated with the retrieval information and the image features stored in the feature retrieval library one by one, and return the images whose similarity exceeds a preset threshold.
Furthermore, the feature retrieval library may be constructed, for example, as follows: each image in the image library is preprocessed according to the same preprocessing method used for image-based retrieval, then fed into the feature representation model to obtain the corresponding image feature, which is stored in the feature retrieval library; this continues until all images in the image library have been traversed once, forming a library with a one-to-one correspondence between image features and images. A sketch of this construction and of the threshold-based matching follows.
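A minimal sketch of the library construction and threshold-based matching, assuming an in-memory store, cosine similarity, and an illustrative threshold of 0.8 (a production system would more likely use an approximate-nearest-neighbor index such as Faiss):

```python
import torch
import torch.nn.functional as F

feature_library = {}                    # image id -> feature vector (one-to-one)

def build_library(image_ids, images, preprocess, feature_model):
    for image_id, image in zip(image_ids, images):
        feat = feature_model(preprocess(image))      # same preprocessing as image-based retrieval
        feature_library[image_id] = F.normalize(feat, dim=-1)

def retrieve(query_feature, threshold=0.8):
    query = F.normalize(query_feature, dim=-1)
    hits = []
    for image_id, feat in feature_library.items():
        sim = torch.dot(query, feat).item()          # cosine similarity, one by one
        if sim > threshold:
            hits.append((image_id, sim))
    return sorted(hits, key=lambda x: x[1], reverse=True)
```

Normalizing each stored feature once at insertion time keeps the per-query cost to a single dot product per image.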
Further, referring to fig. 1, according to a second aspect of the present embodiment, there is provided a storage medium. The storage medium comprises a stored program, wherein, when the program runs, any of the methods described above is performed by a processor.
Thus, according to the present embodiment, after the retrieval information is received, it is not input directly to the feature representation model; instead, text information and image information associated with the retrieval information are generated from it, and the feature representation model then generates the image features associated with the retrieval information from that text information and image information. In this way, regardless of whether the user inputs keywords and sentences or an image, the technical scheme of the disclosure generates text information and image information associated with the retrieval information input by the user, and then generates the image features associated with the retrieval information from them. The semantic features of the text and the feature representation of the image can thus be fused through an attention mechanism, establishing the semantic relation between text and image, reducing the semantic-gap problem between text and image in cross-modal retrieval, and allowing cross-modal retrieval to be realized within the same retrieval framework. Moreover, because a feature representation model based on an attention mechanism is adopted, weights can be assigned to different parts of the image according to their semantic features, enhancing the expressive capacity of the image features and reducing the interference of the apparent features of the image with retrieval based on its semantic features. This solves the technical problems in the prior art that text retrieval information and image retrieval information are difficult to retrieve within the same retrieval framework, and that when the image area occupied by the semantic subject is small, the apparent features of the image easily interfere with retrieval based on semantic features.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
Fig. 5 shows an image retrieval apparatus 500 according to the present embodiment, the apparatus 500 corresponding to the method according to the first aspect of embodiment 1. Referring to fig. 5, the apparatus 500 includes: a retrieval information receiving module 510, configured to receive retrieval information; an information determination module 520 for determining text information and image information associated with the search information; an image search feature generation module 530, configured to perform feature representation based on an attention mechanism on the text information and the image information, and generate an image search feature corresponding to the search information; and an image retrieval module 540 for retrieving an image matching the retrieval information according to the image retrieval characteristics.
Optionally, the information determining module 520 includes: the object detection submodule is used for determining an object area contained in the retrieval image and label information related to the object area through a preset object detection model under the condition that the retrieval information is the retrieval image; the image dividing submodule is used for dividing the retrieval image into a plurality of image blocks; and an information determination sub-module for regarding the tag information as text information associated with the retrieval image, and regarding the object area and the image block as image information associated with the retrieval image.
Further, the image retrieval feature generation module 530 includes: the linear mapping sub-module is used for performing linear mapping on the image block and the object area to generate an image block feature vector corresponding to the image block and an object area feature vector corresponding to the object area; the first semantic feature generation submodule is used for carrying out word embedding processing on the label information and generating a first semantic feature vector corresponding to the label information; the second semantic feature generation submodule is used for generating a second semantic feature vector corresponding to the label information according to the label information by utilizing a text feature extraction model based on an attention mechanism; and the image retrieval feature generation submodule is used for generating image retrieval features corresponding to the retrieved image according to the image block feature vector, the object region feature vector, the first semantic feature vector and the second semantic feature vector by utilizing a preset encoder, wherein the encoder comprises a plurality of feature extraction layers based on an attention mechanism.
Optionally, the information determining module 520 includes: the label extraction submodule, used for extracting the label information contained in the search text in a case where the search information is a search text; the object image determining submodule, used for determining an object image matched with the label information in a preset object image-label library, wherein object images labeled with the label information are stored in the object image-label library; and the information determination submodule, used for taking the search text and the label information extracted from the search text as text information associated with the search text, and taking the object image as image information associated with the search text.
Further, the image retrieval feature generation module 530 includes: the linear mapping submodule, used for carrying out linear mapping on the object image to generate an object image feature vector corresponding to the object image; the first semantic feature generation submodule, used for performing word embedding processing on the tag information and generating a first semantic feature vector corresponding to the tag information; the second semantic feature generation submodule, used for generating a second semantic feature vector corresponding to the tag information according to the tag information by using a text feature extraction model based on an attention mechanism; the third semantic feature generation submodule, used for carrying out word embedding processing on the retrieval text and generating a third semantic feature vector corresponding to the retrieval text; the fourth semantic feature generation submodule, used for generating a fourth semantic feature vector corresponding to the retrieval text according to the retrieval text by using a text feature extraction model based on an attention mechanism; and the image retrieval feature generation submodule, used for generating image retrieval features corresponding to the retrieval text according to the object image feature vector, the first semantic feature vector, the second semantic feature vector, the third semantic feature vector and the fourth semantic feature vector by using a preset encoder, wherein the encoder comprises a plurality of feature extraction layers based on an attention mechanism.
Optionally, the information determining module 520 includes: the object detection submodule is used for determining an object area contained in the retrieval image and label information related to the object area through a preset object detection model under the condition that the retrieval information comprises a retrieval text and a retrieval image; the image dividing submodule is used for dividing the retrieval image into a plurality of image blocks; and an information determination sub-module for regarding the tag information and the search text as text information associated with the search information, and regarding the object area and the image block as image information associated with the search information.
Further, the image retrieval feature generation module 530 includes: the linear mapping sub-module is used for performing linear mapping on the image block and the object area to generate an image block feature vector corresponding to the image block and an object area feature vector corresponding to the object area; the first semantic feature generation submodule is used for carrying out word embedding processing on the label information and generating a first semantic feature vector corresponding to the label information; the second semantic feature generation submodule is used for generating a second semantic feature vector corresponding to the label information according to the label information by utilizing a text feature extraction model based on an attention mechanism; the third semantic feature generation submodule is used for carrying out word embedding processing on the retrieval text and generating a third semantic feature vector corresponding to the retrieval text; the fourth semantic feature generation submodule is used for generating a fourth semantic feature vector corresponding to the retrieval text according to the retrieval text by using the text feature extraction model based on the attention mechanism; and the image retrieval feature generation submodule is used for generating image retrieval features corresponding to the retrieved image according to the image block feature vector, the object region feature vector, the first semantic feature vector, the second semantic feature vector, the third semantic feature vector and the fourth semantic feature vector by utilizing a preset encoder, wherein the encoder comprises a plurality of feature extraction layers based on an attention mechanism.
Thus, according to the present embodiment, after the retrieval information is received, it is not input directly to the feature representation model; instead, text information and image information associated with the retrieval information are generated from it, and the feature representation model then generates the image features associated with the retrieval information from that text information and image information. In this way, regardless of whether the user inputs keywords and sentences or an image, the technical scheme of the disclosure generates text information and image information associated with the retrieval information input by the user, and then generates the image features associated with the retrieval information from them. The semantic features of the text and the feature representation of the image can thus be fused through an attention mechanism, establishing the semantic relation between text and image, reducing the semantic-gap problem between text and image in cross-modal retrieval, and allowing cross-modal retrieval to be realized within the same retrieval framework. Moreover, because a feature representation model based on an attention mechanism is adopted, weights can be assigned to different parts of the image according to their semantic features, enhancing the expressive capacity of the image features and reducing the interference of the apparent features of the image with retrieval based on its semantic features. This solves the technical problems in the prior art that text retrieval information and image retrieval information are difficult to retrieve within the same retrieval framework, and that when the image area occupied by the semantic subject is small, the apparent features of the image easily interfere with retrieval based on semantic features.
Example 3
Fig. 6 shows an image retrieval apparatus 600 according to the present embodiment, the apparatus 600 corresponding to the method according to the first aspect of embodiment 1. Referring to fig. 6, the apparatus 600 includes: a processor; and a memory coupled to the processor for providing instructions to the processor for processing the following processing steps: receiving retrieval information; determining text information and image information associated with the retrieval information; performing feature representation based on an attention mechanism on the text information and the image information to generate image retrieval features corresponding to the retrieval information; and retrieving the image matched with the retrieval information according to the image retrieval characteristics.
Optionally, in a case that the search information is a search image, the operation of determining text information and image information associated with the search information includes: determining an object region contained in the retrieval image and label information associated with the object region through a preset object detection model; dividing a retrieval image into a plurality of image blocks; and regarding the tag information as text information associated with the retrieval image, and regarding the object area and the image block as image information associated with the retrieval image.
Further optionally, the performing an attention-based feature representation on the text information and the image information to generate an image retrieval feature corresponding to the retrieval information includes: carrying out linear mapping on the image block and the object area to generate an image block feature vector corresponding to the image block and an object area feature vector corresponding to the object area; performing word embedding processing on the tag information to generate a first semantic feature vector corresponding to the tag information; generating a second semantic feature vector corresponding to the label information according to the label information by using a text feature extraction model based on an attention mechanism; and generating image retrieval features corresponding to the retrieval image according to the image block feature vector, the object region feature vector, the first semantic feature vector and the second semantic feature vector by utilizing a preset encoder, wherein the encoder comprises a plurality of feature extraction layers based on an attention mechanism.
Optionally, in a case where the search information is a search text, the operation of determining text information and image information associated with the search information includes: extracting label information contained in the search text; determining an object image matched with the label information in a preset object image-label library, wherein the object image-label library stores object images labeled with the label information; and taking the search text and the label information extracted from the search text as text information associated with the search text, and taking the object image as image information associated with the search text.
Further optionally, the performing an attention-based feature representation on the text information and the image information to generate an image retrieval feature corresponding to the retrieval information includes: carrying out linear mapping on the object image to generate an object image characteristic vector corresponding to the object image; performing word embedding processing on the tag information to generate a first semantic feature vector corresponding to the tag information; generating a second semantic feature vector corresponding to the label information according to the label information by using a text feature extraction model based on an attention mechanism; performing word embedding processing on the search text to generate a third semantic feature vector corresponding to the search text; generating a fourth semantic feature vector corresponding to the retrieval text according to the retrieval text by using a text feature extraction model based on an attention mechanism; and generating image retrieval features corresponding to the retrieval text according to the object image feature vector, the first semantic feature vector, the second semantic feature vector, the third semantic feature vector and the fourth semantic feature vector by utilizing a preset encoder, wherein the encoder comprises a plurality of feature extraction layers based on an attention mechanism.
Alternatively, in a case where the retrieval information includes a retrieval text and a retrieval image, the operation of determining text information and image information associated with the retrieval information includes: determining an object region contained in the retrieval image and label information associated with the object region through a preset object detection model; dividing a retrieval image into a plurality of image blocks; and regarding the tag information and the search text as text information associated with the search information, and regarding the object area and the image block as image information associated with the search information.
Further optionally, the performing an attention-based feature representation on the text information and the image information to generate an image retrieval feature corresponding to the retrieval information includes: carrying out linear mapping on the image block and the object area to generate an image block feature vector corresponding to the image block and an object area feature vector corresponding to the object area; performing word embedding processing on the tag information to generate a first semantic feature vector corresponding to the tag information; generating a second semantic feature vector corresponding to the label information according to the label information by using a text feature extraction model based on an attention mechanism; performing word embedding processing on the retrieval text to generate a third semantic feature vector corresponding to the retrieval text; generating a fourth semantic feature vector corresponding to the retrieval text according to the retrieval text by using a text feature extraction model based on an attention mechanism; and generating image retrieval features corresponding to the retrieved image according to the image block feature vector, the object region feature vector, the first semantic feature vector, the second semantic feature vector, the third semantic feature vector and the fourth semantic feature vector by using a preset encoder, wherein the encoder comprises a plurality of feature extraction layers based on an attention mechanism.
Thus, according to the present embodiment, after the retrieval information is received, it is not input directly to the feature representation model; instead, text information and image information associated with the retrieval information are generated from it, and the feature representation model then generates the image features associated with the retrieval information from that text information and image information. In this way, regardless of whether the user inputs keywords and sentences or an image, the technical scheme of the disclosure generates text information and image information associated with the retrieval information input by the user, and then generates the image features associated with the retrieval information from them. The semantic features of the text and the feature representation of the image can thus be fused through an attention mechanism, establishing the semantic relation between text and image, reducing the semantic-gap problem between text and image in cross-modal retrieval, and allowing cross-modal retrieval to be realized within the same retrieval framework. Moreover, because a feature representation model based on an attention mechanism is adopted, weights can be assigned to different parts of the image according to their semantic features, enhancing the expressive capacity of the image features and reducing the interference of the apparent features of the image with retrieval based on its semantic features. This solves the technical problems in the prior art that text retrieval information and image retrieval information are difficult to retrieve within the same retrieval framework, and that when the image area occupied by the semantic subject is small, the apparent features of the image easily interfere with retrieval based on semantic features.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, a division of a unit is merely a division of a logic function, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and refinements can be made without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. An image retrieval method, comprising:
receiving retrieval information;
determining text information and image information associated with the retrieval information;
performing feature representation based on an attention mechanism on the text information and the image information, and generating an image retrieval feature corresponding to the retrieval information; and
and retrieving the image matched with the retrieval information according to the image retrieval characteristics.
2. The method according to claim 1, wherein, in a case where the search information is a search image, the operation of determining text information and image information associated with the search information includes:
determining an object region contained in the retrieval image and label information associated with the object region through a preset object detection model;
dividing the retrieval image into a plurality of image blocks; and
the tag information is taken as text information associated with the retrieval image, and the object area and the image block are taken as image information associated with the retrieval image.
3. The method according to claim 2, wherein performing an attention-based feature representation on the text information and the image information to generate an image retrieval feature corresponding to the retrieval information includes:
performing linear mapping on the image block and the object region to generate an image block feature vector corresponding to the image block and an object region feature vector corresponding to the object region;
performing word embedding processing on the tag information to generate a first semantic feature vector corresponding to the tag information;
generating a second semantic feature vector corresponding to the label information according to the label information by using a text feature extraction model based on an attention mechanism; and
and generating image retrieval features corresponding to the retrieval image according to the image block feature vector, the object region feature vector, the first semantic feature vector and the second semantic feature vector by utilizing a preset encoder, wherein the encoder comprises a plurality of feature extraction layers based on an attention mechanism.
4. The method according to claim 1, wherein, in a case where the search information is a search text, the operation of determining text information and image information associated with the search information includes:
extracting label information contained in the retrieval text;
determining an object image matched with the label information in a preset object image-label library, wherein the object image-label library stores the object image marked with the label information; and
the search text and the tag information are taken as text information associated with the search text, and the object image is taken as image information associated with the search text.
5. The method according to claim 4, wherein performing an attention-based feature representation on the text information and the image information to generate an image retrieval feature corresponding to the retrieval information comprises:
performing linear mapping on the object image to generate an object image feature vector corresponding to the object image;
performing word embedding processing on the tag information to generate a first semantic feature vector corresponding to the tag information;
generating a second semantic feature vector corresponding to the label information according to the label information by using a text feature extraction model based on an attention mechanism;
performing word embedding processing on the retrieval text to generate a third semantic feature vector corresponding to the retrieval text;
generating a fourth semantic feature vector corresponding to the retrieval text according to the retrieval text by utilizing a text feature extraction model based on an attention mechanism; and
and generating image retrieval features corresponding to the retrieval text according to the object image feature vector, the first semantic feature vector, the second semantic feature vector, the third semantic feature vector and the fourth semantic feature vector by utilizing a preset encoder, wherein the encoder comprises a plurality of feature extraction layers based on an attention mechanism.
6. The method according to claim 1, wherein in the case where the search information includes a search text and a search image, the operation of determining text information and image information associated with the search information includes:
determining an object region contained in the retrieval image and label information associated with the object region through a preset object detection model;
dividing the retrieval image into a plurality of image blocks; and
the tag information and the search text are taken as text information associated with the search information, and the object area and the image block are taken as image information associated with the search information.
7. The method according to claim 6, wherein performing an attention-based feature representation on the text information and the image information to generate an image retrieval feature corresponding to the retrieval information comprises:
performing linear mapping on the image block and the object region to generate an image block feature vector corresponding to the image block and an object region feature vector corresponding to the object region;
performing word embedding processing on the tag information to generate a first semantic feature vector corresponding to the tag information;
generating a second semantic feature vector corresponding to the label information according to the label information by using a text feature extraction model based on an attention mechanism;
performing word embedding processing on the retrieval text to generate a third semantic feature vector corresponding to the retrieval text;
generating a fourth semantic feature vector corresponding to the retrieval text according to the retrieval text by utilizing a text feature extraction model based on an attention mechanism; and
and generating image retrieval features corresponding to the retrieval image according to the image block feature vector, the object region feature vector, the first semantic feature vector, the second semantic feature vector, the third semantic feature vector and the fourth semantic feature vector by using a preset encoder, wherein the encoder comprises a plurality of feature extraction layers based on an attention mechanism.
8. A storage medium comprising a stored program, wherein the method of any one of claims 1 to 7 is performed by a processor when the program is run.
9. An image retrieval apparatus, comprising:
the retrieval information receiving module is used for receiving retrieval information;
the information determining module is used for determining text information and image information which are associated with the retrieval information;
the image retrieval feature generation module is used for performing feature representation based on an attention mechanism on the text information and the image information and generating image retrieval features corresponding to the retrieval information; and
and the image retrieval module is used for retrieving the image matched with the retrieval information according to the image retrieval characteristics.
10. An image retrieval apparatus, comprising:
a processor; and
a memory coupled to the processor for providing instructions to the processor for processing the following processing steps:
receiving retrieval information;
determining text information and image information associated with the retrieval information;
performing feature representation based on an attention mechanism on the text information and the image information, and generating an image retrieval feature corresponding to the retrieval information; and
and retrieving the image matched with the retrieval information according to the image retrieval characteristics.
CN202210781107.9A 2022-07-05 2022-07-05 Image retrieval method, image retrieval device and storage medium Active CN114880514B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210781107.9A CN114880514B (en) 2022-07-05 2022-07-05 Image retrieval method, image retrieval device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210781107.9A CN114880514B (en) 2022-07-05 2022-07-05 Image retrieval method, image retrieval device and storage medium

Publications (2)

Publication Number Publication Date
CN114880514A true CN114880514A (en) 2022-08-09
CN114880514B CN114880514B (en) 2022-11-01

Family

ID=82682730

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210781107.9A Active CN114880514B (en) 2022-07-05 2022-07-05 Image retrieval method, image retrieval device and storage medium

Country Status (1)

Country Link
CN (1) CN114880514B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120179704A1 (en) * 2009-09-16 2012-07-12 Nanyang Technological University Textual query based multimedia retrieval system
CN105243060A (en) * 2014-05-30 2016-01-13 小米科技有限责任公司 Picture retrieval method and apparatus
US20210256365A1 (en) * 2017-04-10 2021-08-19 Peking University Shenzhen Graduate School Cross-media retrieval method based on deep semantic space
CN111753189A (en) * 2020-05-29 2020-10-09 中山大学 Common characterization learning method for few-sample cross-modal Hash retrieval
CN112579816A (en) * 2020-12-29 2021-03-30 二十一世纪空间技术应用股份有限公司 Remote sensing image retrieval method and device, electronic equipment and storage medium
CN113779361A (en) * 2021-08-27 2021-12-10 华中科技大学 Construction method and application of cross-modal retrieval model based on multi-layer attention mechanism
CN114077682A (en) * 2022-01-19 2022-02-22 广州拟实网络科技有限公司 Intelligent recognition matching processing method and system for image retrieval and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089648A (en) * 2023-04-10 2023-05-09 云南远信科技有限公司 File management system and method based on artificial intelligence
CN116089648B (en) * 2023-04-10 2023-06-06 云南远信科技有限公司 File management system and method based on artificial intelligence
CN117493608A (en) * 2023-12-26 2024-02-02 西安邮电大学 Text video retrieval method, system and computer storage medium
CN117493608B (en) * 2023-12-26 2024-04-12 西安邮电大学 Text video retrieval method, system and computer storage medium

Also Published As

Publication number Publication date
CN114880514B (en) 2022-11-01

Similar Documents

Publication Publication Date Title
CN111062871B (en) Image processing method and device, computer equipment and readable storage medium
CN114880514B (en) Image retrieval method, image retrieval device and storage medium
CN106980868A (en) Embedded space for the image with multiple text labels
CN114155543A (en) Neural network training method, document image understanding method, device and equipment
CN105989067B (en) Method, user equipment and the training server of text snippet are generated from picture
CN110852106A (en) Named entity processing method and device based on artificial intelligence and electronic equipment
CN115438215B (en) Image-text bidirectional search and matching model training method, device, equipment and medium
CN114332680A (en) Image processing method, video searching method, image processing device, video searching device, computer equipment and storage medium
CN111639228B (en) Video retrieval method, device, equipment and storage medium
CN116601626A (en) Personal knowledge graph construction method and device and related equipment
WO2024045474A1 (en) Image copywriting generation method, device, and computer storage medium
CN118132752B (en) Commodity description word classification method and device
CN114996511A (en) Training method and device for cross-modal video retrieval model
CN115659008A (en) Information pushing system and method for big data information feedback, electronic device and medium
CN113283432A (en) Image recognition and character sorting method and equipment
CN117271818B (en) Visual question-answering method, system, electronic equipment and storage medium
CN118113901A (en) Multi-mode large language model training method, correlation calculation and label generation method
CN116541556A (en) Label determining method, device, equipment and storage medium
CN115049950A (en) Video processing method and device
CN115687701A (en) Text processing method
CN114691853A (en) Sentence recommendation method, device and equipment and computer readable storage medium
CN114117110A (en) Commodity data processing method and device, storage medium and processor
CN111506754A (en) Picture retrieval method and device, storage medium and processor
Bastida et al. Multimodal object recognition using deep learning representations extracted from images and smartphone sensors
CN118227910B (en) Media resource aggregation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant