CN116975350A - Image-text retrieval method, device, equipment and storage medium


Info

Publication number
CN116975350A
Authority
CN
China
Prior art keywords
image
feature
text
features
fusion
Prior art date
Legal status
Pending
Application number
CN202310446478.6A
Other languages
Chinese (zh)
Inventor
胡煜松
高雨婷
李珂
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority claimed from CN202310446478.6A
Publication of CN116975350A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/24 Aligning, centring, orientation detection or correction of the image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an image-text retrieval method, apparatus, device and storage medium, which relate to the technical field of artificial intelligence and can be applied to map scenarios such as navigation and positioning, map search, and scene recognition.

Description

Image-text retrieval method, device, equipment and storage medium
Technical Field
The application relates to the technical field of computers, in particular to the technical field of artificial intelligence (AI), and provides an image-text retrieval method, apparatus, device, and storage medium.
Background
In the field of AI, vision and language are the two most important modalities through which AI understands the outside world. Image-text matching technology has been developed to meet the needs of daily life and work: it can predict text from an image or predict an image from text, and this convenience has made its application increasingly wide. For example, image-text matching technology can be applied to scenarios such as visual question answering, image-text retrieval, image caption generation, and visual commonsense reasoning.
The image-text retrieval process is a cross-modal interaction between image and text information. The image and the text are feature-encoded by two single-modality encoders respectively, and the resulting image features and text features are input into a multi-modality encoder for feature fusion, so as to determine the matching degree between the image and the text; by comparing the matching degrees of the individual images with the text, the image that best matches the text is retrieved from the images.
However, in the related art, the global information of the image is what is mainly considered when determining the matching degree between the image and the text, so the image feature output by the last layer of the single-modality encoder is generally used for fusion with the text feature. As a result, local feature information in the image is lost, the image feature cannot accurately express the image information, the interaction between image and text features is impaired, and the accuracy of the image-text retrieval task is reduced.
Disclosure of Invention
The embodiment of the application provides an image-text retrieval method, apparatus, device and storage medium, which are used to improve the accuracy of image-text retrieval.
In one aspect, an image-text retrieval method is provided, the method comprising:
performing feature encoding, feature dimension by feature dimension, on each target image in a target image set to be retrieved, to obtain an image feature set corresponding to each target image, wherein each image feature in the image feature set corresponds to one feature dimension;
performing alignment processing on the image feature obtained last in each image feature set and the text feature of a target text to be retrieved, to obtain the similarity between each such image feature and the text feature;
determining at least one candidate image from the target image set based on the obtained similarities;
selecting a plurality of target feature dimensions from the feature dimensions corresponding to an image feature set according to a preset image feature dimension selection strategy, and performing image feature fusion on the image features of the plurality of target feature dimensions for the at least one candidate image, to obtain an image fusion feature corresponding to each of the at least one candidate image;
mapping the obtained at least one image fusion feature and the text feature into the same feature space, to obtain at least one mapped image feature and a mapped text feature in the same feature space, and performing cross-modal feature fusion on each mapped image feature and the mapped text feature, to obtain a multi-modal fusion feature corresponding to each of the at least one candidate image;
and obtaining the matching degree between each of the at least one candidate image and the target text based on the obtained multi-modal fusion features, and determining, from the at least one candidate image, the image that matches the target text based on the obtained matching degrees. A sketch of this retrieval flow is given below.
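The following is a minimal sketch of the retrieval flow summarized above, written in PyTorch-style Python. The encoder interfaces (return_all_layers, fusion_head), the choice of cosine similarity, the number of candidates and the selected target feature dimensions are all illustrative assumptions rather than part of the claimed method.

```python
# Illustrative sketch of the retrieval flow (assumed PyTorch-style interfaces).
import torch
import torch.nn.functional as F

def retrieve(image_encoder, text_encoder, fusion_head, images, text,
             target_dims=(-4, -3, -2, -1), top_k=16):
    # 1. Dimension-by-dimension feature encoding: one image feature per encoder layer.
    layer_feats = [image_encoder(img, return_all_layers=True) for img in images]  # each [L, D]

    # 2. Align the last-obtained image feature of each image with the text feature.
    text_feat = text_encoder(text)                                   # [D]
    last_feats = torch.stack([f[-1] for f in layer_feats])           # [N, D]
    sims = F.cosine_similarity(last_feats, text_feat.unsqueeze(0), dim=-1)

    # 3. Keep the most similar images as candidate images.
    cand_idx = sims.topk(min(top_k, len(images))).indices

    scores = []
    for i in cand_idx.tolist():
        # 4. Fuse the image features of the selected target feature dimensions.
        img_fusion = torch.stack([layer_feats[i][d] for d in target_dims]).mean(dim=0)
        # 5./6. Map both modalities into one space, fuse them, and score the match
        #       (fusion_head is assumed to return a scalar matching degree).
        scores.append(float(fusion_head(img_fusion, text_feat)))
    best = cand_idx[torch.tensor(scores).argmax()]
    return int(best)   # index of the image matching the target text
```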
In one aspect, there is provided an image-text retrieval apparatus, the apparatus comprising:
the feature encoding module is configured to perform feature encoding, feature dimension by feature dimension, on each target image in a target image set to be retrieved, to obtain an image feature set corresponding to each target image, wherein each image feature in the image feature set corresponds to one feature dimension;
the image-text alignment module is configured to perform alignment processing on the image feature obtained last in each image feature set and the text feature of a target text to be retrieved, to obtain the similarity between each such image feature and the text feature;
the candidate determining module is configured to determine at least one candidate image from the target image set based on the obtained similarities;
the image fusion module is configured to select a plurality of target feature dimensions from the feature dimensions corresponding to an image feature set according to a preset image feature dimension selection strategy, and to perform image feature fusion on the image features of the plurality of target feature dimensions for the at least one candidate image, to obtain an image fusion feature corresponding to each of the at least one candidate image;
the image-text fusion module is configured to map the obtained at least one image fusion feature and the text feature into the same feature space, to obtain at least one mapped image feature and a mapped text feature in the same feature space, and to perform cross-modal feature fusion on each mapped image feature and the mapped text feature, to obtain a multi-modal fusion feature corresponding to each of the at least one candidate image;
and the matching output module is configured to obtain the matching degree between each of the at least one candidate image and the target text based on the obtained multi-modal fusion features, and to determine, from the at least one candidate image, the image that matches the target text based on the obtained matching degrees.
Optionally, the image fusion module is specifically configured to:
for the at least one candidate image, performing the following operations respectively:
for one candidate image, weighting the image features of the candidate image at the plurality of target feature dimensions based on the weight values corresponding to the plurality of target feature dimensions, to obtain the corresponding image fusion feature;
wherein each weight value is determined based on the number of the plurality of target feature dimensions; alternatively, each weight value is determined based on the importance of the corresponding target feature dimension to the image fusion feature (see the weighting sketch after this list).
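A minimal sketch of the two weighting alternatives described above is given below; the uniform weights and the softmax-normalised importance weights are assumptions used for illustration.

```python
import torch

def fuse_image_features(dim_feats, weights=None):
    """dim_feats: [K, D] image features of the K selected target feature dimensions."""
    k = dim_feats.shape[0]
    if weights is None:
        # Alternative 1: each weight is determined by the number of target feature dimensions.
        weights = torch.full((k,), 1.0 / k)
    else:
        # Alternative 2: weights reflect the importance of each dimension (e.g. learned),
        # normalised here so that they sum to one.
        weights = torch.softmax(weights, dim=0)
    return (weights.unsqueeze(1) * dim_feats).sum(dim=0)   # [D] image fusion feature
```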
Optionally, the image-text fusion module is specifically configured to:
for the at least one mapped image feature, performing any one of the following operations respectively:
for one mapped image feature, performing vector splicing on the mapped image feature and the mapped text feature, to obtain the corresponding multi-modal fusion feature;
for one mapped image feature, performing pooling on the mapped image feature and the mapped text feature, to obtain the corresponding multi-modal fusion feature;
for one mapped image feature, performing a convolution operation on the mapped image feature and the mapped text feature, to obtain the corresponding multi-modal fusion feature;
and, for one mapped image feature, performing mapping processing on the mapped image feature and the mapped text feature, to obtain the corresponding multi-modal fusion feature. A sketch of two of these fusion options follows this list.
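The snippet below sketches two of the fusion options listed above: vector splicing followed by a linear mapping, and an element-wise pooling variant. The module name ConcatFusion and the linear projection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """One possible multi-modal fusion: vector splicing followed by a linear mapping."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, mapped_img: torch.Tensor, mapped_txt: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([mapped_img, mapped_txt], dim=-1)   # splicing along the feature axis
        return self.proj(fused)                               # multi-modal fusion feature

# Pooling-style alternative: element-wise maximum over the two mapped features.
def max_pool_fusion(mapped_img: torch.Tensor, mapped_txt: torch.Tensor) -> torch.Tensor:
    return torch.maximum(mapped_img, mapped_txt)
```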
Optionally, the apparatus further comprises a model training unit for:
obtaining an image-text sample set, wherein each training image-text sample in the image-text sample set comprises a sample image, a sample text and a real matching degree between the sample image and the sample text;
performing multiple rounds of iterative training on the image-text matching model based on the image-text sample set until convergence conditions are met; wherein each round of iterative training process comprises the following steps:
determining the prediction matching degree between the sample image and the sample text included in each input training image text sample by adopting an image-text matching model used in the round;
and determining a model loss value of the image-text matching model based on the obtained prediction matching degree and the corresponding real matching degree, and carrying out parameter adjustment on the image-text matching model based on the model loss value.
Optionally, the model training unit is specifically configured to:
for each training image-text sample, the following processing is carried out respectively:
performing feature encoding, feature dimension by feature dimension, on each input sample image and each sample text;
performing alignment processing on the sample image feature obtained last and the sample text feature obtained last during the dimension-by-dimension feature encoding, to obtain the similarity between the sample image feature and the sample text feature;
determining an image-text alignment loss value based on the similarity;
determining an image-text matching loss value based on the obtained prediction matching degrees and the corresponding real matching degrees;
and obtaining the model loss value based on the image-text alignment loss value and the image-text matching loss value, as sketched in the example below.
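A sketch of how the image-text alignment loss and the image-text matching loss might be combined into the model loss value follows; the InfoNCE-style contrastive form of the alignment loss and the binary cross-entropy form of the matching loss are assumptions, since the embodiment leaves the concrete loss functions open.

```python
import torch
import torch.nn.functional as F

def model_loss(sim_matrix, pred_match, true_match, alpha=1.0):
    """sim_matrix: [B, B] similarities between the last image features and the text features,
    where image i and text i form a matched pair; pred_match/true_match: [B] matching degrees,
    with pred_match assumed to lie in [0, 1]."""
    targets = torch.arange(sim_matrix.size(0), device=sim_matrix.device)
    # Image-text alignment loss: a symmetric InfoNCE-style contrastive loss (assumed form,
    # temperature scaling omitted for brevity).
    align_loss = 0.5 * (F.cross_entropy(sim_matrix, targets) +
                        F.cross_entropy(sim_matrix.t(), targets))
    # Image-text matching loss: binary cross-entropy against the real matching degree.
    match_loss = F.binary_cross_entropy(pred_match, true_match.float())
    return align_loss + alpha * match_loss
```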
Optionally, the model training unit is specifically configured to:
respectively carrying out data enhancement processing on each input sample image to obtain enhancement image pairs corresponding to each sample image;
determining an image symmetry loss value based on a similarity between two enhanced images included in each of the respective enhanced image pairs;
determining an image-text matching loss value based on the obtained prediction matching degrees and the corresponding real matching degrees;
and obtaining the model loss value based on the image symmetry loss value and the image-text matching loss value.
Optionally, the model training unit is specifically configured to:
for each enhanced image pair, the following processing is performed respectively:
for one enhanced image pair, respectively carrying out feature coding processing on two enhanced images included in the enhanced image pair to obtain enhanced image features corresponding to the two enhanced images;
determining a sample symmetry loss value of the enhanced image pair based on the similarity between the enhanced image features corresponding to each of the two enhanced images;
and determining the image symmetry loss value based on the obtained sample symmetry loss values.
In one aspect, a computer device is provided comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods described above when the computer program is executed.
In one aspect, there is provided a computer storage medium having stored thereon computer program instructions which, when executed by a processor, perform the steps of any of the methods described above.
In one aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the steps of any of the methods described above.
In the embodiment of the application, feature encoding is performed, feature dimension by feature dimension, on each image to be retrieved, to obtain the image feature of each image at each feature dimension; the image feature obtained by the last feature encoding of each image is aligned with the text feature of the target text to be retrieved, to obtain the similarity between each such image feature and the text feature; and candidate images are determined from the images to be retrieved according to these similarities. In this way, by aligning the last-obtained image feature with the text feature, a subset of candidate images is preliminarily screened from a large number of images to be retrieved for image-text matching, which avoids fusing image features of multiple feature dimensions for every image to be retrieved and thus improves retrieval efficiency. Then, several target feature dimensions are selected according to a preset feature dimension selection strategy, and the image features of each candidate image at these target feature dimensions are fused to obtain the image fusion feature of each candidate image for fusion with the text feature. Candidate images are thus determined quickly from the last-layer image features, while, considering that image features of different feature dimensions represent image information of different dimensions and depths, fusing image features of multiple feature dimensions makes full use of information from different layers of the image, avoids losing part of the local information of the image, gives the image fusion feature finer granularity, and promotes the interaction of image and text features across multiple levels of information, thereby improving the accuracy of image-text retrieval. Finally, the image fusion feature and the text feature are mapped into the same feature space so that their feature dimensions are consistent, and cross-modal feature fusion is then performed on them. Compared with direct fusion, this avoids the loss and confusion of image and text information caused by directly fusing image and text features of different dimensions, so the cross-modal fusion feature expresses the semantic information of the image and the text more accurately, which further promotes the interaction of image and text features and improves the accuracy of image-text retrieval.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an image-text matching model according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a training process of an image-text matching model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a self-supervised learning process of an image modality according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a training task of an image-text matching model according to an embodiment of the present application;
Fig. 6 is a schematic flow chart of an image-text retrieval method according to an embodiment of the present application;
fig. 7 is a schematic view of a scenario of an image-text retrieval task according to an embodiment of the present application;
FIG. 8 is a schematic diagram of performance of the image-text matching model and other models according to the embodiment of the present application;
fig. 9 is a schematic structural diagram of an image-text retrieval device according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application. Embodiments of the application and features of the embodiments may be combined with one another arbitrarily without conflict. Also, while a logical order is depicted in the flowchart, in some cases, the steps depicted or described may be performed in a different order than presented herein.
In order to facilitate understanding of the technical solution provided by the embodiments of the present application, some key terms used in the embodiments of the present application are explained here:
modality (modality): in the embodiment of the application, the source and the form of the information are various, and the data form of the AI for understanding each information in the outside world can be called a mode, including a visual (image) mode, a language (text) mode and the like.
Masked Language Model (MLM): a pre-training model for natural language processing. It uses a large number of unlabeled sample texts during pre-training and breaks the original text information by masking the vocabulary at set positions in the sample text, so that the model performs text reconstruction of the sample text and gathers information from the context to predict the masked vocabulary as closely as possible to the original vocabulary. The MLM task is typically implemented together with the multi-head self-attention mechanism in the Transformer model, which automatically learns a representation of each word from the input sequence and then computes similarity scores between words from these representations to predict the masked symbols. This method not only performs excellently in natural language processing but is also widely applied to image-text matching tasks.
Vision Transformer (ViT) model: an image classification model based on the self-attention mechanism. The ViT model treats an image as a sequence: the input image is split into many small image patches, the patches are treated as a sequence, and the sequence is then processed by multiple layers of self-attention, finally producing the classification result of the image.
Specifically, the ViT model includes an embedding layer, multiple Transformer encoder layers, and a classification head. The embedding layer maps the input image patches into vectors of fixed dimension, which are then fed into the Transformer encoder. Each Transformer encoder layer includes a multi-head self-attention module and a fully connected feed-forward network module. The multi-head self-attention module computes attention over every position in the sequence, letting information at different positions interact and integrate, and the result is then processed by the fully connected feed-forward network module. The stacked Transformer encoder layers progressively extract the features of the image patches and integrate them into one global feature. Finally, the global feature is fed into the classification head for classification. Compared with traditional convolutional neural networks, the ViT model avoids the loss of image information caused by pooling layers and has better scalability and generalization performance; it exhibits excellent performance on many image classification tasks.
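To make the structure described above concrete, the following simplified sketch implements a small ViT-style model: patch embedding, stacked Transformer encoder layers whose per-layer [CLS] outputs can serve as image features of different feature dimensions, and a classification head. All hyper-parameters are illustrative, and this is not the exact encoder used by the embodiment.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Simplified ViT: patch embedding + Transformer encoder layers + classification head."""
    def __init__(self, img_size=224, patch=16, dim=384, depth=6, heads=6, num_classes=1000):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x, return_all_layers=False):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)            # [B, N, dim] patch sequence
        x = torch.cat([self.cls_token.expand(x.size(0), -1, -1), x], dim=1) + self.pos_embed
        per_layer = []
        for layer in self.layers:
            x = layer(x)
            per_layer.append(x[:, 0])       # [CLS] feature of this layer (one feature dimension)
        logits = self.head(per_layer[-1])   # classification head on the last, global feature
        return per_layer if return_all_layers else logits
```

Collecting the per-layer [CLS] outputs is what later allows image features of several feature dimensions to be fused, instead of using only the last-layer global feature.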
Vision-Language Pre-training (VLP): by designing a network structure and training method and using large amounts of image-text information collected from the web, public datasets, and so on, a model is pre-trained at large scale so that it learns the semantic relation between the image and language modalities and can be applied to various image-text tasks.
Embodiments of the present application relate to artificial intelligence and Machine Learning (ML) techniques, designed primarily based on Machine Learning in artificial intelligence.
Artificial intelligence is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Computer Vision (CV) is a science that studies how to make machines "see"; it refers to using cameras and computers instead of human eyes to recognize and measure targets and to perform further graphic processing, so that the result is an image more suitable for human observation or for transmission to instruments for detection. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, that is, the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
Machine learning is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and so on. It studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.
An artificial neural network (ANN) abstracts the neural network of the human brain from the viewpoint of information processing, builds a simple model, and forms different networks according to different connection modes. A neural network is a computational model composed of a large number of interconnected nodes (neurons); each node represents a specific output function called an activation function, and each connection between two nodes carries a weight for the signal passing through it, which is equivalent to the memory of the artificial neural network. The output of the network differs according to its connection mode, weight values, and activation functions, and the network itself is usually an approximation of some algorithm or function, or an expression of a logical strategy.
The embodiment of the application adopts an artificial neural network model based on deep learning to match images and texts. Image-text matching in the embodiment of the application can be divided into two parts: a training part and an application part. The training part involves machine learning: an artificial neural network model (namely the image-text matching model mentioned later) is trained with machine learning techniques based on the image-text matching model training method provided by the embodiment of the application, and the model parameters are continuously adjusted through an optimization algorithm until the image-text matching model converges. The application part uses the image-text matching model obtained by the training part to determine the matching degree between an input image to be matched and a text, and so on.
In addition, it should be noted that the artificial neural network model in the embodiment of the present application may be on-line training or off-line training, which is not specifically limited herein, and is illustrated herein by taking off-line training as an example.
The following briefly describes the design concept of the embodiment of the present application:
Vision and language are the two most important modalities through which AI understands the outside world, so vision-language models that can learn the semantic relation between the visual and language modalities have received wide attention. Image-text pre-training methods for such vision-language models usually model the text and image information first and then fuse the image features with the text features.
In the related art, cross-modal fusion of image and text features is generally realized with either a single-stream or a dual-stream network architecture. The single-stream architecture concatenates the image and text information and feeds them into one large encoder to fuse them, but the huge encoder and the concatenated input lead to high-dimensional computation, which greatly limits fusion efficiency. The dual-stream architecture encodes the image and the text with two single-modality encoders and fuses them with a separately designed mechanism; although its structure is relatively simple, its fusion performance is poor, it needs a large amount of image-text data for optimization, and it cannot handle tasks such as predicting text from an image, visual question answering, and image caption generation.
To solve these problems, the related art proposes an image-text pre-training method based on a combined single-stream and dual-stream network architecture: two single-modality encoders perform feature encoding and feature alignment on the image and the text respectively, and the aligned image and text features are input into a multi-modality encoder for cross-modal feature fusion, which improves the fusion performance of image and text features while reducing computation and improving fusion efficiency.
However, in current image-text pre-training methods based on such a combined single-stream and dual-stream architecture, the image feature output by the last layer of the single-modality encoder is usually used for both feature alignment and feature fusion with the text feature. Unlike image-text alignment, which mainly needs the global information of the image, fusing only the last-layer global image feature with the text feature may lose local feature information in the image, so the image feature cannot accurately express the image information, the interaction between image and text features is impaired, and it is difficult to train the model to a good effect.
In view of this, an embodiment of the present application provides an image-text retrieval method. Feature encoding is performed, feature dimension by feature dimension, on each image to be retrieved, to obtain the image feature of each image at each feature dimension; the image feature obtained by the last feature encoding of each image is aligned with the text feature of the target text to be retrieved, to obtain the similarity between each such image feature and the text feature; and candidate images are determined from the images to be retrieved according to these similarities. In this way, by aligning the last-obtained image feature with the text feature, a subset of candidate images is preliminarily screened from a large number of images to be retrieved for image-text matching, which avoids fusing image features of multiple feature dimensions for every image to be retrieved and thus improves retrieval efficiency. Then, several target feature dimensions are selected according to a preset feature dimension selection strategy, and the image features of each candidate image at these target feature dimensions are fused to obtain the image fusion feature of each candidate image for fusion with the text feature. Since image features of different feature dimensions represent image information of different dimensions and depths, including but not limited to textures, colors and shapes, the image fusion feature obtained by fusing multiple feature dimensions makes full use of information from different layers of the image, avoids losing part of the local information of the image, has finer granularity, and promotes the interaction of image and text features across multiple levels of information, thereby improving the accuracy of image-text retrieval. Furthermore, the image fusion feature and the text feature are mapped into the same feature space so that their feature dimensions are consistent, and cross-modal feature fusion is then performed on them. Compared with direct fusion, this avoids the loss and confusion of image and text information caused by directly fusing image and text features of different dimensions, so the cross-modal fusion feature expresses the semantic information of the image and the text more accurately, which further promotes the interaction of image and text features and improves the accuracy of image-text retrieval.
Further, to improve the modeling of the image modality, when training the image-text matching model, the embodiment of the application applies two different data enhancements to the same sample image, determines an image symmetry loss from the similarity between the two different enhanced images after alignment, and uses it to adjust the parameters of the image-text matching model, so that the image modeling performance of the model is improved and image features with better expressive power are obtained.
In addition, directly training the model end to end makes it difficult to converge. The image-text matching model can therefore be trained with multi-task learning, which further enriches the feature expression of the model, captures the correlations and latent structures between data of different modalities, and improves generality across different downstream tasks. Accordingly, in the embodiment of the application, the pre-training stage uses sample data of different modalities for feature learning of the corresponding modality, which reduces the difficulty of model convergence.
After the design idea of the embodiment of the present application is introduced, some simple descriptions are made below for application scenarios applicable to the technical solution of the embodiment of the present application, and it should be noted that the application scenarios described below are only used for illustrating the embodiment of the present application and are not limiting. In the specific implementation process, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
The solution provided by the embodiment of the application can be applied to scenarios related to image-text retrieval, including but not limited to visual question answering, image caption generation, and visual commonsense reasoning. As shown in fig. 1, a schematic view of an application scenario provided in an embodiment of the present application may include a terminal device 101 and a server 102.
The terminal device 101 is an electronic device used by a user, including, but not limited to, any device capable of being connected to a server, such as a mobile phone, a tablet personal computer (PAD), a notebook computer, a desktop computer, a smart television, an intelligent vehicle-mounted device, an aircraft, and an intelligent wearable device, and providing a local service for the user. The terminal device 101 may be provided with an application related to an image-text retrieval service, including but not limited to an application having an image-text retrieval function or an image-text matching model training function, where the application related to the embodiment of the present application may be a software client, or may be a client such as a web page or an applet.
The server 102 may be a background server corresponding to an application installed on the terminal device 101 and provides corresponding services for the target application, including but not limited to returning the matching image for an image-text retrieval service. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data, and artificial intelligence platforms.
The server 102 can include one or more processors 1021, memory 1022, and I/O interfaces 1023 for interaction with terminals, etc. In addition, the server 102 may further configure a database 1024, where the database 1024 may be used to store data information of image modalities, text modalities, trained model parameters, and the like. The memory 1022 of the server 102 may further store program instructions of the image-text retrieval method provided by the embodiment of the present application, where the program instructions, when executed by the processor 1021, can be used to implement the steps of the image-text retrieval method provided by the embodiment of the present application, so as to obtain the matching degree between the image to be retrieved and the text.
The image-text retrieval method and the training method of the image-text matching model in the embodiment of the application may be executed by the terminal device 101 alone, by the server 102 alone, or by the terminal device 101 and the server 102 in cooperation.
In one possible implementation manner, when the training method of the image-text matching model is executed by the terminal device 101 or the server 102, the terminal device 101 or the server 102 obtains a preset image-text sample set, in which each training image-text sample comprises a sample image, a sample text, and the real matching degree between the sample image and the sample text. The terminal device 101 or the server 102 then performs multiple rounds of iterative training on the image-text matching model to be trained based on the image-text sample set. In each round of iterative training, training image-text samples are selected from the image-text sample set, and the image-text matching model used in the current round performs feature encoding, feature dimension by feature dimension, on each selected sample image to obtain the image feature set corresponding to the sample image. A plurality of target feature dimensions are selected from the feature dimensions corresponding to the image feature set according to a preset feature dimension selection strategy, and the image features of these target feature dimensions in the image feature set are fused to obtain an image fusion feature. The image fusion feature and the text feature of the sample text are mapped into the same feature space to obtain a mapped image feature and a mapped text feature, and cross-modal feature fusion is performed on them to obtain the corresponding multi-modal fusion feature, from which the prediction matching degree between the sample image and the sample text is determined. A model loss value of the image-text matching model is then determined based on the prediction matching degree obtained for each training image-text sample and the corresponding real matching degree, the parameters of the image-text matching model are adjusted based on the model loss value, and finally a target image-text matching model meeting the requirements is obtained and output.
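A condensed sketch of one such training round is shown below; the attribute names on the model (image_encoder, text_encoder, img_proj, txt_proj, fusion_head), the dot-product similarity, and the reuse of the model_loss helper from the earlier loss sketch are all assumptions made for illustration.

```python
import torch

def train_one_round(model, optimizer, batch, target_dims=(-4, -3, -2, -1)):
    """One assumed training iteration of the image-text matching model.
    batch: (sample_images, sample_texts, true_match), true_match in {0, 1}."""
    images, texts, true_match = batch

    # Dimension-by-dimension feature encoding of the sample images and texts.
    layer_feats = model.image_encoder(images, return_all_layers=True)   # list of [B, D]
    text_feats = model.text_encoder(texts)                              # [B, D]

    # Alignment on the last-obtained features: similarity matrix for the alignment loss.
    sim = layer_feats[-1] @ text_feats.t()                               # [B, B]

    # Fuse the selected target feature dimensions, map both modalities into one space, fuse.
    img_fusion = torch.stack([layer_feats[d] for d in target_dims]).mean(dim=0)
    # fusion_head is assumed to output a prediction matching degree in [0, 1].
    pred_match = model.fusion_head(model.img_proj(img_fusion), model.txt_proj(text_feats))

    loss = model_loss(sim, pred_match, true_match)   # see the loss sketch above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```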
The terminal device 101 and the server 102 may be in direct or indirect communication connection via one or more networks 103. The network 103 may be a wired network, or may be a Wireless network, including but not limited to a mobile cellular network, or may be a Wireless-Fidelity (WIFI) network, or may be other possible networks, which the embodiments of the present application are not limited to.
In the embodiment of the present application, the number of the terminal devices 101 may be one or more, and similarly, the number of the servers 102 may be one or more, that is, the number of the terminal devices 101 or the servers 102 is not limited.
In a possible implementation manner, the image-text retrieval method of the embodiment of the present application may be applied to an image-text retrieval scenario. A user inputs text information on the terminal device 101 and expects to obtain images corresponding to the content described by the text. The application installed on the terminal device 101 may be a search engine client provided to the user; the terminal device 101 sends the text information to the server 102, and for the text information input by the user, the server 102 obtains the matching degree between each image in the image data set and the text information based on the image-text retrieval method of the embodiment of the present application. According to the relative sizes of these matching degrees, images recommended to the user can be presented on the search engine client, and the user can view them through the search engine client.
In a possible implementation manner, the image-text retrieval method of the embodiment of the present application may be applied to a visual question-answering scenario, where a user inputs a question (text information) on the terminal device 101 and expects to obtain images corresponding to the answer to the question. The application installed on the terminal device 101 may be an AI question-answering client provided to the user, which sends the text information input by the user to the server 102. The server 102 calculates the matching degree between each image in the image data set and the text information based on the image-text retrieval method provided by this embodiment, and recommends the images whose relative matching degree meets a preset threshold. The AI question-answering client presents these images to the user, so that the user can view them through the AI question-answering client and have the question answered accordingly.
In a possible implementation manner, the image-text retrieval method of the embodiment of the present application may be applied to an image description generation scenario, where a user inputs a piece of text information on the terminal device 101, and desires to obtain an image corresponding to the text description content. An application installed on the terminal device 101 may generate a client for an image provided to a user and transmit text information input by the user to the server 102. The server 102 determines images with matching degree meeting a preset threshold value from the image dataset based on the image-text retrieval method provided by the embodiment, combines the image characteristics of the images and the text characteristics of the text information, generates corresponding target images, returns the target images to the terminal equipment, and finally displays the target images on the image generation client for the user to view.
Of course, the method provided by the embodiment of the present application is not limited to the application scenario shown in fig. 1, but may be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described together in the following method embodiments, which are not described in detail herein.
As shown in fig. 2, a model structure diagram of an image-text matching model according to an embodiment of the present application is shown, where the image-text matching model may include an image feature extraction module, a text feature extraction module, and a feature fusion module.
(1) Image feature extraction module: an encoder structure that performs feature encoding of the input image data, feature dimension by feature dimension, to extract the image features of the input image data at each feature dimension. Any encoder structure capable of image encoding can be used, including but not limited to convolutional neural networks (CNN), which gradually extract the low-level and high-level features of the image through multiple layers of convolution and pooling operations; common CNN structures include Visual Geometry Group (VGG) networks, ResNet, Inception, and so on, which are not listed one by one here.
In a possible implementation manner, the embodiment of the application can use the ViT model as the image feature extraction module of the image-text matching model. The ViT model has multiple Transformer encoder layers that extract features of the image data at different feature dimensions layer by layer, and the last-layer image feature output by the ViT model contains the global information of the image.
(2) Text feature extraction module: performs feature encoding on the text data of the input model to extract the text features of the input text data; any encoder structure capable of text encoding may be used, which is not listed one by one here, including but not limited to a recurrent neural network (RNN) or a Transformer. An RNN processes sequence data by recursively updating hidden states, mapping each word in the text to a fixed-length vector representation. A Transformer computes the contextual representation of each word directly through the attention mechanism and then averages or merges the contextual representations of all words to obtain the text vector representation.
In one possible implementation manner, the embodiment of the application can use a Bidirectional Encoder Representations from Transformers (BERT) model as the text feature extraction module of the image-text matching model. As a bidirectional encoder, the BERT model can use the information of both the preceding and following words when processing a word, so even if some words are randomly masked, the BERT model can still use all the unmasked words for prediction, which makes it excellent at text feature extraction.
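As an illustration of this choice, the snippet below extracts a sentence-level text feature with a pre-trained BERT encoder from the Hugging Face Transformers library; the bert-base-chinese checkpoint and the use of the [CLS] representation are assumptions, not requirements of the embodiment.

```python
# Assumed text-feature extraction with a pre-trained BERT encoder (Hugging Face Transformers).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # checkpoint name is illustrative
encoder = BertModel.from_pretrained("bert-base-chinese")

def encode_text(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        outputs = encoder(**inputs)
    # Use the [CLS] token representation as the text feature (mean pooling would also work).
    return outputs.last_hidden_state[:, 0]        # [1, hidden_size]
```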
(3) Feature fusion module: used for combining different image features and text features to generate multi-modal fusion features. Feature fusion can be implemented in a variety of ways, including: splicing different features together as the output feature; adding the different features as the output feature; averaging the different features as the output feature; or taking the element-wise maximum of the different features as the output feature.
The multi-modal fusion feature output by the feature fusion module can be used to obtain the matching degree between the image and the text through a fully connected (FC) layer or an interaction layer. Specifically, the multi-modal fusion feature can be passed through several fully connected layers for nonlinear mapping to output the matching degree; alternatively, a series of matching features can first be generated by the interaction layer from the interaction information between the image and text features, and then passed through several fully connected layers for nonlinear mapping to output the matching degree.
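A small sketch of such a matching head is shown below, assuming several fully connected layers with nonlinear mappings and a Sigmoid output that keeps the matching degree in [0, 1]; the exact layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class MatchHead(nn.Module):
    """Maps a multi-modal fusion feature to a matching degree via fully connected layers."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(),
            nn.Linear(dim, dim // 2), nn.GELU(),
            nn.Linear(dim // 2, 1))

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # Sigmoid keeps the predicted matching degree in [0, 1].
        return torch.sigmoid(self.mlp(fused)).squeeze(-1)
```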
Before the image-text matching model is put into use, training is needed in advance to enable the image-text matching model to be converged, so that the training process is first described.
Referring to fig. 3, a schematic flow chart of the training process of the image-text matching model provided in the embodiment of the present application is shown. The process may be performed by the server 102 or the terminal device 101 in fig. 1, or by the server 102 and the terminal device 101 together; the training process of the image-text matching model in the embodiment of the application is described in detail below mainly by taking the server 102 as an example.
Step 301: Obtaining an image-text sample set.
In the embodiment of the application, each training image-text sample in the image-text sample set can comprise a sample image, a sample text, and the real matching degree between the sample image and the sample text. When training the image-text matching model, training image-text samples can be extracted from the obtained image-text sample set as training sample data.
The true matching degree refers to the actual matching degree between the sample image and the sample text, and is used for enabling the model to learn the relevance between the correct image and the text, so that the image-text matching can be more accurately performed, and the true matching degree is used as supervision data for model training.
In one possible implementation, the true degree of match between the sample image and the sample text may be determined by way of a manual scoring by the relevant annotators, for example by 0/1 indicating a mismatch/match.
The image-text sample set in the embodiment of the application can be obtained in various modes, including but not limited to the following modes: selecting a matched image and text from the disclosed image and text dataset as a sample; obtaining images and texts with labels or descriptions from public platforms such as websites and social media, and screening out matched samples; in an artificial scene, matching samples are obtained by taking images and writing text.
Step 302: and determining the prediction matching degree between the sample image and the sample text included in each input training image text sample by adopting the image-text matching model used in the round.
In the embodiment of the present application, a specific process of determining the matching degree between the image and the text by using the image-text matching model will be specifically described in a subsequent model application process, so that the description is omitted herein.
Step 303: and determining a model loss value of the image-text matching model based on the obtained prediction matching degrees and the corresponding real matching degrees.
In the embodiment of the application, when calculating the loss value from the prediction matching degree and the real matching degree corresponding to each training image-text sample, a preset loss function can be used; the loss function may be a cross-entropy loss function, including but not limited to one based on the Sigmoid function. The loss function may also be, but is not limited to, a multi-class cross-entropy loss function, a contrastive loss function related to metric learning (Metric Loss), or a triplet loss function (Triplet Loss), and so on. In general, the loss value measures the degree of similarity between the actual output and the desired output: the smaller the loss value, the more similar the actual output and the desired output.
In one possible implementation manner, in order to enable the image features output by the image-text matching model to fully express the information of the image data and thereby improve the accuracy of image-text matching, self-supervised learning (SSL) can be performed on the single-modality features during model training. Self-supervised learning is a learning method that requires no manual annotation; without labeled data, it trains on the intrinsic properties of the single-modality data, thereby improving the modeling of that single modality. This includes, but is not limited to, training with the image data's own information: images of different versions are obtained through data enhancement, and the model then learns to predict the relation between these versions through self-supervised tasks including, but not limited to, image rotation prediction, image colorization, and image inpainting, so that the model learns more useful features.
Referring to fig. 4, which is a schematic diagram of a self-supervised learning process for the image modality according to an embodiment of the present application, for each input sample image, the image-text matching model performs data enhancement processing on each sample image to obtain an enhanced image pair corresponding to each sample image. Each enhanced image pair comprises two enhanced images obtained by applying different data enhancement processing modes to one sample image, including but not limited to flipping the image, rotating the image, scaling the image, shearing the image, translating the image, adding random noise, and adjusting the brightness/contrast of the image. Further, the image symmetry loss value may be determined according to a similarity between the two enhanced images included in each of the respective enhanced image pairs, and the image symmetry loss value and the image-text matching loss value together constitute the total model loss value of the image-text matching model.
When the feature encoding of the sample image is performed as shown in fig. 4, for each enhanced image pair, feature encoding processing may be performed on two enhanced images included in each enhanced image pair, so as to obtain enhanced image features corresponding to the two enhanced images, determine sample symmetry loss values of the enhanced image pair according to similarities between the enhanced image features corresponding to the two enhanced images, and determine the image symmetry loss values based on the obtained sample symmetry loss values.
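A possible sketch of generating such an enhanced image pair, assuming torchvision transforms (the particular augmentations and parameters are illustrative choices, not taken from the application):

import torch
from torchvision import transforms
from PIL import Image

# One augmentation pipeline; applying it twice to the same image yields two
# differently enhanced views (cropping/scaling, flipping, colour jitter, rotation).
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
])

def make_enhanced_pair(sample_image: Image.Image):
    """Return two differently augmented versions of one sample image."""
    return augment(sample_image), augment(sample_image)

# Example with a placeholder image (in practice a sample image from the image-text sample set).
dummy = Image.new("RGB", (256, 256), color=(128, 128, 128))
view_1, view_2 = make_enhanced_pair(dummy)
print(view_1.shape, view_2.shape)  # torch.Size([3, 224, 224]) each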
In one possible implementation, self-supervised learning of image features may be implemented by a SimSiam model, where the SimSiam model contains two identical neural networks: a learner (which learns a representation of an image) and an evaluator (which evaluates the learner's representation). The learner extracts features from the input image, and the evaluator evaluates whether the features extracted by the learner correctly express the input image. The training objective of the SimSiam model is to maximize the similarity between the representations produced by the learner and the evaluator while minimizing the variation of the learner's output across different versions of the input image.
As shown in FIG. 4 above, the SimSiam model applies different data enhancements to an input image I to obtain two different images I_1 and I_2, and the two images respectively pass through an image feature extractor to obtain features f_1 and f_2. Because the features f_1 and f_2 are themselves different, directly aligning them may cause the features to collapse; therefore, when feature alignment is performed to calculate the similarity between the features, one of the features is first mapped through proj and then aligned with the other feature.
Therefore, the image symmetry loss function can be calculated, for example, as the symmetric combination of the two aligned similarities:

L_simsiam = 1/2 · sim(proj(f_1), f_2) + 1/2 · sim(proj(f_2), f_1)
the sim (x, y) represents the similarity of the two feature vectors of the comparison x, y, and the simsian model can calculate the similarity or distance of the two input features in a weight sharing manner and output corresponding values, and usually uses euclidean distance or cosine similarity to calculate the similarity score. Specifically, the simian model outputs a similarity score of any real number between 0 and 1, which indicates the degree of similarity between input data pairs, 0 indicates complete dissimilarity, and 1 indicates complete similarity, and the simian model is trained in a manner of minimizing the similarity or distance of positive samples and maximizing the similarity or distance of negative samples, so that the aim of feature alignment is achieved.
proj(x) denotes a mapping operation on the feature x, and the SimSiam model typically implements this feature mapping through a multi-layer perceptron (Multilayer Perceptron, MLP). Specifically, the learner and the evaluator in SimSiam each map the extracted features through one MLP. These MLPs are composed of multiple fully connected layers, where each fully connected layer contains a set of learnable parameters for converting feature vectors into new feature representations, and the MLPs of the learner and the evaluator share parameters, i.e., both learn the feature mapping using the same learning rule. During training of the SimSiam model, the MLPs of the learner and the evaluator are optimized with the similarity loss function through a back-propagation algorithm, so that the feature mappings of the learner and the evaluator are gradually adjusted to optimize the value of the similarity loss function.
Thus, by implementing feature mapping with an MLP, the SimSiam model can learn an effective feature representation of an image in a shorter time. Meanwhile, the MLP has good fitting capability and can convert the feature vector into a higher-dimensional feature representation, thereby improving the classification and detection performance of the SimSiam model.
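A simplified sketch of this symmetric, projection-based similarity loss follows (PyTorch is assumed; the layer sizes, the stop-gradient via detach and the negative-cosine form are common SimSiam conventions rather than details stated in the application):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionMLP(nn.Module):
    """Maps an extracted feature to a new representation via fully connected layers (proj)."""
    def __init__(self, dim: int = 256, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, dim)
        )

    def forward(self, x):
        return self.net(x)

def symmetric_similarity_loss(f1, f2, proj):
    """Symmetric image loss: project one feature, align it with the other (detached
    to avoid representation collapse), and average the two directions."""
    sim_a = F.cosine_similarity(proj(f1), f2.detach(), dim=-1)
    sim_b = F.cosine_similarity(proj(f2), f1.detach(), dim=-1)
    # Negate so that minimising the loss maximises the similarity.
    return -0.5 * (sim_a.mean() + sim_b.mean())

# Toy usage with random stand-ins for the enhanced image features f_1 and f_2.
proj = ProjectionMLP()
f1, f2 = torch.randn(8, 256), torch.randn(8, 256)
print(symmetric_similarity_loss(f1, f2, proj).item())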
In one possible implementation, in order to improve the prediction accuracy of the image-text matching model for unknown data so that the model can better adapt to different downstream tasks, multiple image-text pre-training tasks can be used for multi-stage, multi-task training of the image-text matching model, including but not limited to an image-text alignment task, a text modeling task, an image-text matching task, and the like. Combining training tasks of different modalities and different training stages to pre-train the image-text matching model gives the model a preliminary feature expression capability, accelerates model training, reduces training time and learning difficulty, and makes the model easier to converge when learning downstream tasks. It also makes the learning of the image-text matching model more hierarchical, so that the model can conveniently learn downstream tasks from shallow to deep. In addition, pre-training on tasks of different modalities gives the model a richer and more comprehensive feature expression capability, enabling it to capture the correlations and latent structures among data of different modalities, which helps the model adapt to various types of downstream tasks. The image-text sample set adopted in the embodiment of the application therefore not only includes the image information and text information related to the image-text pre-training tasks, but may also use different data structures for the training samples corresponding to different tasks.
In one possible implementation manner, an Image-Text Alignment task (ITA) may be used to train the image feature extraction module and the text feature extraction module in the image-text matching model. That is, the image features and text features output by the two modules are semantically aligned using a contrastive learning method and mapped into the same embedding space, so that the distance between matched image features and text features in the embedding space is minimized and the image-text matching model better learns the correspondence between images and texts.
Specifically, the calculation process of the image-text alignment loss function of the ITA task includes the following steps: for each sample image, it is compared with its corresponding text sample and with the other text samples, and the cosine similarity between the image and the positive sample and the cosine similarities between the image and the negative samples are calculated, where the positive sample is the text sample corresponding to the current sample image and the negative samples are the other text samples that do not match the current sample image. Meanwhile, for each sample text, it is compared with its corresponding sample image and with the other sample images, the cosine similarity between the text and the positive sample image and the cosine similarities between the text and the negative sample images are calculated, and a loss function is computed according to the differences between these cosine similarities. Thus the loss function L_ita of the ITA task is the sum of the two partial loss functions for the image-side and text-side alignment tasks.
In one possible implementation manner, since the main purpose of the image-text alignment task is to achieve semantic alignment of images and texts, the task can be satisfied by considering only the global information of the images and texts. The features obtained last in the feature-by-feature-dimension encoding process correspond to the global structure and semantic information of the image or text, and are therefore the most representative and discriminative global features.
Therefore, in the embodiment of the application, feature coding of feature-by-feature dimensions is respectively carried out on each input sample image and each sample text, the sample image features obtained last in the feature-by-feature-dimension encoding process are aligned with the sample text features obtained last, the similarity between the sample image features and the sample text features is obtained, and the image-text alignment loss value is determined based on this similarity. By aligning only the last-obtained global features, a good alignment effect can be achieved while reducing the computational complexity and storage footprint of the model and improving its running efficiency.
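A hedged sketch of such a contrastive image-text alignment loss over a batch of paired global features (an InfoNCE-style formulation with a temperature is assumed here; the exact loss form used by the application may differ):

import torch
import torch.nn.functional as F

def ita_loss(image_feats, text_feats, temperature: float = 0.07):
    """Image-text alignment loss: for each image, its paired text is the positive
    sample and all other texts in the batch are negatives (and symmetrically for
    each text), using cosine similarity over L2-normalised global features."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    logits = image_feats @ text_feats.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(logits.size(0))                # i-th image matches i-th text

    loss_i2t = F.cross_entropy(logits, targets)           # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text-to-image direction
    return loss_i2t + loss_t2i                            # sum of the two partial losses

# Toy usage with the last-layer global features of 4 image-text pairs.
img = torch.randn(4, 256)
txt = torch.randn(4, 256)
print(ita_loss(img, txt).item())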
In one possible implementation, a text modeling task can be used to train the text feature extraction module in the image-text matching model, so that it better understands and captures the language rules and semantic information in the text, and this information can then be better exploited in various downstream tasks.
Specifically, the text feature extraction module of the image-text matching model may be trained with an MLM task. Under this task, some randomly selected words in the sample text input to the model are replaced with a special "MASK" symbol, such as "[MASK]", and the image-text matching model then needs to predict the words represented by these "MASK" symbols from the other words in the context. In this process, the image-text matching model can learn associations between words in the text, such as the contexts in which a word appears and the similarities between words, and such information can be used for many natural language processing tasks, including but not limited to text classification, emotion analysis and named entity recognition.
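A minimal sketch of the masking step of such an MLM task (simplified; production MLM schemes typically also replace some selected words with random words or keep them unchanged):

import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob: float = 0.15):
    """Randomly replace some tokens of a sample text with the [MASK] symbol and
    record the original words as prediction targets for the MLM task."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok          # word the model must predict from its context
            masked.append(MASK_TOKEN)
        else:
            masked.append(tok)
    return masked, targets

sentence = "a brown dog runs across the green field".split()
masked_sentence, mlm_targets = mask_tokens(sentence)
print(masked_sentence)
print(mlm_targets)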
In one possible implementation, as shown in the training task diagram of the image-text matching model in fig. 5, when training the image-text matching model with multiple image-text pre-training tasks such as the image-text alignment task, the text modeling task and the image-text matching task, the total model loss value L of the image-text matching model may be obtained by summing the image-text alignment loss function L_ita, the text modeling loss function L_mlm, the image-text matching loss function L_itm and the image symmetry loss function L_simsiam, as follows:
L = L_ita + L_mlm + L_itm + L_simsiam
In a possible implementation, the image-text alignment loss function L_ita, the text modeling loss function L_mlm, the image-text matching loss function L_itm and the image symmetry loss function L_simsiam may also each be given a certain weight and summed in a weighted manner to obtain the total model loss value L of the image-text matching model.
Step 304: determining whether the model loss value converges to a preset target value; if not, go to step 305; if so, step 306 is performed.
In the embodiment of the application, it is judged whether the model loss value of the image-text matching model has converged to a preset target value. If the model loss value is smaller than or equal to the preset target value, or the variation amplitude of the model loss values obtained in N consecutive training rounds is smaller than or equal to the preset target value, the model loss value is considered to have converged to the preset target value; otherwise, the model loss value has not converged. Alternatively, the model may be considered to have converged when the number of model training iterations reaches a preset value.
Step 305: and adjusting parameters of the image-text matching model according to the determined model loss value.
In the embodiment of the present application, if the model loss value is not converged, the model parameter is adjusted according to the model loss value, and after the model parameter is adjusted, the process returns to execute step 302 to continue the training process of the next round.
Step 306: if the model converges, the training process is ended, and a trained image-text matching model is obtained.
In the embodiment of the application, if the loss value is determined to have converged, the image-text matching model used in this round is taken as the trained image-text matching model.
Referring to fig. 6, a flowchart of a method for retrieving graphics and text provided in an embodiment of the present application may be implemented by the server 102 or the terminal device 101 in fig. 1, or may be implemented by the server 102 and the terminal device 101 together, where the description is mainly given by taking the server 102 as an example, and the flowchart of the method is described below.
Step 601: and respectively carrying out feature coding of feature-by-feature dimensions on each target image in the target image set to be retrieved to obtain image feature sets corresponding to each target image.
In the embodiment of the application, the server performs layer-by-layer encoding over the feature dimensions of each target image in the target image set to be retrieved through the image feature extraction module in the image-text matching model, obtaining the image feature set corresponding to each target image, where each image feature corresponds to one image feature dimension. Image features of different feature dimensions contain representations of the target image at different levels of abstraction and with different semantic information: lower-layer features focus more on local information of the image, such as its edges and textures, while higher-layer features focus more on global information and abstract semantic information of the image and can represent more complex semantics such as the shape and position of objects.
For example, as shown in fig. 4 above, the embodiment of the present application may use the multi-layer Transformer encoder in a ViT model to extract image features of each target image in multiple feature dimensions layer by layer, so as to obtain the image feature set corresponding to each target image. This differs from model training, where data enhancement is applied to one input image, the features of the two different enhanced images are extracted separately, and self-supervised learning of the image feature extraction module is realized through feature alignment; when the trained image-text matching model is actually applied, the features of the input image are extracted layer by layer only once.
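A possible sketch of this layer-by-layer extraction, assuming the Hugging Face transformers library and a 12-layer ViT-Base checkpoint as a stand-in for the image feature extraction module:

import torch
from transformers import ViTModel, ViTImageProcessor
from PIL import Image

# Assumed backbone: a 12-layer ViT encoder from the Hugging Face hub.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

image = Image.new("RGB", (224, 224), color=(200, 180, 160))  # placeholder target image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states contains the embedding output plus one feature map per encoder layer;
# each per-layer output plays the role of the image feature for one feature dimension.
image_feature_set = outputs.hidden_states[1:]          # 12 layer-wise features
cls_per_layer = [h[:, 0] for h in image_feature_set]   # [CLS] token as the global feature
print(len(image_feature_set), cls_per_layer[-1].shape)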
Step 602: and respectively carrying out alignment processing based on the image features obtained last time in each image feature set and the text features of the target text to be retrieved, and obtaining the similarity between each image feature and the text feature.
In the embodiment of the application, the server can use the image feature extraction module to perform feature alignment between the image feature obtained last in the layer-by-layer encoding of each target image to be retrieved and the text feature of the target text, so as to calculate the similarity between each image feature and the text feature.
Specifically, common feature alignment methods may include, but are not limited to, projection mapping and bilinear fusion. The projection mapping method generally uses a fully connected layer to map the image features and the text features into feature vectors of the same dimension, and then uses a measurement method such as cosine similarity or Euclidean distance to calculate the similarity between the two. The bilinear fusion method computes a bilinear interaction between the image features and the text features to obtain the similarity between them.
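A small sketch of the projection-mapping variant (the layer sizes and the shared embedding dimension are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical projection heads mapping the last-layer image feature and the text
# feature into a shared embedding space of the same dimensionality.
image_proj = nn.Linear(768, 256)
text_proj = nn.Linear(512, 256)

def alignment_similarity(last_image_feature, text_feature):
    """Project both features with fully connected layers and compare them with
    cosine similarity, as in the projection-mapping alignment described above."""
    img = F.normalize(image_proj(last_image_feature), dim=-1)
    txt = F.normalize(text_proj(text_feature), dim=-1)
    return F.cosine_similarity(img, txt, dim=-1)

# Toy usage: similarity between 100 target images' last-layer features and one target text.
image_global = torch.randn(100, 768)
text_global = torch.randn(1, 512)
sims = alignment_similarity(image_global, text_global.expand(100, -1))
print(sims.shape)  # torch.Size([100])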
Step 603: at least one candidate image is determined from the set of target images based on the obtained similarity.
In the embodiment of the application, after obtaining the similarity between each target image to be retrieved and the target text, the server can screen part of candidate images from the plurality of target images according to the relative magnitude relation between each similarity for the subsequent image-text matching process.
For example, after calculating the similarity between 100 target images to be retrieved and the target text, 10 candidate images with larger similarity can be screened out according to the relative size between the similarities for subsequent image-text matching.
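For instance, this screening step might look like the following (the numbers are illustrative):

import torch

# Similarity of each of 100 target images to the target text; keep the 10 most similar
# images as candidate images for the subsequent image-text matching process.
sims = torch.rand(100)
top_sims, candidate_idx = torch.topk(sims, k=10)
print(candidate_idx.tolist())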
Step 604: based on a preset image feature dimension selection strategy, selecting a plurality of target feature dimensions from feature dimensions corresponding to an image feature set, and performing image feature fusion on image features of the plurality of target feature dimensions corresponding to at least one candidate image to obtain image fusion features corresponding to the at least one candidate image.
In the embodiment of the application, in order to make the feature fusion module fully utilize the information on different layers of the image and improve the performance of image and text fusion, the image features corresponding to a plurality of target feature dimensions in the image feature set can be fused, and the obtained image fusion features are used as the input features of the feature fusion module.
In one possible implementation, the preset feature dimension selection policy may be determined according to how image features of different feature dimensions have historically performed. Specifically, by comprehensively considering the independence and complementarity of features across different feature dimensions, features may be taken from different convolution layers, or extracted from different perspectives with different feature extractors such as VGG and ResNet, so that the selected feature dimensions are complementary and relatively independent. This improves the diversity and robustness of feature fusion and avoids the influence of repeated and redundant information.
In one possible implementation manner, the image features of the plurality of target feature dimensions may be weighted according to the weight values corresponding to the selected plurality of target feature dimensions, so as to obtain the image fusion feature.
In one possible implementation, the weight value corresponding to the target feature dimension may be determined according to the number of the plurality of target feature dimensions, and the image fusion feature is an average feature of the image features corresponding to the plurality of target feature dimensions.
Specifically, the image coding module performs feature coding of feature-by-feature dimensions on the target image, outputs 12 layers of image features, and uses average features corresponding to the output image features of the last six layers as input features of the feature fusion module, as follows:
input_image = (1/6) · Σ_{i=7}^{12} f_i

wherein input_image represents the image input feature of the feature fusion module, and f_i represents the i-th layer feature output by the image coding module.
In one possible implementation, since lower-level features represent more local characteristics of the image, such as edges and textures, while higher-level features represent more complex semantic information, such as the shape and position of objects, the weight value corresponding to each target feature dimension can be determined according to the importance of the features of different feature dimensions to the image fusion feature. For example, higher-level features may be given larger weight values while lower-level features are given relatively smaller weight values.
Specifically, the importance of each feature dimension may be determined by correlation analysis, principal component analysis (Principal Component Analysis, PCA), or the like.
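The averaging and weighting schemes above might be sketched as follows (the layer count, feature size and specific weight values are illustrative assumptions):

import torch

# Suppose the image coding module outputs 12 layer-wise features of size 768
# for one candidate image (hypothetical shapes).
layer_features = [torch.randn(1, 768) for _ in range(12)]

# Average the last six layers as the image input feature of the feature fusion module.
input_image = torch.stack(layer_features[6:], dim=0).mean(dim=0)

# Alternatively, weight the selected target feature dimensions, e.g. giving higher
# layers larger weights (these weights are illustrative, not taken from the application).
weights = torch.tensor([0.05, 0.10, 0.15, 0.20, 0.25, 0.25]).view(6, 1, 1)
input_image_weighted = (torch.stack(layer_features[6:], dim=0) * weights).sum(dim=0)
print(input_image.shape, input_image_weighted.shape)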
Step 605: and mapping the obtained at least one image fusion feature and the text feature into the same feature space respectively to obtain at least one mapping image feature and mapping text feature in the same feature space, and performing cross-modal feature fusion on the at least one mapping image feature and the mapping text feature respectively to obtain multi-modal fusion features corresponding to the at least one candidate image respectively.
In the embodiment of the application, image features and text features come from data of different modalities and have different data distributions and expression forms, i.e., they are heterogeneous. Therefore, before cross-modal feature fusion is performed on the image fusion features and the text features, they are first aligned, i.e., mapped into the same feature space to eliminate the heterogeneity between them, so that the aligned mapped image features and mapped text features can undergo cross-modal feature fusion more effectively. This improves fusion performance on image-text fusion tasks including but not limited to image annotation, image-text retrieval and image description generation, since the aligned features can better capture the semantic relationship between images and texts. The alignment processing also improves the matching precision between image and text features: the aligned mapped image features and mapped text features can be retrieved against each other, and when similar objects are searched through image and text queries, the alignment processing allows matching results to be found more accurately.
In one possible implementation manner, cross-modal feature fusion refers to integrating the information contained in feature vectors of different modalities, and any one of the following modes may be adopted (a combined code sketch of these options follows the list):
(1) Vector concatenation
When cross-modal feature fusion is performed by vector concatenation, the feature fusion module can comprise a vector concatenation layer, so that the image feature vector and the text feature vector are concatenated in a set manner through this layer to obtain the multi-modal fusion feature vector of the image-text pair. Specifically, one unimodal feature vector may be concatenated after the other, e.g., the unimodal feature vector of the text modality is concatenated after the unimodal feature vector of the image modality.
After cross-modal feature fusion is performed by vector concatenation, the resulting concatenated feature vector has more dimensions, and a dimensionality reduction method can be applied to the concatenated feature vector to obtain the multi-modal fusion feature vector.
(2) Feature pooling (pooling)
When cross-modal feature fusion is performed in a feature pooling manner, the feature fusion module may include a pooling layer to pool a plurality of image feature vectors and text feature vectors through the pooling layer to obtain multi-modal fusion feature vectors of the image-text.
Specifically, the pooling process may be performed by using a pooling process manner such as max-pooling (max-pooling) or mean-pooling (mean-pooling), which is not limited in this embodiment of the present application.
(3) Convolution processing
When cross-modal feature fusion is performed through convolution processing, the feature fusion module may include a convolution layer, so as to perform convolution operation on a feature matrix composed of a plurality of image feature vectors and text feature vectors through the convolution layer, and obtain multi-modal fusion feature vectors of the image-text by adopting a set step length.
Specifically, the convolution layer may include at least one weight matrix, where parameters in the weight matrix may be obtained through training, and the feature matrix is subjected to convolution processing through the weight matrix, so as to obtain a multi-mode fusion feature vector of the image-text.
(4) Full join processing
When feature fusion is performed through full connection processing, the feature fusion module may include a full connection layer (fully connected layers, FC) to map a plurality of image feature vectors and text feature vectors to obtain a multi-mode fusion feature vector.
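Under assumed feature sizes, the four fusion options above might be sketched as follows:

import torch
import torch.nn as nn

mapped_image = torch.randn(1, 256)   # mapped image feature in the shared space
mapped_text = torch.randn(1, 256)    # mapped text feature in the shared space

# (1) Vector concatenation: text feature spliced after the image feature,
#     optionally followed by a reduction layer.
concat = torch.cat([mapped_image, mapped_text], dim=-1)        # (1, 512)
reduce = nn.Linear(512, 256)
fused_concat = reduce(concat)

# (2) Feature pooling: element-wise max or mean pooling over the two modality features.
stacked = torch.stack([mapped_image, mapped_text], dim=0)      # (2, 1, 256)
fused_max = stacked.max(dim=0).values
fused_mean = stacked.mean(dim=0)

# (3) Convolution: treat the stacked features as a 2-channel sequence and convolve.
conv = nn.Conv1d(in_channels=2, out_channels=1, kernel_size=3, padding=1)
fused_conv = conv(stacked.permute(1, 0, 2)).squeeze(1)         # (1, 256)

# (4) Fully connected mapping of the concatenated features.
fc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 256))
fused_fc = fc(concat)

print(fused_concat.shape, fused_max.shape, fused_conv.shape, fused_fc.shape)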
Step 606: and respectively obtaining the matching degree between each of the at least one candidate image and the target text based on the obtained at least one multi-modal fusion feature, and determining the matching image with the target text from the at least one candidate image based on the obtained at least one matching degree.
In the embodiment of the application, after the multi-modal fusion feature of each image fusion feature and the text feature is obtained, matching degree prediction can be performed on each multi-modal fusion feature, so as to obtain the matching degree between each candidate image and the target text.
Specifically, the matching degree prediction may be implemented using any possible classifier, including but not limited to, using FC or softmax methods, and the like.
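For instance, a matching head over a multi-modal fusion feature could be sketched as follows (the feature dimension and the choice between a sigmoid head and a two-class softmax head are assumptions):

import torch
import torch.nn as nn

fusion_feature = torch.randn(1, 256)   # multi-modal fusion feature of one candidate image

# Sigmoid variant: maps the fusion feature to a matching degree in [0, 1].
sigmoid_head = nn.Sequential(nn.Linear(256, 1), nn.Sigmoid())
match_degree = sigmoid_head(fusion_feature)

# Two-class softmax variant: probability of the "match" class is the matching degree.
softmax_head = nn.Linear(256, 2)
match_prob = torch.softmax(softmax_head(fusion_feature), dim=-1)[:, 1]

print(match_degree.item(), match_prob.item())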
In a possible implementation, as shown in fig. 7, which is a schematic view of an image-text retrieval task scenario, a user sends the text information to be retrieved from a terminal device to a server, expecting to retrieve from an image dataset the target image corresponding to the content described in the text. The server receives the text information to be retrieved sent by the terminal device, performs preliminary alignment processing between the text features corresponding to the target text and the last-layer image features of each image in the image dataset output by the image feature extraction module, obtains the similarity between each image in the image dataset and the target text, and screens M images to be matched with relatively high similarity out of the huge image dataset by comparing the similarity values. The server then fuses the image features of each of the M images to be matched to obtain M image fusion features, performs cross-modal feature fusion between each of them and the text feature of the target text to obtain M multi-modal fusion features, obtains the matching degree between each of the M images to be matched and the target text from the M multi-modal fusion features, and finally obtains the target image with the highest matching degree by comparing the matching degrees.
Specifically, fig. 8 is a schematic diagram showing the performance of the image-text matching model provided in the embodiment of the present application and of the UNITER, Oscar, ALBEF and TCL models on the image-text retrieval task.
Here, i2t denotes retrieving text by image, and t2i denotes retrieving a target image by text. R@1, R@5 and R@10 are among the metrics used to measure retrieval performance. R@1 means that, in a given test set, for each query sample only the single most similar retrieval result is retained, and the proportion of these most similar results that truly match their query samples is then calculated.
Specifically, an R@1 of 79.4 indicates that the model correctly matches 79.4% of the query samples. Similarly, R@5 and R@10 mean that, in a given test set, only the five or ten results most similar to each query sample are retained, and the proportion of query samples whose true match appears among these results is then calculated.
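A short sketch of how such an R@K value could be computed, assuming the i-th query's ground-truth match is the i-th gallery item:

import torch

def recall_at_k(similarity: torch.Tensor, k: int) -> float:
    """similarity[i, j] is the score between query i and gallery item j; the ground
    truth for query i is assumed to be gallery item i. Returns the fraction of
    queries whose true match appears among the top-k retrieved items."""
    topk = similarity.topk(k, dim=1).indices                   # (num_queries, k)
    targets = torch.arange(similarity.size(0)).unsqueeze(1)    # (num_queries, 1)
    hits = (topk == targets).any(dim=1).float()
    return hits.mean().item()

scores = torch.randn(100, 100)     # e.g. 100 text queries against 100 images
print(recall_at_k(scores, 1), recall_at_k(scores, 5), recall_at_k(scores, 10))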
Experimental verification shows that, with the training method of the image-text matching model provided by the embodiment of the application and learning on both general-domain data and in-domain data, the trained image-text matching model achieves a clearly improved retrieval performance on both image-to-text and text-to-image retrieval tasks compared with the UNITER, Oscar, ALBEF and TCL models, making image-text matching more accurate.
Referring to fig. 9, based on the same inventive concept, an embodiment of the present application further provides an image-text retrieving apparatus 90, which includes:
the feature encoding module 901 is configured to perform feature encoding of feature dimensions on each target image in the target image set to be retrieved, so as to obtain an image feature set corresponding to each target image; wherein each image feature in the image feature set corresponds to a feature dimension;
the image-text alignment module 902 is configured to perform alignment processing based on the image features obtained last time in each image feature set and the text features of the target text to be retrieved respectively, so as to obtain the similarity between each image feature and the text feature;
a candidate determining module 903, configured to determine at least one candidate image from the target image set based on the obtained similarity;
the image fusion module 904 is configured to select a plurality of target feature dimensions from feature dimensions corresponding to the image feature set based on a preset image feature dimension selection policy, and perform image feature fusion on image features of the plurality of target feature dimensions corresponding to at least one candidate image, so as to obtain image fusion features corresponding to at least one candidate image;
The image-text fusion module 905 is configured to map the obtained at least one image fusion feature and text feature into the same feature space, obtain at least one mapped image feature and mapped text feature in the same feature space, and perform cross-modal feature fusion on the at least one mapped image feature and mapped text feature, respectively, to obtain multi-modal fusion features corresponding to the at least one candidate image;
and a matching output module 906, configured to obtain a degree of matching between each of the at least one candidate image and the target text based on the obtained at least one multimodal fusion feature, and determine a matching image with the target text from the at least one candidate image based on the obtained at least one degree of matching.
Optionally, the image fusion module 904 is specifically configured to:
for at least one candidate image, the following operations are performed:
for one candidate image, weighting the image features of the candidate image corresponding to the plurality of target feature dimensions based on the weight values corresponding to the plurality of target feature dimensions to obtain corresponding image fusion features;
wherein each weight value is determined based on the number of the plurality of target feature dimensions; alternatively, each weight value is determined from importance to the image fusion feature based on a plurality of target feature dimensions.
Optionally, the image-text fusion module 905 is specifically configured to:
for at least one of the mapped image features, performing any one of the following operations, respectively:
vector splicing processing is carried out on the mapping image features and the mapping text features aiming at one mapping image feature, so that corresponding multi-mode fusion features are obtained;
aiming at one mapping image feature, carrying out pooling treatment on the mapping image feature and the mapping text feature to obtain a corresponding multi-mode fusion feature;
performing convolution operation processing on the mapping image features and the mapping text features aiming at one mapping image feature to obtain corresponding multi-mode fusion features;
and aiming at one mapping image feature, mapping the mapping image feature and the mapping text feature to obtain a corresponding multi-mode fusion feature.
Optionally, the apparatus further comprises a model training unit 907 for:
obtaining a graphic sample set, wherein each training graphic sample in the graphic sample set comprises a sample image, a sample text and a real matching degree between the sample image and the sample text;
performing multiple rounds of iterative training on the image-text matching model based on the image-text sample set until convergence conditions are met; wherein each round of iterative training process comprises the following steps:
Determining the prediction matching degree between the sample image and the sample text included in each input training image text sample by adopting an image-text matching model used in the round;
and determining a model loss value of the image-text matching model based on the obtained prediction matching degrees and the corresponding real matching degrees, and carrying out parameter adjustment on the image-text matching model based on the model loss value.
Optionally, the model training unit 907 is specifically configured to:
for each training image-text sample, the following processing is carried out respectively:
respectively carrying out feature coding of feature-by-feature dimensions on each input sample image and each sample text;
carrying out alignment processing on the sample image features obtained in the last time and the sample text features obtained in the last time in the feature-dimension-by-feature encoding process to obtain the similarity between the sample image features and the sample text features;
determining an image-text alignment loss value based on the similarity;
determining an image-text matching loss value based on the obtained prediction matching degrees and the corresponding real matching degrees;
and obtaining a model loss value based on the image-text alignment loss value and the image-text matching loss value.
Optionally, the model training unit 907 is specifically configured to:
Respectively carrying out data enhancement processing on each input sample image to obtain enhancement image pairs corresponding to each sample image;
determining an image symmetry loss value based on a similarity between two enhanced images included in each of the respective enhanced image pairs;
determining an image-text matching loss value based on the obtained prediction matching degrees and the corresponding real matching degrees;
and obtaining a model loss value based on the image symmetry loss value and the image-text matching loss value.
Optionally, the model training unit 907 is specifically configured to:
for each enhanced image pair, the following processing is performed:
for one enhanced image pair, respectively carrying out feature coding processing on two enhanced images included in the enhanced image pair to obtain enhanced image features corresponding to the two enhanced images;
determining a sample symmetry loss value of the enhanced image pair based on the similarity between the enhanced image features corresponding to each of the two enhanced images;
an image symmetry-loss value is determined based on the obtained respective sample symmetry-loss values.
According to the image-text retrieval device provided by the embodiment of the application, feature coding of feature-by-feature dimensions is performed on each image to be retrieved to obtain the image features of each image in each feature dimension; the image features obtained by the last feature coding of each image to be retrieved are aligned with the text features of the target text to be retrieved to obtain the similarity between each image feature and the text features, and candidate images are determined from the images to be retrieved according to the similarity. By aligning only the image features obtained by the last feature coding with the text features, a subset of candidate images is preliminarily screened from a large number of images to be retrieved for image-text matching, which avoids performing multi-dimension image feature fusion on all images to be retrieved and thus improves image-text retrieval efficiency. Then, a plurality of target feature dimensions are selected according to a preset feature dimension selection strategy, and the image features of each candidate image under these target feature dimensions are fused to obtain the image fusion feature of each candidate image for fusion with the text features. Since image features of different feature dimensions represent image information at different scales and depths, including but not limited to textures, colors and shapes, the image fusion features obtained from multiple feature dimensions can make full use of information from different layers of the image, avoid losing local image information, provide finer-grained image fusion features, promote the interaction of image and text features across multi-layer information, and improve the accuracy of image-text retrieval. Furthermore, the image fusion features and the text features are mapped into the same feature space so that their feature dimensions are consistent before cross-modal feature fusion is performed. Compared with direct fusion, this avoids the loss and confusion of image and text information caused by directly fusing image and text features of different dimensions, so that the cross-modal fusion features express the semantic information of the image and the text more accurately, further promoting the interaction of image and text features and improving the accuracy of image-text retrieval.
Referring to fig. 10, based on the same technical concept, an embodiment of the present application further provides a computer device 100, which may include a memory 1001 and a processor 1002. The computer device 100 may be the terminal device 101 or the server 102 shown in fig. 1, and when the computer device 100 is the server 102, the memory 1001 and the processor 1002 may correspond to the memory 1022 and the processor 1021 of the server 102, respectively.
The memory 1001 is configured to store a computer program executed by the processor 1002. The memory 1001 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the computer device, etc. The processor 1002 may be a central processing unit (central processing unit, CPU), or a digital processing unit, or the like. The specific connection medium between the memory 1001 and the processor 1002 is not limited in the embodiments of the present application. In the embodiment of the present application, the memory 1001 and the processor 1002 are connected by a bus 1003 in fig. 10, the bus 1003 is shown by a thick line in fig. 10, and the connection manner between other components is only schematically illustrated, but not limited to. The bus 1003 may be classified as an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 10, but not only one bus or one type of bus.
Memory 1001 may be volatile memory (RAM) including, but not limited to, random-access memory (RAM); the memory 1001 may also be a non-volatile memory (non-volatile memory) including, but not limited to, a read-only memory, a flash memory (flash memory), a hard disk (HDD) or a Solid State Drive (SSD), or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Memory 1001 may be a combination of the above.
A processor 1002 for executing the methods performed by the apparatus in the embodiments shown in fig. 2 to 8 when calling the computer program stored in the memory 1001.
In some possible embodiments, aspects of the method provided by the application may also be implemented in the form of a program product comprising program code for causing a computer device to carry out the steps of the method according to the various exemplary embodiments of the application described in this specification, when said program product is run on the computer device, i.e. the computer device may carry out the method as carried out by the device in the examples shown in fig. 2-8.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. The image-text retrieval method is characterized by comprising the following steps of:
feature coding of feature-by-feature dimensions is carried out on each target image in a target image set to be retrieved respectively, and image feature sets corresponding to each target image are obtained; wherein each image feature in the image feature set corresponds to a feature dimension;
performing alignment processing based on the image features obtained last time in each image feature set and the text features of the target text to be retrieved respectively to obtain the similarity between each image feature and the text feature;
determining at least one candidate image from the target image set based on the obtained similarity;
selecting a plurality of target feature dimensions from feature dimensions corresponding to an image feature set based on a preset image feature dimension selection strategy, and performing image feature fusion on image features of the plurality of target feature dimensions corresponding to at least one candidate image to obtain image fusion features corresponding to the at least one candidate image;
mapping the obtained at least one image fusion feature and the text feature into the same feature space respectively to obtain at least one mapping image feature and mapping text feature in the same feature space, and performing cross-modal feature fusion on the at least one mapping image feature and the mapping text feature respectively to obtain multi-modal fusion features corresponding to the at least one candidate image respectively;
And respectively obtaining the matching degree between each of the at least one candidate image and the target text based on the obtained at least one multi-modal fusion feature, and determining the matching image with the target text from the at least one candidate image based on the obtained at least one matching degree.
2. The method of claim 1, wherein the performing image feature fusion on the image features of the plurality of target feature dimensions corresponding to each of the at least one candidate image to obtain image fusion features corresponding to each of the at least one candidate image comprises:
for the at least one candidate image, performing the following operations respectively:
for one candidate image, weighting the image features of the candidate image corresponding to the plurality of target feature dimensions based on the weight values corresponding to the plurality of target feature dimensions to obtain corresponding image fusion features;
wherein each weight value is determined based on the number of the plurality of target feature dimensions; alternatively, each weight value is determined from importance to the image fusion feature based on the plurality of target feature dimensions.
3. The method of claim 1, wherein the cross-modal feature fusion of the at least one mapped image feature and the mapped text feature to obtain multi-modal fusion features corresponding to the at least one candidate image, respectively, comprises:
For the at least one mapped image feature, performing any one of the following operations, respectively:
vector splicing processing is carried out on the mapping image features and the mapping text features aiming at one mapping image feature, so that corresponding multi-mode fusion features are obtained;
aiming at a mapping image feature, carrying out pooling treatment on the mapping image feature and the mapping text feature to obtain a corresponding multi-mode fusion feature;
performing convolution operation processing on the mapping image feature and the mapping text feature aiming at one mapping image feature to obtain a corresponding multi-mode fusion feature;
and aiming at a mapping image feature, mapping the mapping image feature and the mapping text feature to obtain a corresponding multi-mode fusion feature.
4. A method according to any one of claims 1-3, wherein the matching of the at least one candidate image to the target text is performed by an image-text matching model, the training of which comprises:
obtaining an image-text sample set, wherein each training image-text sample in the image-text sample set comprises a sample image, a sample text and a real matching degree between the sample image and the sample text;
Performing multiple rounds of iterative training on the image-text matching model based on the image-text sample set until convergence conditions are met; wherein each round of iterative training process comprises the following steps:
determining the prediction matching degree between the sample image and the sample text included in each input training image text sample by adopting an image-text matching model used in the round;
and determining a model loss value of the image-text matching model based on the obtained prediction matching degree and the corresponding real matching degree, and carrying out parameter adjustment on the image-text matching model based on the model loss value.
5. The method of claim 4, wherein determining the model loss value of the image-text matching model based on the obtained respective predicted matching degrees and the corresponding true matching degrees comprises:
for each training image-text sample, the following processing is carried out respectively:
respectively carrying out feature coding of feature-by-feature dimensions on each input sample image and each sample text;
carrying out alignment processing on the sample image features obtained in the last time and the sample text features obtained in the last time in the feature-by-feature dimension feature coding process to obtain the similarity between the sample image features and the sample text features;
Determining an image-text alignment loss value based on the similarity;
determining an image-text matching loss value based on the obtained prediction matching degrees and the corresponding real matching degrees;
and obtaining the model loss value based on the image-text alignment loss value and the image-text matching loss value.
6. The method of claim 4, wherein determining the model loss value of the image-text matching model based on the obtained respective predicted matching degrees and the corresponding true matching degrees comprises:
respectively carrying out data enhancement processing on each input sample image to obtain enhancement image pairs corresponding to each sample image;
determining an image symmetry loss value based on a similarity between two enhanced images included in each of the respective enhanced image pairs;
determining an image-text matching loss value based on the obtained prediction matching degrees and the corresponding real matching degrees;
and obtaining the model loss value based on the image symmetry loss value and the image-text matching loss value.
7. The method of claim 6, wherein determining an image symmetry-loss value based on a similarity between two enhanced images included in each of the pair of enhanced images comprises:
For each enhanced image pair, the following processing is performed respectively:
for one enhanced image pair, respectively carrying out feature coding processing on two enhanced images included in the enhanced image pair to obtain enhanced image features corresponding to the two enhanced images;
determining a sample symmetry loss value of the enhanced image pair based on the similarity between the enhanced image features corresponding to each of the two enhanced images;
the image symmetry-loss value is determined based on the obtained respective sample symmetry-loss values.
8. A graphic retrieval apparatus, the apparatus comprising:
the feature coding module is used for respectively carrying out feature coding of feature-by-feature dimensions on each target image in the target image set to be retrieved to obtain image feature sets corresponding to each target image; wherein each image feature in the image feature set corresponds to a feature dimension;
the image-text alignment module is used for performing alignment processing based on the image features obtained last time in each image feature set and the text features of the target text to be retrieved respectively to obtain the similarity between each image feature and the text feature;
the candidate determining module is used for determining at least one candidate image from the target image set based on the obtained similarity;
The image fusion module is used for selecting a plurality of target feature dimensions from feature dimensions corresponding to an image feature set based on a preset image feature dimension selection strategy, and carrying out image feature fusion on the image features of the plurality of target feature dimensions corresponding to the at least one candidate image to obtain image fusion features corresponding to the at least one candidate image;
the image-text fusion module is used for respectively mapping the obtained at least one image fusion feature and the text feature into the same feature space, obtaining at least one mapping image feature and mapping text feature in the same feature space, and respectively performing cross-modal feature fusion on the at least one mapping image feature and the mapping text feature to obtain multi-modal fusion features corresponding to the at least one candidate image;
and the matching output module is used for respectively obtaining the matching degree between each of the at least one candidate image and the target text based on the obtained at least one multi-mode fusion characteristic, and determining the matching image with the target text from the at least one candidate image based on the obtained at least one matching degree.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that,
the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 7.
10. A computer storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the method according to any of claims 1 to 7.
CN202310446478.6A 2023-04-14 2023-04-14 Image-text retrieval method, device, equipment and storage medium Pending CN116975350A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310446478.6A CN116975350A (en) 2023-04-14 2023-04-14 Image-text retrieval method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310446478.6A CN116975350A (en) 2023-04-14 2023-04-14 Image-text retrieval method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116975350A true CN116975350A (en) 2023-10-31

Family

ID=88480400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310446478.6A Pending CN116975350A (en) 2023-04-14 2023-04-14 Image-text retrieval method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116975350A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117557881A (en) * 2024-01-12 2024-02-13 城云科技(中国)有限公司 Road crack detection method based on feature map alignment and image-text matching and application thereof
CN117557881B (en) * 2024-01-12 2024-04-05 城云科技(中国)有限公司 Road crack detection method based on feature map alignment and image-text matching and application thereof
CN117763174A (en) * 2024-01-18 2024-03-26 泰德网聚(北京)科技股份有限公司 Multi-modal retrieval method, device and storage medium
CN117788842A (en) * 2024-02-23 2024-03-29 腾讯科技(深圳)有限公司 Image retrieval method and related device
CN117788842B (en) * 2024-02-23 2024-06-07 腾讯科技(深圳)有限公司 Image retrieval method and related device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication