CN113792207B - Cross-modal retrieval method based on multi-level feature representation alignment - Google Patents
- Publication number
- CN113792207B (application CN202111149240.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- image
- data
- similarity
- visual
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a cross-modal retrieval method based on multi-level feature representation alignment, and relates to the technical field of cross-modal retrieval. In a cross-modal fine-grained precise alignment stage, the method separately calculates the global similarity, local similarity and relationship similarity between image data and text data, and fuses the three to obtain an image-text comprehensive similarity. In the neural network training stage, corresponding loss functions are designed to mine cross-modal structure constraint information and to constrain and supervise the parameter learning of the retrieval model from multiple angles. The retrieval result for a test query sample is finally obtained according to the image-text comprehensive similarity. By introducing fine-grained association relations between the two modalities, the method effectively improves the accuracy of cross-modal retrieval, and has wide market demand and application prospects in fields such as image-text retrieval and pattern recognition.
Description
Technical Field
The invention relates to the technical field of cross-modal retrieval, in particular to a cross-modal retrieval method based on multi-level feature representation alignment.
Background
With the rapid development of new-generation internet technologies such as the mobile internet and social networks, multi-modal data such as text, images and video have grown explosively. Cross-modal retrieval technology aims to realize retrieval across different modalities by mining and exploiting the association information among them; its core is similarity measurement between cross-modal data. In recent years, cross-modal retrieval has become a research hotspot at home and abroad, has attracted wide attention from academia and industry, is one of the important research fields of cross-modal intelligence, and is an important direction for the future development of information retrieval.
Cross-modal retrieval involves data of multiple modalities simultaneously, and a "heterogeneity gap" exists among them: the data are related at the level of high-level semantics but heterogeneous in their bottom-layer features. A retrieval algorithm is therefore required to deeply mine the correlation information among different modalities and to align the data of one modality with that of another.
At present, subspace learning is the mainstream approach to cross-modal retrieval, and it can be further subdivided into retrieval models based on traditional statistical correlation analysis and retrieval models based on deep learning. Methods based on traditional statistical correlation analysis map data of different modalities into a common subspace through linear mapping matrices chosen to maximize the correlation among the modalities. Methods based on deep learning use the feature extraction capability of deep neural networks to obtain effective representations of different modal data, and exploit the complex nonlinear mapping capability of neural networks to mine the complex association characteristics among cross-modal data.
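As a brief illustration of the traditional statistical route (not part of the patent), the sketch below recovers canonical correlations between two synthetic "modalities" that share a latent factor, using only NumPy; all data, dimensions and the regularization constant are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(5)
n, dx, dy, k = 200, 6, 4, 2
Z = rng.standard_normal((n, k))                          # shared latent factors
X = Z @ rng.standard_normal((k, dx)) + 0.1 * rng.standard_normal((n, dx))
Y = Z @ rng.standard_normal((k, dy)) + 0.1 * rng.standard_normal((n, dy))

Xc, Yc = X - X.mean(0), Y - Y.mean(0)
Cxx = Xc.T @ Xc / n + 1e-6 * np.eye(dx)   # regularized covariances
Cyy = Yc.T @ Yc / n + 1e-6 * np.eye(dy)
Cxy = Xc.T @ Yc / n

# Whiten each view; the singular values of the whitened cross-covariance
# are the canonical correlations (bounded above by 1).
Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
corrs = np.linalg.svd(Wx @ Cxy @ Wy.T, compute_uv=False)
# The top correlations are close to 1 because X and Y share latent structure.
```

This is the linear-projection idea the subspace methods build on; the deep-learning methods replace the linear maps with neural networks.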
In carrying out the present invention, the applicant has found that the following technical problems exist in the prior art:
the cross-modal retrieval methods provided by the prior art focus on representation learning, association analysis and alignment of the global and local features of images and texts, but lack reasoning about the relationships between visual targets and alignment of the relationship information, and cannot comprehensively and effectively use the structure constraint information contained in the training data to supervise model training; as a result, their cross-modal retrieval accuracy on images and texts is low.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a cross-modal retrieval method based on multi-level feature representation alignment, which accurately measures the similarity between an image and a text through cross-modal multi-level representation association and effectively improves retrieval accuracy, thereby solving the technical problems that existing cross-modal retrieval methods have insufficiently fine representations and insufficient cross-modal association, while also using cross-modal structure constraint information to supervise the training of the retrieval model. The technical scheme of the invention is as follows:
according to an aspect of an embodiment of the present invention, there is provided a cross-modal retrieval method based on multi-level feature representation alignment, wherein the method includes:
acquiring a training data set, wherein each group of data pairs in the training data set comprises image data, text data, and semantic tags shared by the image data and the text data;
for each group of data pairs in the training data set, respectively extracting image global features, image local features and image relation features corresponding to image data in the data pairs, and text global features, text local features and text relation features corresponding to text data in the data pairs;
for a target data pair formed by any image data and any text data in the training data set, calculating to obtain the image-text comprehensive similarity corresponding to the target data pair according to the image global feature and the text global feature corresponding to the target data pair, the image local feature and the text local feature corresponding to the target data pair, and the image relation feature and the text relation feature corresponding to the target data pair;
and designing inter-mode structure constraint loss functions and intra-mode structure constraint loss functions based on the comprehensive similarity of the corresponding images and texts of each group of target data, and training the model by adopting the inter-mode structure constraint loss functions and the intra-mode structure constraint loss functions.
In a preferred embodiment, the step of extracting, for each group of data pairs in the training dataset, an image global feature, an image local feature, and an image relationship feature corresponding to image data in the data pairs, and a text global feature, a text local feature, and a text relationship feature corresponding to text data in the data pairs, respectively, includes:
for each group of data pairs in the training data set, a convolutional neural network (CNN) is adopted to extract the image global feature $v^{(g)}$ of the image data in the pair; a visual target detector is then used to detect the visual targets included in the image data and to extract the image local feature $v_i$ of each visual target $i$, where $i = 1, \dots, M$ and $M$ is the number of visual targets included in the image data; finally, the image relation features among the visual targets are extracted through an image visual-relation encoding network, where $r^{v}_{ij}$ denotes the image relation feature between visual target $i$ and visual target $j$;
for each group of data pairs in the training data set, a word embedding model is adopted to convert each word in the text data of the pair into a word vector $w_j$, where $j = 1, \dots, N$ and $N$ is the number of words included in the text data; the word vectors are then input in sequence into a recurrent neural network to obtain the text global feature $t^{(g)}$ of the text data; each word vector is input into a feedforward neural network to obtain the text local feature $t_j$ of each word; and at the same time each word vector is input into a text relation encoding network to extract the text relation features among the words, where $r^{t}_{ij}$ denotes the text relation feature between word $i$ and word $j$.
In a preferred embodiment, the step of calculating, for a target data pair formed by any image data and any text data in the training data set, the image-text integrated similarity corresponding to the target data pair according to the image global feature and the text global feature corresponding to the target data pair, the image local feature and the text local feature corresponding to the target data pair, and the image relationship feature and the text relationship feature corresponding to the target data pair includes:
for a target data pair consisting of any image data and any text data in the training data set, the image-text global similarity $s_{glo}$ of the pair is obtained from the cosine similarity of the image global feature $v^{(g)}$ of the image data and the text global feature $t^{(g)}$ of the text data; the calculation formula of the image-text global similarity is formula (1):

$$s_{glo}(I, T) = \frac{v^{(g)\top} t^{(g)}}{\lVert v^{(g)} \rVert \, \lVert t^{(g)} \rVert} \qquad (1)$$
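As a minimal sketch (assuming only that formula (1) is the standard cosine similarity), the global-similarity computation can be written in NumPy; the 512-dimensional random vectors stand in for the CNN and recurrent-network global features:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Image-text global similarity of formula (1): cosine of the two global features."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
v_g = rng.standard_normal(512)   # stand-in image global feature
t_g = rng.standard_normal(512)   # stand-in text global feature
s_glo = cosine_similarity(v_g, t_g)
```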
a text-guided attention mechanism is adopted to calculate the weight $a_i$ of each visual target included in the image data of the target data pair; the image local features $v_i$ are weighted by the corresponding weights and mapped through a feedforward neural network to obtain new image local representations $\hat{v}_i$. A visual-guided attention mechanism is then adopted to calculate the weight $b_j$ of each word included in the text data of the target data pair; the text local features $t_j$ are weighted by the corresponding weights and mapped through a feedforward neural network to obtain new text local representations $\hat{t}_j$. The cosine similarities between all image local representations $\hat{v}_i$ and text local representations $\hat{t}_j$ are calculated, and the image-text local similarity $s_{loc}$ of the target data pair is obtained as their mean value; the calculation formula of the image-text local similarity is formula (2), where $M$ is the number of visual targets and $N$ is the number of words:

$$s_{loc}(I, T) = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \cos(\hat{v}_i, \hat{t}_j) \qquad (2)$$
the image-text relationship similarity $s_{rel}$ of the target data pair is calculated as the mean of the cosine similarities between corresponding image relation features and text relation features; the calculation formula of the image-text relationship similarity is formula (3), where $P$ denotes the number of relationships between the image data and the text data:

$$s_{rel}(I, T) = \frac{1}{P} \sum_{p=1}^{P} \cos(r^{v}_{p}, r^{t}_{p}) \qquad (3)$$
the image-text comprehensive similarity of the target data pair is calculated from the global similarity $s_{glo}$, local similarity $s_{loc}$ and relationship similarity $s_{rel}$; the calculation formula of the image-text comprehensive similarity is formula (4):

$$s(I, T) = s_{glo}(I, T) + s_{loc}(I, T) + s_{rel}(I, T) \qquad (4)$$
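Putting the three levels together, the following NumPy sketch traces formulas (1)–(4) as described in the surrounding text; the mean-of-cosines forms for formulas (2) and (3), the aligned relation pairs, and the simple additive fusion for formula (4) are assumptions read from the prose, and all features are random stand-ins:

```python
import numpy as np

def _norm_rows(A: np.ndarray) -> np.ndarray:
    return A / np.linalg.norm(A, axis=1, keepdims=True)

def mean_pairwise_cosine(V: np.ndarray, T: np.ndarray) -> float:
    """Formula (2): mean cosine over all visual-target / word pairs."""
    return float((_norm_rows(V) @ _norm_rows(T).T).mean())

def paired_cosine_mean(A: np.ndarray, B: np.ndarray) -> float:
    """Formula (3): mean cosine over aligned relation-feature pairs."""
    return float((_norm_rows(A) * _norm_rows(B)).sum(axis=1).mean())

rng = np.random.default_rng(1)
d = 64
v_g, t_g = rng.standard_normal(d), rng.standard_normal(d)   # global features
V_local = rng.standard_normal((5, d))    # M = 5 visual-target representations
T_local = rng.standard_normal((8, d))    # N = 8 word representations
R_img = rng.standard_normal((6, d))      # P = 6 image relation features
R_txt = rng.standard_normal((6, d))      # P = 6 text relation features

s_glo = float(v_g @ t_g / (np.linalg.norm(v_g) * np.linalg.norm(t_g)))  # formula (1)
s_loc = mean_pairwise_cosine(V_local, T_local)                          # formula (2)
s_rel = paired_cosine_mean(R_img, R_txt)                                # formula (3)
s_total = s_glo + s_loc + s_rel   # formula (4): additive fusion (assumed)
```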
In a preferred embodiment, the inter-modality structure constraint loss function is calculated as formula (5), where $B$ is the number of samples, $\alpha$ is a model hyperparameter (the margin), $(I, T)$ is a matched target data pair, and $(I, T^{-})$ and $(I^{-}, T)$ are non-matching target data pairs:

$$L_{inter} = \sum_{b=1}^{B} \Big[ \max\big(0,\ \alpha - s(I, T) + s(I, T^{-})\big) + \max\big(0,\ \alpha - s(I, T) + s(I^{-}, T)\big) \Big] \qquad (5)$$
The intra-modality structure constraint loss function is calculated as formula (6), where $(I, I^{+}, I^{-})$ is an image triplet in which $I$ and $I^{+}$ share more common semantic tags than $I$ and $I^{-}$, $(T, T^{+}, T^{-})$ is a text triplet in which $T$ and $T^{+}$ share more common semantic tags than $T$ and $T^{-}$, $s(\cdot, \cdot)$ denotes the similarity between two samples of the same modality, and $\beta$ is a margin hyperparameter:

$$L_{intra} = \sum \Big[ \max\big(0,\ \beta - s(I, I^{+}) + s(I, I^{-})\big) + \max\big(0,\ \beta - s(T, T^{+}) + s(T, T^{-})\big) \Big] \qquad (6)$$
In a preferred embodiment, the step of training the neural network model using the inter-modality structure constraint loss function and the intra-modality structure constraint loss function comprises:
obtaining matched target data pairs, unmatched target data pairs, image triplets and text triplets from the training data set by random sampling; calculating the inter-modality structure constraint loss value according to formula (5) and the intra-modality structure constraint loss value according to formula (6); fusing them according to formula (7); and optimizing the network parameters with the back-propagation algorithm:

$$L = L_{inter} + \lambda L_{intra} \qquad (7)$$

where $\lambda$ is a hyperparameter balancing the two loss terms.
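A toy sketch of the hinge-style losses of formulas (5)–(7) for a single sampled group; all similarity scores and the values of the margins $\alpha$, $\beta$ and the weight $\lambda$ are invented for illustration, not taken from the patent:

```python
def hinge(margin: float, pos: float, neg: float) -> float:
    """One triplet term: penalize when the negative is within `margin` of the positive."""
    return max(0.0, margin - pos + neg)

# Invented similarity scores for one sampled batch element.
s_pos, s_neg_text, s_neg_image = 0.40, 0.35, 0.30   # s(I,T), s(I,T-), s(I-,T)
alpha = 0.2
L_inter = hinge(alpha, s_pos, s_neg_text) + hinge(alpha, s_pos, s_neg_image)  # formula (5)

beta = 0.2
L_intra = (hinge(beta, 0.50, 0.45)     # image-triplet term of formula (6)
           + hinge(beta, 0.60, 0.30))  # text-triplet term of formula (6)

lam = 0.5                              # fusion hyperparameter of formula (7)
L_total = L_inter + lam * L_intra
```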
In a preferred embodiment, the step of extracting the image relation features among the visual targets through the image visual-relation encoding network comprises:
obtaining, via the image visual target detector, the features $v_i$ and $v_j$ of visual target $i$ and visual target $j$ and the feature $v_{ij}$ of the two targets' joint region, and fusing the features with formula (8) to calculate the relation feature:

$$r^{v}_{ij} = \sigma\big( W_r \, [\, v_i ;\, v_j ;\, v_{ij} \,] \big) \qquad (8)$$

where $[\,\cdot\, ;\, \cdot\,]$ is the vector concatenation operation, $\sigma$ is the neuron activation function, and $W_r$ is a model parameter.
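A NumPy sketch of the relation-encoding step of formula (8); ReLU is used for the unspecified activation, and the detector features and the parameter $W_r$ are random stand-ins:

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, x)

rng = np.random.default_rng(2)
d = 32
v_i = rng.standard_normal(d)        # feature of visual target i
v_j = rng.standard_normal(d)        # feature of visual target j
v_ij = rng.standard_normal(d)       # feature of the two targets' joint region
W_r = rng.standard_normal((d, 3 * d)) * 0.1   # assumed model parameter

# Formula (8): concatenate the three features, project, apply the activation.
r_ij = relu(W_r @ np.concatenate([v_i, v_j, v_ij]))
```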
In a preferred embodiment, the step of inputting each word vector into the text relation encoding network to extract the text relation features among the words comprises:

in the text relation encoding network, the text relation feature between word $i$ and word $j$ is calculated with formula (9):

$$r^{t}_{ij} = \sigma\big( W_t \, [\, w_i ;\, w_j \,] \big) \qquad (9)$$

where $\sigma$ is the neuron activation function and $W_t$ is a model parameter.
In a preferred embodiment, the step of calculating, with the text-guided attention mechanism, the weight of each visual target included in the image data of the target data pair, weighting the image local features $v_i$ by the corresponding weights, and mapping them through a feedforward neural network to obtain the new image local representations $\hat{v}_i$ comprises:
using the text-guided attention mechanism, the weight of each visual target in the image is calculated by formula (10):

$$a_i = \frac{\exp\big( (W_1 v_i)^{\top} (W_2 t^{(g)}) \big)}{\sum_{k=1}^{M} \exp\big( (W_1 v_k)^{\top} (W_2 t^{(g)}) \big)} \qquad (10)$$

where $W_1$ and $W_2$ are model parameters;

each visual target is weighted by formula (11) and mapped through a feedforward neural network to obtain the new image local representation:

$$\hat{v}_i = W_v \, (a_i v_i) \qquad (11)$$

where $W_v$ is a model parameter.
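A NumPy sketch of text-guided attention in the spirit of formulas (10)–(11); the bilinear scoring and single linear mapping are assumed forms (the patent names the parameters but the exact formulas were lost in extraction), and all tensors are random stand-ins:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
M, d = 4, 32
V = rng.standard_normal((M, d))            # local features of M visual targets
t_g = rng.standard_normal(d)               # text global feature (the guide)
W1 = rng.standard_normal((d, d)) * 0.1     # assumed model parameters
W2 = rng.standard_normal((d, d)) * 0.1
W_v = rng.standard_normal((d, d)) * 0.1

a = softmax((V @ W1.T) @ (W2 @ t_g))       # formula (10): one weight per target
V_hat = (a[:, None] * V) @ W_v.T           # formula (11): weight, then map
```

The visual-guided attention over words (formulas (12)–(13)) is symmetric, with the image global feature as the guide.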
In a preferred embodiment, the step of calculating, with the visual-guided attention mechanism, the weight of each word included in the text data of the target data pair, weighting the text local features $t_j$ by the corresponding weights, and mapping them through a feedforward neural network to obtain the new text local representations $\hat{t}_j$ comprises:
using the visual-guided attention mechanism, the weight of each word in the text is calculated by formula (12):

$$b_j = \frac{\exp\big( (W_3 w_j)^{\top} (W_4 v^{(g)}) \big)}{\sum_{k=1}^{N} \exp\big( (W_3 w_k)^{\top} (W_4 v^{(g)}) \big)} \qquad (12)$$

where $W_3$ and $W_4$ are model parameters;

the text local features $t_j$ are weighted by the corresponding weights through formula (13) and mapped through a feedforward neural network to obtain the new text local representation:

$$\hat{t}_j = W_w \, (b_j t_j) \qquad (13)$$

where $W_w$ is a model parameter.
In a preferred embodiment, the training data set is acquired from Wikipedia, MS COCO or Pascal VOC.
Compared with the prior art, the cross-modal retrieval method based on multi-level feature representation alignment has the following advantages:
According to the cross-modal retrieval method based on multi-level feature representation alignment, the global similarity, local similarity and relationship similarity between image data and text data are calculated separately in the cross-modal fine-grained precise alignment stage, and the three are fused to obtain the image-text comprehensive similarity. Corresponding loss functions are designed in the network training stage to mine cross-modal structure constraint information and to constrain and supervise the parameter learning of the retrieval model from multiple angles. The retrieval result for a test query sample is finally obtained according to the image-text comprehensive similarity. By introducing fine-grained association relations between the two modalities, the method effectively improves the accuracy of cross-modal retrieval, and has wide market demand and application prospects in fields such as image-text retrieval and pattern recognition.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic diagram of an implementation environment provided by one embodiment of the present invention.
FIG. 2 is a method flow diagram illustrating a cross-modal retrieval method based on multi-level feature representation alignment, according to an example embodiment.
FIG. 3 is a schematic diagram of inter-modal structure constraint loss in accordance with an embodiment of the present invention.
FIG. 4 is a schematic diagram of intra-modal structural constraint loss, in accordance with an embodiment of the present invention.
FIG. 5 is a schematic representation of one result of text retrieval of an image in accordance with an embodiment of the present invention.
FIG. 6 is a block diagram of an apparatus for implementing a cross-modal retrieval method based on multi-level feature representation alignment, according to an example embodiment.
FIG. 7 is a block diagram of an apparatus for implementing a cross-modal retrieval method based on multi-level feature representation alignment, according to an example embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below with reference to specific embodiments (but not limited to the illustrated embodiments) and the accompanying drawings. It is apparent that the described embodiments are only some embodiments of the present invention, not all of them; all other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
The embodiment of the invention can be applied to various scenes, and the related implementation environment can comprise an input/output scene of a single server or an interaction scene of a terminal and the server. When the implementation environment is an input/output scene of a single server, the acquisition and storage main bodies of the image data and the text data are servers; when the implementation environment is an interaction scenario between the terminal and the server, a schematic diagram of the implementation environment according to the embodiment may be shown in fig. 1. In a schematic diagram of an implementation environment shown in fig. 1, the implementation environment includes a terminal 101 and a server 102.
The terminal 101 is an electronic device running at least one client, where a client is the client of an application (APP). The terminal 101 may be a smart phone, a tablet computer, or the like.
The terminal 101 and the server 102 are connected by a wireless or wired network. The terminal 101 is used for transmitting data to the server 102, or the terminal is used for receiving data transmitted by the server 102. In one possible implementation, the terminal 101 may send at least one of image data or text data to the server 102.
The server 102 is used to receive data transmitted from the terminal 101, or the server 102 is used to transmit data to the terminal 101. The server 102 may analyze the data sent by the terminal 101, so as to match the image data and the text data with the highest similarity from the database, and send the image data and the text data to the terminal 101.
FIG. 2 is a flow chart of a method for cross-modal retrieval based on alignment of multi-level feature representations, as shown in FIG. 2, according to an exemplary embodiment, the method comprising:
step 100: a training data set is acquired, and for each group of data pairs in the training data set, the data pairs comprise image data, text data and semantic tags, wherein the semantic tags correspond to the image data and the text data together.
It should be noted that the text data may be text content corresponding to any language, such as english, chinese, japanese, german, etc.; the image data may be image content corresponding to any color, such as a color image, a gray scale image, and the like.
Step 200: and respectively extracting the image global features, the image local features and the image relation features corresponding to the image data in the data pair, and the text global features, the text local features and the text relation features corresponding to the text data in the data pair for each group of data pairs in the training data set.
In a preferred embodiment, step 200 specifically includes:
Step 210: for each group of data pairs in the training data set, a convolutional neural network (CNN) is adopted to extract the image global feature $v^{(g)}$ of the image data in the pair; a visual target detector is then used to detect the visual targets included in the image data and to extract the image local feature $v_i$ of each visual target $i$, where $i = 1, \dots, M$ and $M$ is the number of visual targets included in the image data; finally, the image relation features among the visual targets are extracted through an image visual-relation encoding network, where $r^{v}_{ij}$ denotes the image relation feature between visual target $i$ and visual target $j$.
Step 220: for each group of data pairs in the training data set, a word embedding model is adopted to convert each word in the text data of the pair into a word vector $w_j$, where $j = 1, \dots, N$ and $N$ is the number of words included in the text data; the word vectors are then input in sequence into a recurrent neural network to obtain the text global feature $t^{(g)}$ of the text data; each word vector is input into a feedforward neural network to obtain the text local feature $t_j$ of each word; and at the same time each word vector is input into a text relation encoding network to extract the text relation features among the words, where $r^{t}_{ij}$ denotes the text relation feature between word $i$ and word $j$.
By implementing step 200 described above, cross-modal multi-level refinement representation may be achieved.
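The multi-level representation produced by steps 210 and 220 can be pictured as one feature bundle per sample; the container below is only an organizational sketch, with names and shapes assumed for illustration:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class MultiLevelFeatures:
    """Global, local and relation features for one image or one text."""
    global_feat: np.ndarray      # shape (d,): CNN or recurrent-network output
    local_feats: np.ndarray      # shape (M, d): visual targets or words
    relation_feats: np.ndarray   # shape (M*(M-1)//2, d): pairwise relations

rng = np.random.default_rng(4)
d, M = 64, 5
image_feats = MultiLevelFeatures(
    global_feat=rng.standard_normal(d),
    local_feats=rng.standard_normal((M, d)),
    relation_feats=rng.standard_normal((M * (M - 1) // 2, d)),
)
```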
Step 300: and calculating a target data pair consisting of any image data and any text data in the training data set according to the image global feature and the text global feature corresponding to the target data pair, the image local feature and the text local feature corresponding to the target data pair, and the image relation feature and the text relation feature corresponding to the target data pair to obtain the image-text comprehensive similarity corresponding to the target data pair.
In a preferred embodiment, step 300 specifically includes:
step 310: for a target data pair consisting of any image data and any text data in the training data set, the image-text global similarity s_g(v,t) corresponding to the target data pair is calculated based on the cosine distance between the image global feature f_vg of the image data and the text global feature f_tg of the text data in the target data pair.
Wherein the image-text global similarity s_g(v,t) is calculated as formula (1):

s_g(v,t) = cos(f_vg, f_tg)    formula (1)
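As a minimal illustrative sketch (not part of the patented embodiment), formula (1) can be computed as follows, assuming plain Python lists as the global feature vectors; the function names are chosen for illustration only:

```python
import math

def cosine(u, v):
    # cos(u, v) = <u, v> / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def global_similarity(f_vg, f_tg):
    # Formula (1): s_g(v, t) = cos(f_vg, f_tg)
    return cosine(f_vg, f_tg)
```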
Step 320: a text-guided attention mechanism is adopted to calculate the weight of each visual object included in the image data of the target data pair; the image local features f_vl^i of the visual objects are weighted by the corresponding weights and mapped through a feedforward neural network to obtain new image local representations f̂_vl^i. A visual-guided attention mechanism is then adopted to calculate the weight of each word included in the text data of the target data pair; the text local features f_tl^j of the words are weighted by the corresponding weights and mapped through a feedforward neural network to obtain new text local representations f̂_tl^j. The cosine similarities between the image local representations f̂_vl^i and the text local representations f̂_tl^j of all visual objects and words are calculated, and the image-text local similarity s_l(v,t) of the target data pair is taken as their mean value.
Wherein the image-text local similarity s_l(v,t) is calculated as formula (2), M being the number of visual objects and N the number of words:

s_l(v,t) = (1/(M·N)) Σ_{i=1}^{M} Σ_{j=1}^{N} cos(f̂_vl^i, f̂_tl^j)    formula (2)
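The following sketch shows one plausible reading of formula (2) as the mean cosine similarity over all (visual object, word) pairs; the attended local representations are assumed to be given as plain lists, and the function names are hypothetical:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def local_similarity(image_locals, text_locals):
    # Mean cosine similarity over all (visual object, word) pairs:
    # s_l(v, t) = (1 / (M * N)) * sum_i sum_j cos(fhat_vl_i, fhat_tl_j)
    M, N = len(image_locals), len(text_locals)
    total = sum(cosine(fi, fj) for fi in image_locals for fj in text_locals)
    return total / (M * N)
```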
Step 330: the image-text relation similarity s_r(v,t) corresponding to the target data pair is calculated as the mean of the cosine similarities between the image relation features and the text relation features of the target data pair. Wherein the image-text relation similarity s_r(v,t) is calculated as formula (3), P representing the number of relations between the image data and the text data:

s_r(v,t) = (1/P) Σ_{p=1}^{P} cos(r_p^v, r_p^t)    formula (3)
Step 340: the image-text comprehensive similarity s(v,t) corresponding to the target data pair is calculated from the image-text global similarity s_g(v,t), the image-text local similarity s_l(v,t) and the image-text relation similarity s_r(v,t). Wherein the image-text comprehensive similarity s(v,t) is calculated as formula (4):

s(v,t) = s_g(v,t) + s_l(v,t) + s_r(v,t)    formula (4)
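Formulas (3) and (4) can be sketched together as follows; the relation features are assumed to be given as P aligned pairs of vectors, and the function names are illustrative only:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def relation_similarity(image_rels, text_rels):
    # Formula (3) reading: mean cosine similarity over the P aligned
    # (image relation feature, text relation feature) pairs
    P = len(image_rels)
    return sum(cosine(rv, rt) for rv, rt in zip(image_rels, text_rels)) / P

def combined_similarity(s_g, s_l, s_r):
    # Formula (4): s(v, t) = s_g(v, t) + s_l(v, t) + s_r(v, t)
    return s_g + s_l + s_r
```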
By implementation of step 300 described above, cross-modal fine-grained precise alignment may be achieved.
Step 400: based on the image-text comprehensive similarity corresponding to each target data pair, an inter-modality structure constraint loss function and an intra-modality structure constraint loss function are designed, and the neural network model is trained by adopting the inter-modality structure constraint loss function and the intra-modality structure constraint loss function.
In a preferred embodiment, the inter-modality structure constraint loss function is calculated as formula (5), where B is the number of samples, α is a model hyper-parameter, (v_i, t_i) is a matched target data pair, and (v_i, t^-) and (t_i, v^-) are non-matched target data pairs:

L_inter = Σ_{i=1}^{B} [ max(0, α − s(v_i, t_i) + s(v_i, t^-)) + max(0, α − s(v_i, t_i) + s(v^-, t_i)) ]    formula (5)
The intra-modality structure constraint loss function is calculated as formula (6), where (v_i, v^+, v^-) is an image triplet in which v^+ shares more common semantic tags with v_i than v^- does, and (t_i, t^+, t^-) is a text triplet in which t^+ shares more common semantic tags with t_i than t^- does:

L_inner = Σ_{i=1}^{B} [ max(0, α − s(v_i, v^+) + s(v_i, v^-)) + max(0, α − s(t_i, t^+) + s(t_i, t^-)) ]    formula (6)
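A hedged sketch of the two loss functions, reading formulas (5) and (6) as standard bidirectional triplet hinge losses with margin alpha over precomputed similarities (the exact form of the patented losses may differ in detail; all names are illustrative):

```python
def hinge(x):
    # max(0, x)
    return max(0.0, x)

def inter_modal_loss(s_match, s_txt_neg, s_img_neg, alpha):
    # Reading of formula (5): each matched-pair similarity must exceed the
    # similarity of its non-matched text and image counterparts by alpha
    return sum(hinge(alpha - sm + sn1) + hinge(alpha - sm + sn2)
               for sm, sn1, sn2 in zip(s_match, s_txt_neg, s_img_neg))

def intra_modal_loss(s_pos, s_neg, alpha):
    # Reading of formula (6): within one modality, the triplet member sharing
    # more semantic tags (s_pos) must score higher than the one sharing fewer
    return sum(hinge(alpha - sp + sn) for sp, sn in zip(s_pos, s_neg))
```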
FIG. 3 is a schematic diagram illustrating a loss of structural constraint between modes according to an embodiment of the present invention.
In a preferred embodiment, the step of training the neural network model using the inter-modality structure constraint loss function and the intra-modality structure constraint loss function includes:
obtaining matched target data pairs, non-matched target data pairs, image triplets and text triplets from the training data set by random sampling, calculating the inter-modality structure constraint loss value according to the inter-modality structure constraint loss function and the intra-modality structure constraint loss value according to the intra-modality structure constraint loss function, fusing the two according to formula (7), and optimizing the network parameters by the back-propagation algorithm:

L = η·L_inter + (1 − η)·L_inner    formula (7)

where η is a hyper-parameter.
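A direct transcription of the fusion in formula (7), with hypothetical argument names:

```python
def total_loss(l_inter, l_inner, eta):
    # Formula (7): L = eta * L_inter + (1 - eta) * L_inner
    return eta * l_inter + (1.0 - eta) * l_inner
```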
Wherein fig. 4 is a schematic diagram of intra-modal structural constraint loss in an embodiment of the present invention.
Through the implementation of step 400, cross-modal structure constraint information can be used to supervise the training of the retrieval model, so that network training proceeds in the direction of increasing the similarity between matched target data pairs and decreasing the similarity between non-matched target data pairs; at the same time, the trained network learns more discriminative image and text representations.
In a preferred embodiment, extracting the image relation features r_ij^v between the visual objects through the image visual relation coding network comprises the steps of:

obtaining, via the image visual object detector, the features f_vl^i and f_vl^j of visual object i and visual object j in the image and the feature f_vl^ij of the joint region of the two objects, fusing the features by formula (8), and calculating each relation feature:

r_ij^v = σ(W_1 [f_vl^i; f_vl^j; f_vl^ij])    formula (8)

where [;] is the vector concatenation operation, σ is the neuron activation function, and W_1 is a model parameter.
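A sketch of the fusion in formula (8), assuming a ReLU activation stands in for σ and that the three features are fused by concatenation followed by the learned projection W_1; all names are illustrative:

```python
def relu(x):
    # stand-in for the activation sigma in formula (8)
    return [max(0.0, xi) for xi in x]

def matvec(W, x):
    # dense projection: one output per row of W
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def relation_feature(f_i, f_j, f_union, W1):
    # Formula (8) sketch: r_ij = sigma(W1 [f_i; f_j; f_union])
    concat = list(f_i) + list(f_j) + list(f_union)
    return relu(matvec(W1, concat))
```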
In a preferred embodiment, inputting each word vector into the text relation coding network to extract the text relation features r_ij^t between the words comprises the steps of:

in the text relation coding network, calculating the text relation feature r_ij^t between word i and word j by formula (9):

r_ij^t = σ(W_2 [w_i; w_j])    formula (9)

where σ represents the neuron activation function and W_2 is a model parameter.
In a preferred embodiment, adopting the text-guided attention mechanism to calculate the weight of each visual object included in the image data of the target data pair, weighting the image local features f_vl^i of the visual objects by the corresponding weights, and mapping them through a feedforward neural network to obtain the new image local representations f̂_vl^i comprises the steps of:

using the text-guided attention mechanism, calculating the weight a_i of each visual object in the image by formula (10):

a_i = exp(W_4 σ(W_3 [f_vl^i; f_tg])) / Σ_{k=1}^{M} exp(W_4 σ(W_3 [f_vl^k; f_tg]))    formula (10)

where W_3 and W_4 are model parameters;

weighting each visual object by formula (11) and mapping it through a feedforward neural network to obtain the new image local representation f̂_vl^i:

f̂_vl^i = W_5 (a_i · f_vl^i)    formula (11)

where W_5 is a model parameter.
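The text-guided attention of formulas (10)-(11) can be sketched as follows; the learned scoring network is abstracted into precomputed pre-softmax scores, and the mapping W_5 is omitted, so this illustrates the weighting pattern rather than the exact patented computation (names are hypothetical):

```python
import math

def softmax(scores):
    # numerically stable softmax over the M pre-softmax relevance scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def text_guided_pooling(image_locals, scores):
    # attention weights over the M visual objects, then a weighted
    # combination of their local features
    weights = softmax(scores)
    dim = len(image_locals[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, image_locals))
              for d in range(dim)]
    return weights, pooled
```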
In a preferred embodiment, adopting the visual-guided attention mechanism to calculate the weight of each word included in the text data of the target data pair, weighting the text local features f_tl^j of the words by the corresponding weights, and mapping them through a feedforward neural network to obtain the new text local representations f̂_tl^j comprises the steps of:

using the visual-guided attention mechanism, calculating the weight b_j of each word in the text by formula (12):

b_j = exp(W_7 σ(W_6 [f_tl^j; f_vg])) / Σ_{k=1}^{N} exp(W_7 σ(W_6 [f_tl^k; f_vg]))    formula (12)

where W_6 and W_7 are model parameters;

weighting the text local features f_tl^j by formula (13) and mapping them through a feedforward neural network to obtain the new text local representation f̂_tl^j:

f̂_tl^j = W_8 (b_j · f_tl^j)    formula (13)

where W_8 is a model parameter.
In a preferred embodiment, the training data set is obtained from Wikipedia, MS COCO or Pascal VOC.
It should be noted that, after the neural network model has been trained through steps 100-400, it can accurately output the similarity between two data items. One modality type in the test data set is used as the query modality and the other as the target modality; each item of the query modality serves as a query sample for retrieving data of the target modality, and the similarity between the query sample and each candidate is calculated according to the image-text comprehensive similarity formula shown in formula (4). In one possible implementation, the neural network model may output the target-modality data with the highest similarity as the matching data, or sort the similarities in descending order to obtain a result list of a preset number of target-modality data items, thereby implementing the cross-modal retrieval operation between data of different modalities.
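The retrieval step described above reduces to ranking candidates by their comprehensive similarity; a minimal sketch with an illustrative function name:

```python
def retrieve(similarities, top_n):
    # rank candidate target-modality items by their image-text comprehensive
    # similarity s(v, t) (formula (4)) and return the top_n indices
    order = sorted(range(len(similarities)),
                   key=lambda i: similarities[i], reverse=True)
    return order[:top_n]
```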
In this embodiment, the MS COCO cross-modal dataset is adopted for the experiments. The dataset was first proposed in the literature (T. Lin, et al., Microsoft COCO: Common objects in context, ECCV 2014, pp. 740-755.) and has become one of the most commonly used experimental datasets in the cross-modal retrieval field. Each picture in the dataset carries 5 text labels; 82,783 pictures and their text labels are used as the training sample set, and 5,000 pictures and their text labels are randomly selected from the remaining samples as the test sample set. In order to better illustrate the beneficial effects of the cross-modal retrieval method based on multi-level feature representation alignment provided by the embodiment of the present invention, it is experimentally compared with the following 3 existing cross-modal retrieval methods:
Existing method 1: the Order-embedding method described in the literature (I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun, Order-embeddings of images and language, ICLR, 2016.).
Existing method 2: the VSE++ method described in the literature (F. Faghri, D. Fleet, R. Kiros, and S. Fidler, VSE++: Improved visual-semantic embeddings with hard negatives, BMVC, 2018.).
Existing method 3: the c-VRANet method described in the literature (J. Yu, W. Zhang, Y. Lu, Z. Qin, et al., Reasoning on the relation: Enhancing visual representation for visual question answering and cross-modal retrieval, IEEE Transactions on Multimedia, 22(12): 3196-3209, 2020.).
The experiment adopts the R@n index commonly used in the cross-modal retrieval field to evaluate retrieval accuracy. The index denotes the percentage of correct samples among the n samples returned by the retrieval; the higher the index, the better the retrieval result. In the experiment, n is set to 1, 5 and 10 respectively.
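The R@n index can be computed as follows; `ranked_results` holds, per query, the retrieved item identifiers in descending similarity order, and `ground_truth` the correct item per query (names are illustrative):

```python
def recall_at_n(ranked_results, ground_truth, n):
    # R@n: percentage of queries whose correct sample appears among the
    # top n returned samples
    hits = sum(1 for ranked, gt in zip(ranked_results, ground_truth)
               if gt in ranked[:n])
    return 100.0 * hits / len(ground_truth)
```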
Table 1
As can be seen from the data shown in Table 1, compared with the existing cross-modal retrieval methods, the cross-modal retrieval method based on multi-level feature representation alignment provided by the invention significantly improves the retrieval accuracy on both tasks: retrieving text data from image data and retrieving image data from text data. This fully demonstrates the effectiveness of the proposed refined alignment of global-local-relation multi-level feature representations of images and texts. For ease of understanding, a schematic diagram of text-to-image retrieval results obtained by the embodiment of the present invention is also shown in FIG. 5, where the first column is the query text, the second column is the matching image given by the dataset, and the third to seventh columns are the top five retrieval results by similarity.
Compared with the prior art, the cross-modal retrieval method based on multi-level characteristic representation alignment can achieve higher retrieval accuracy.
In summary, in the cross-modal retrieval method based on multi-level feature representation alignment provided by the embodiments of the present invention, the global similarity, local similarity and relation similarity between image and text data are calculated separately in the cross-modal fine-grained precise alignment stage and fused into the image-text comprehensive similarity. In the network training stage, corresponding loss functions are designed to mine cross-modal structure constraint information and to constrain and supervise the parameter learning of the retrieval model from multiple angles. Finally, the retrieval result of a test query sample is obtained according to the image-text comprehensive similarity. By introducing the fine-grained association between image and text data, the accuracy of cross-modal retrieval is effectively improved, and the method has broad market demand and application prospects in fields such as image-text retrieval and pattern recognition.
FIG. 6 is a block diagram of an apparatus for implementing a cross-modal retrieval method based on multi-level feature representation alignment, according to an example embodiment. For example, apparatus 600 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 6, apparatus 600 may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.
The processing component 602 generally controls overall operation of the apparatus 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 602 may include one or more processors 620 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 602 can include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 may include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operations at the apparatus 600. Examples of such data include instructions for any application or method operating on the apparatus 600, contact data, phonebook data, messages, pictures, videos, and the like. The memory 604 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 606 provides power to the various components of the device 600. The power supply components 606 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 600.
The multimedia component 608 includes a screen that provides an output interface between the device 600 and the target user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a target user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or sliding action, but also the duration and pressure associated with the touch or sliding operation. In some embodiments, the multimedia component 608 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the apparatus 600 is in an operational mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 600 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 604 or transmitted via the communication component 616. In some embodiments, audio component 610 further includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 614 includes one or more sensors for providing status assessment of various aspects of the apparatus 600. For example, the sensor assembly 614 may detect the on/off state of the device 600, the relative positioning of the assemblies, such as the display and keypad of the device 600, the sensor assembly 614 may also detect the change in position of the device 600 or one of the assemblies of the device 600, the presence or absence of a target user contacting the device 600, the orientation or acceleration/deceleration of the device 600, and the change in temperature of the device 600. The sensor assembly 614 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor assembly 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 614 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate communication between the apparatus 600 and other devices in a wired or wireless manner. The device 600 may access a wireless network based on a communication standard, such as WiFi,2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 616 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 616 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, such as memory 604, including instructions executable by processor 620 of apparatus 600 to perform the above-described method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
A non-transitory computer readable storage medium, which when executed by a processor of apparatus 600, causes apparatus 600 to perform a cross-modal retrieval method based on multi-level feature representation alignment, the method comprising:
acquiring a training data set, wherein for each group of data pairs in the training data set, the data pairs comprise image data, text data and semantic tags which are corresponding to the image data and the text data together;
for each group of data pairs in the training data set, respectively extracting image global features, image local features and image relation features corresponding to image data in the data pairs, and text global features, text local features and text relation features corresponding to text data in the data pairs;
for a target data pair formed by any image data and any text data in the training data set, calculating to obtain the image-text comprehensive similarity corresponding to the target data pair according to the image global feature and the text global feature corresponding to the target data pair, the image local feature and the text local feature corresponding to the target data pair, and the image relation feature and the text relation feature corresponding to the target data pair;
Based on the comprehensive similarity of the corresponding image-text of each group of target data, designing an inter-mode structure constraint loss function and an intra-mode structure constraint loss function, and training a neural network model by adopting the inter-mode structure constraint loss function and the intra-mode structure constraint loss function.
FIG. 7 is a block diagram of an apparatus for implementing a cross-modal retrieval method based on multi-level feature representation alignment, according to an example embodiment. For example, the apparatus 700 may be provided as a server. Referring to fig. 7, the apparatus 700 includes a processing component 722 that further includes one or more processors, and memory resources represented by a memory 732 for storing instructions, such as application programs, executable by the processing component 722. The application programs stored in the memory 732 may include one or more modules that each correspond to a set of instructions. Further, the processing component 722 is configured to execute the instructions to perform the above-described cross-modal retrieval method based on multi-level feature representation alignment.
The apparatus 700 may further comprise a power component 726 configured to perform power management of the apparatus 700, a wired or wireless network interface 750 configured to connect the apparatus 700 to a network, and an input/output (I/O) interface 758. The apparatus 700 may operate based on an operating system stored in the memory 732, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
While the invention has been described in detail in the foregoing general description, embodiments and experiments, it will be apparent to those skilled in the art that modifications and improvements can be made thereto. Accordingly, such modifications or improvements may be made without departing from the spirit of the invention and are intended to be within the scope of the invention as claimed.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof.
Claims (5)
1. A cross-modal retrieval method based on multi-level feature representation alignment, the method comprising:
acquiring a training data set, wherein for each group of data pairs in the training data set, the data pairs comprise image data, text data and semantic tags which are corresponding to the image data and the text data together;
For each group of data pairs in the training data set, a convolutional neural network CNN is adopted to extract the image global feature f_vg of the image data corresponding to the data pair; a visual object detector is then used to detect the visual objects included in the image data and to extract the image local feature f_vl^i of each visual object, where M is the number of visual objects included in the image data and f_vl^i is the feature vector of visual object i; the image relation features r_ij^v between the visual objects are extracted through an image visual relation coding network, where r_ij^v is the image relation feature between visual object i and visual object j; extracting the image relation features r_ij^v between the visual objects through the image visual relation coding network comprises the steps of: obtaining, via the image visual object detector, the features f_vl^i and f_vl^j of visual object i and visual object j in the image and the feature f_vl^ij of the joint region of the two objects, fusing the features by formula (8), and calculating each relation feature:

r_ij^v = σ(W_1 [f_vl^i; f_vl^j; f_vl^ij])    formula (8)

where [;] is the vector concatenation operation, σ is the neuron activation function, and W_1 is a model parameter;
for each group of data pairs in the training data set, converting each word in the text data corresponding to the data pair into a word vector w_i by adopting a word embedding model, where N is the number of words included in the text data; then inputting each word vector in sequence into a recurrent neural network to obtain the text global feature f_tg corresponding to the text data; inputting each word vector into a feedforward neural network to obtain the text local feature f_tl^i corresponding to each word; simultaneously inputting each word vector into a text relation coding network to extract the text relation features r_ij^t between the words, where r_ij^t is the text relation feature between word i and word j; inputting each word vector into the text relation coding network to extract the text relation features r_ij^t between the words comprises the steps of: in the text relation coding network, calculating the text relation feature r_ij^t between word i and word j by formula (9):

r_ij^t = σ(W_2 [w_i; w_j])    formula (9)

where σ represents the neuron activation function and W_2 is a model parameter;
for a target data pair consisting of any image data and any text data in the training data set, calculating the image-text global similarity s_g(v,t) corresponding to the target data pair based on the cosine distance between the image global feature f_vg of the image data and the text global feature f_tg of the text data in the target data pair; wherein the image-text global similarity s_g(v,t) is calculated as formula (1):

s_g(v,t) = cos(f_vg, f_tg)    formula (1);
calculating the weight of each visual object included in the image data of the target data pair by adopting a text-guided attention mechanism, weighting the image local features f_vl^i of the visual objects by the corresponding weights, and mapping them through a feedforward neural network to obtain new image local representations f̂_vl^i; then calculating the weight of each word included in the text data of the target data pair by adopting a visual-guided attention mechanism, weighting the text local features f_tl^j of the words by the corresponding weights, and mapping them through a feedforward neural network to obtain new text local representations f̂_tl^j; calculating the cosine similarities between the image local representations f̂_vl^i and the text local representations f̂_tl^j of all visual objects and words, and taking their mean as the image-text local similarity s_l(v,t) of the target data pair; wherein the image-text local similarity s_l(v,t) is calculated as formula (2), M being the number of visual objects and N the number of words:

s_l(v,t) = (1/(M·N)) Σ_{i=1}^{M} Σ_{j=1}^{N} cos(f̂_vl^i, f̂_tl^j)    formula (2);
calculating the image-text relation similarity s_r(v,t) corresponding to the target data pair as the mean of the cosine similarities between the image relation features and the text relation features of the target data pair; wherein the image-text relation similarity s_r(v,t) is calculated as formula (3), P representing the number of relations between the image data and the text data:

s_r(v,t) = (1/P) Σ_{p=1}^{P} cos(r_p^v, r_p^t)    formula (3);
calculating the image-text comprehensive similarity s(v,t) corresponding to the target data pair from the image-text global similarity s_g(v,t), the image-text local similarity s_l(v,t) and the image-text relation similarity s_r(v,t) corresponding to the target data pair; wherein the image-text comprehensive similarity s(v,t) is calculated as formula (4):

s(v,t) = s_g(v,t) + s_l(v,t) + s_r(v,t)    formula (4);
based on the image-text comprehensive similarity corresponding to each target data pair, designing an inter-modality structure constraint loss function and an intra-modality structure constraint loss function, training the neural network model by adopting the inter-modality structure constraint loss function and the intra-modality structure constraint loss function, and realizing the cross-modal retrieval operation between data of different modalities according to the neural network model; the inter-modality structure constraint loss function is calculated as formula (5), where B is the number of samples, α is a model hyper-parameter, (v_i, t_i) is a matched target data pair, and (v_i, t^-) and (t_i, v^-) are non-matched target data pairs:

L_inter = Σ_{i=1}^{B} [ max(0, α − s(v_i, t_i) + s(v_i, t^-)) + max(0, α − s(v_i, t_i) + s(v^-, t_i)) ]    formula (5)

the intra-modality structure constraint loss function is calculated as formula (6), where (v_i, v^+, v^-) is an image triplet in which v^+ shares more common semantic tags with v_i than v^- does, and (t_i, t^+, t^-) is a text triplet in which t^+ shares more common semantic tags with t_i than t^- does:

L_inner = Σ_{i=1}^{B} [ max(0, α − s(v_i, v^+) + s(v_i, v^-)) + max(0, α − s(t_i, t^+) + s(t_i, t^-)) ]    formula (6)
2. the method of claim 1, wherein the step of training a neural network model using the inter-modality structural constraint loss function and the intra-modality structural constraint loss function comprises:
obtaining matched target data pairs, unmatched target data pairs, image triples and text triples from the training data set through random sampling, calculating inter-mode structure constraint loss function values according to the inter-mode structure constraint loss function, calculating intra-mode structure constraint loss function values according to the intra-mode structure constraint loss function, fusing according to a formula (7), and optimizing network parameters by using a back propagation algorithm:
L = η·L_inter + (1 − η)·L_inner    formula (7)

where η is a hyper-parameter.
3. The method according to claim 1, wherein calculating the weight of each visual object included in the image data of the target data pair by adopting a text-guided attention mechanism, weighting the image local features f_vl^i of the visual objects by the corresponding weights, and mapping them through a feedforward neural network to obtain new image local representations f̂_vl^i comprises the steps of:

using the text-guided attention mechanism, calculating the weight a_i of each visual object in the image by formula (10):

a_i = exp(W_4 σ(W_3 [f_vl^i; f_tg])) / Σ_{k=1}^{M} exp(W_4 σ(W_3 [f_vl^k; f_tg]))    formula (10)

where W_3 and W_4 are model parameters;

weighting each visual object by formula (11) and mapping it through a feedforward neural network to obtain the new image local representation f̂_vl^i:

f̂_vl^i = W_5 (a_i · f_vl^i)    formula (11)

where W_5 is a model parameter.
4. The method of claim 1, wherein the computing a weight for each word included in the text data in the target data pair using a visual directing attention mechanism includes characterizing a text local feature f of each word tl After corresponding weight weighting, mapping by a feedforward neural network to obtain a new text local representationComprises the steps of:
using a visual directing attention mechanism, the weight of each word in the text is calculated by equation (12):
where W_6 and W_7 are model parameters;
weighting the text local feature f_tl of each word by its corresponding weight via formula (13), and mapping through a feed-forward neural network to obtain the new text local representation,
where W_8 is a model parameter.
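The visual-guided branch of claim 4 mirrors the text-guided one. Again, the formula images for equations (12) and (13) are not reproduced in the source, so this sketch assumes the symmetric form: each word's local feature is scored against the global image feature via parameters W6 and W7, softmax-normalized, then weighted and mapped through a feed-forward layer with parameter W8.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def visual_guided_attention(f_tl, f_v, W6, W7, W8):
    # f_tl: (n, d) local features of the n words in the text.
    # f_v:  (d,)   global image feature guiding the attention.
    # Equation (12) (assumed form, mirroring the text-guided branch):
    # score each word against the image (parameters W6, W7), softmax.
    scores = np.tanh(f_tl @ W6) @ (W7 @ f_v)        # (n,)
    beta = softmax(scores)                           # attention weights
    # Equation (13): weight each word's local feature f_tl and map it
    # through a feed-forward layer (parameter W8) to the new representation.
    new_f_tl = np.tanh((beta[:, None] * f_tl) @ W8)  # (n, d)
    return beta, new_f_tl
```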
5. The method of claim 1, wherein the training data set is obtained from the Wikipedia, MS COCO, or Pascal VOC datasets.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111149240.4A CN113792207B (en) | 2021-09-29 | 2021-09-29 | Cross-modal retrieval method based on multi-level feature representation alignment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113792207A CN113792207A (en) | 2021-12-14 |
CN113792207B true CN113792207B (en) | 2023-11-17 |
Family
ID=78877521
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111149240.4A Active CN113792207B (en) | 2021-09-29 | 2021-09-29 | Cross-modal retrieval method based on multi-level feature representation alignment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113792207B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230162490A1 (en) * | 2021-11-19 | 2023-05-25 | Salesforce.Com, Inc. | Systems and methods for vision-language distribution alignment |
CN114239730A (en) * | 2021-12-20 | 2022-03-25 | 华侨大学 | Cross-modal retrieval method based on neighbor sorting relation |
CN115129917B * | 2022-06-06 | 2024-04-09 | 武汉大学 | Optical-SAR remote sensing image cross-modal retrieval method based on modal common characteristics |
CN114880441B (en) * | 2022-07-06 | 2023-02-10 | 北京百度网讯科技有限公司 | Visual content generation method, device, system, equipment and medium |
CN115712740B (en) * | 2023-01-10 | 2023-06-06 | 苏州大学 | Method and system for multi-modal implication enhanced image text retrieval |
CN115827954B (en) * | 2023-02-23 | 2023-06-06 | 中国传媒大学 | Dynamic weighted cross-modal fusion network retrieval method, system and electronic equipment |
CN116402063B (en) * | 2023-06-09 | 2023-08-15 | 华南师范大学 | Multi-modal irony recognition method, apparatus, device and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110110122A (en) * | 2018-06-22 | 2019-08-09 | 北京交通大学 | Image-text cross-modal retrieval based on multi-level semantic deep hashing |
CN110490946A (en) * | 2019-07-15 | 2019-11-22 | 同济大学 | Text-to-image generation method based on cross-modal similarity and generative adversarial networks |
CN111026894A (en) * | 2019-12-12 | 2020-04-17 | 清华大学 | Cross-modal image text retrieval method based on credibility self-adaptive matching network |
CN112148916A (en) * | 2020-09-28 | 2020-12-29 | 华中科技大学 | Cross-modal retrieval method, device, equipment and medium based on supervision |
CN112784092A (en) * | 2021-01-28 | 2021-05-11 | 电子科技大学 | Cross-modal image text retrieval method of hybrid fusion model |
CN113157974A (en) * | 2021-03-24 | 2021-07-23 | 西安维塑智能科技有限公司 | Pedestrian retrieval method based on character expression |
Also Published As
Publication number | Publication date |
---|---|
CN113792207A (en) | 2021-12-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113792207B (en) | Cross-modal retrieval method based on multi-level feature representation alignment | |
CN109522424B (en) | Data processing method and device, electronic equipment and storage medium | |
CN107491541B (en) | Text classification method and device | |
CN111259148B (en) | Information processing method, device and storage medium | |
WO2020177673A1 (en) | Video sequence selection method, computer device and storage medium | |
CN110008401B (en) | Keyword extraction method, keyword extraction device, and computer-readable storage medium | |
US11856277B2 (en) | Method and apparatus for processing video, electronic device, medium and product | |
CN111931844B (en) | Image processing method and device, electronic equipment and storage medium | |
CN110874145A (en) | Input method and device and electronic equipment | |
CN111583919B (en) | Information processing method, device and storage medium | |
CN109558599B (en) | Conversion method and device and electronic equipment | |
CN112926310B (en) | Keyword extraction method and device | |
CN111582383B (en) | Attribute identification method and device, electronic equipment and storage medium | |
WO2023115911A1 (en) | Object re-identification method and apparatus, electronic device, storage medium, and computer program product | |
EP3734472A1 (en) | Method and device for text processing | |
CN109471919B (en) | Zero pronoun resolution method and device | |
CN116166843B (en) | Text video cross-modal retrieval method and device based on fine granularity perception | |
CN111259967A (en) | Image classification and neural network training method, device, equipment and storage medium | |
CN111753091A (en) | Classification method, classification model training method, device, equipment and storage medium | |
CN111160047A (en) | Data processing method and device and data processing device | |
CN111538998B (en) | Text encryption method and device, electronic equipment and computer readable storage medium | |
CN112328809A (en) | Entity classification method, device and computer readable storage medium | |
CN111984765B (en) | Knowledge base question-answering process relation detection method and device | |
CN112381091A (en) | Video content identification method and device, electronic equipment and storage medium | |
CN115424044A (en) | Multi-mode-based image annotation method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||