CN113792207A

CN113792207A - Cross-modal retrieval method based on multi-level feature representation alignment

Info

Publication number: CN113792207A
Application number: CN202111149240.4A
Authority: CN
Inventors: 张卫锋; 周俊峰; 王小江
Original assignee: Jiaxing University
Current assignee: Jiaxing University
Priority date: 2021-09-29
Filing date: 2021-09-29
Publication date: 2021-12-14
Anticipated expiration: 2041-09-29
Also published as: CN113792207B

Abstract

The invention discloses a cross-modal retrieval method based on multi-level feature representation alignment, and relates to the technical field of cross-modal retrieval. The invention respectively calculates the global similarity, the local similarity and the relation similarity between the image and the text data in two different modes in the cross-mode fine-grained accurate alignment stage, and fuses to obtain the image-text comprehensive similarity, designs a corresponding loss function in the neural network training stage, excavates the cross-mode structure constraint information, performs parameter learning from a plurality of angle constraints and a supervision retrieval model, and finally obtains the retrieval result of a test query sample according to the image-text comprehensive similarity, thereby effectively improving the accuracy of cross-mode retrieval by introducing the fine-grained incidence relation between the image and the text data in two different modes, and having wide market demands and application prospects in the fields of image-text retrieval, mode identification and the like.

Description

Cross-modal retrieval method based on multi-level feature representation alignment

Technical Field

The invention relates to the technical field of cross-modal retrieval, in particular to a cross-modal retrieval method based on multi-level feature representation alignment.

Background

With the rapid development of new generation internet technologies such as mobile internet, social network and the like, multi-modal data such as text, images, videos and the like are growing explosively. The cross-modal retrieval technology aims to realize cross retrieval among different modal data by mining and utilizing associated information among different modal data, and the core of the cross retrieval technology is to realize similarity measurement among the cross modal data. In recent years, the cross-modal retrieval technology has become a research hotspot at home and abroad, receives wide attention from academic circles and industrial circles, is one of important research fields of cross-modal intelligence, and is an important direction for future development of the information retrieval field.

The cross-modal retrieval simultaneously relates to data of multiple modalities, and the data have a 'heterogeneous gap', namely the data are related to each other in high-level semantics and have heterogeneity on bottom-level features, so that a retrieval algorithm is required to deeply mine related information between different modality data and realize alignment of one modality data to another modality data.

At present, a subspace learning method is a mainstream method of cross-modal retrieval, and the method can be subdivided into a retrieval model based on traditional statistical correlation analysis and a retrieval model based on deep learning. The cross-modal retrieval method based on the traditional statistical correlation analysis maps different modal data to a subspace through a linear mapping matrix, and the correlation among the different modal data is maximized. The cross-modal retrieval method based on deep learning extracts effective representations of different modal data by utilizing the feature extraction capability of a deep neural network, and simultaneously excavates the complex association characteristics between the cross-modal data by utilizing the complex nonlinear mapping capability of the neural network.

In the process of implementing the present invention, the applicant finds that the following technical problems exist in the prior art:

the cross-modal retrieval method provided by the prior art focuses on representation learning, association analysis and alignment of global features and local features of images and texts, but lacks reasoning of relationships between visual targets and alignment of relationship information, and cannot comprehensively and effectively utilize a structural constraint information supervision model contained in training data for training, so that the cross-modal retrieval method has low accuracy in cross-modal retrieval of images and texts.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a cross-modal retrieval method based on multi-level feature representation alignment, which accurately measures the similarity between an image and a text and effectively provides retrieval accuracy by cross-modal multi-level representation correlation, thereby solving the technical problems that the representation of the existing cross-modal retrieval method is not fine enough and the cross-modal correlation is not sufficient, and simultaneously, the training of a retrieval model is supervised by using cross-modal structure constraint information. The technical scheme of the invention is as follows:

according to an aspect of the embodiments of the present invention, there is provided a cross-modal retrieval method based on multi-level feature representation alignment, the method including:

acquiring a training data set, wherein for each group of data pairs in the training data set, the data pairs comprise image data, text data and semantic labels corresponding to the image data and the text data together;

for each group of data pairs in the training data set, respectively extracting image global features, image local features and image relation features corresponding to image data in the data pairs, and text global features, text local features and text relation features corresponding to text data in the data pairs;

for a target data pair consisting of any image data and any text data in the training data set, calculating to obtain image-text comprehensive similarity corresponding to the target data pair according to image global features and text global features corresponding to the target data pair, image local features and text local features corresponding to the target data pair, and image relation features and text relation features corresponding to the target data pair;

and designing an inter-modal structural constraint loss function and an intra-modal structural constraint loss function based on the corresponding image-text comprehensive similarity of each group of target data, and training a model by adopting the inter-modal structural constraint loss function and the intra-modal structural constraint loss function.

In a preferred embodiment, the step of, for each group of data pairs in the training data set, respectively extracting an image global feature, an image local feature, and an image relationship feature corresponding to image data in the data pair, and a text global feature, a text local feature, and a text relationship feature corresponding to text data in the data pair includes:

for each group of data pairs in the training data set, extracting the image global characteristics of the image data corresponding to the data pairs by adopting a Convolutional Neural Network (CNN)

Then, a visual target detector is used to detect the visual targets included in the image data and extract the image local features of each visual target

WhereinMfor the number of visual objects comprised by the image data,

as a visual target

Extracting image relation characteristics among all visual targets through an image visual relation coding network

Wherein

as a visual target

And a visual target

Image relationship features between;

for each group of data pairs in the training data set, converting each word in the text data corresponding to the data pair into a word vector using a word embedding model

WhereinNthe word quantity included in the text data is input into a recurrent neural network in sequence, and the global text feature corresponding to the text data is obtained

Then, each word vector is input to a feedforward neural network to obtain the local text characteristics corresponding to each word

Simultaneously, each word vector is input into a text relation coding network to extract text relation characteristics among words

Wherein

is a word

Hehe word

A textual relationship feature between.

In a preferred embodiment, the step of calculating, for a target data pair composed of any image data and any text data in the training data set, an image-text comprehensive similarity corresponding to the target data pair according to an image global feature and a text global feature corresponding to the target data pair, an image local feature and a text local feature corresponding to the target data pair, and an image relationship feature and a text relationship feature corresponding to the target data pair includes:

for a target data pair consisting of any image data and any text data in the training data set, based on image global features corresponding to the image data in the target data pair

Global features of text corresponding to text data

The cosine distance of the target data is calculated to obtain the image-text global similarity corresponding to the target data

(ii) a Wherein image-text global similarity

Is as in formula (1):

) Formula (1)

Calculating the weight of each visual target included in the image data in the target data pair by adopting a text-guided attention mechanism, and carrying out local image feature on each visual target

After weighting corresponding weight, obtaining new image local representation through feedforward neural network mapping

Then, a visual guidance attention mechanism is adopted to calculate the weight of each word included in the text data in the target data pair, and the text of each word is locally processedFeature(s)

After weighting corresponding weight, obtaining new text local representation through feedforward neural network mapping

From the respective image partial representation

And respective text partial representations

Calculating cosine similarity of all visual targets and words, and calculating the local similarity of the target data to the corresponding image-text according to the mean value of the cosine similarity

(ii) a Wherein the image-text local similarity

The formula (2) is shown in the specification, wherein M is the number of visual objects, and N is the number of words:

formula (2)

Calculating to obtain image-text relation similarity corresponding to the target data pair according to the cosine similarity mean value of each image relation feature and each text relation feature in the target data pair

(ii) a Wherein, the similarity of image-text relationship

Is as in formula (3),Pnumber of relationships representing image data and text data:

formula (3)

According to the global similarity of the target data to the corresponding image-text

Image-text local similarity

Similarity of image-text relationship

Calculating to obtain the image-text comprehensive similarity corresponding to the target data pair

(ii) a Wherein, the image-text comprehensive similarity

The calculation formula of (2) is as formula (4):

equation (4).

In a preferred embodiment, the calculation formula of the inter-modal structure constraint loss function is as in formula (5), wherein,Bis the number of samples to be tested,

in order to be a hyper-parameter of the model,

for the pair of target data that is matched,

and

for non-matching target data pairs:

formula (5)

The calculation formula of the intra-modal structure constraint loss function is shown as formula (6), wherein,

for image triplets, in contrast to

，

And

there are more common semantic labels that are present,

for text triplets, in contrast to

，

And

with more common semantic labels:

equation (6).

In a preferred embodiment, the step of training the neural network model by using the inter-modal structural constraint loss function and the intra-modal structural constraint loss function includes:

randomly sampling from the training data set to obtain a matched target data pair, a non-matched target data pair, an image triple and a text triple, respectively calculating an inter-modal structure constraint loss function value according to the inter-modal structure constraint loss function, calculating an intra-modal structure constraint loss function value according to the intra-modal structure constraint loss function, fusing according to a formula (7), and optimizing network parameters by using a back propagation algorithm:

formula (7)

Wherein

Is a hyper-parameter.

In a preferred embodiment, the image relationship features between the visual targets are extracted through an image visual relationship coding network

The method comprises the following steps:

obtaining visual objects in an image via an image visual object detector

And a visual target

Is characterized by

，

And the characteristics of the two target union regions

The above features are fused by formula (8), and each relation characteristic is obtained by calculationAnd (3) carrying out mark:

formula (8)

Wherein]In order to perform the vector splicing operation,

in order to function the activation of the neuron,

are model parameters.

In a preferred embodiment, the inputting of the word vectors into the text relation coding network extracts the text relation features between the words

The method comprises the following steps:

in a text-relational coding network, words are calculated using equation (9)

Hehe word

Characteristic of textual relationship between

：

Formula (9)

Wherein,

represents the function of the activation of the neuron,

are model parameters.

In a preferred embodiment, the calculating of the weight of each visual target included in the image data in the target data pair by using a text-guided attention mechanism includes the step of comparing the image local features of the visual targets

The method comprises the following steps:

using the text-guided attention mechanism, the weight of each visual object in the image is calculated by equation (10):

formula (10)

Wherein,

、

is a model parameter;

each visual target is weighted by formula (11) and a new image local representation is obtained through feed-forward neural network mapping

：

Formula (11)

Wherein,

are model parameters.

In a preferred embodiment, the calculating using a visual guidance attention mechanismThe weight of each word included in the text data in the target data pair is obtained by using the local text characteristics of each word

The method comprises the following steps:

using the visual guidance attention mechanism, the weight of each word in the text is calculated by equation (12):

formula (12)

Wherein,

、

is a model parameter;

local text features for individual words by means of formula (13)

Weighting corresponding weight, and obtaining new text local representation through feedforward neural network mapping

：

Formula (13)

Wherein,

are model parameters.

In a preferred embodiment, the training data set is obtained by Wikipedia, MS COCO, Pascal Voc.

Compared with the prior art, the cross-modal retrieval method based on multi-level feature representation alignment provided by the invention has the following advantages:

the invention provides a cross-modal retrieval method based on multi-level feature representation alignment, which is characterized in that the global similarity, the local similarity and the relation similarity between two different modal data of an image and a text are respectively calculated and fused to obtain the comprehensive similarity of the image and the text in a cross-modal fine-grained accurate alignment stage, a corresponding loss function is designed in a network training stage, cross-modal structure constraint information is mined, parameter learning from a plurality of angle constraints and a supervision retrieval model is performed, and finally a retrieval result of a test query sample is obtained according to the comprehensive similarity of the image and the text, so that the accuracy of cross-modal retrieval is effectively improved by introducing a fine-grained incidence relation between the two different modal data of the image and the text, and the cross-modal retrieval method has wide market requirements and application prospects in the fields of image-text retrieval, mode identification and the like.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a schematic diagram of an implementation environment provided by one embodiment of the invention.

FIG. 2 is a flowchart illustrating a method for cross-modal retrieval based on multi-level feature representation alignment, according to an example embodiment.

Fig. 3 is a schematic diagram illustrating constraint loss of an inter-modal structure according to an embodiment of the present invention.

Fig. 4 is a diagram illustrating intra-modal structural constraint loss according to an embodiment of the present invention.

Fig. 5 is a diagram illustrating a result of performing a text search on an image according to an embodiment of the present invention.

FIG. 6 is a block diagram of an apparatus for implementing a cross-modal retrieval method based on multi-level feature representation alignment, according to an example embodiment.

FIG. 7 is a block diagram of an apparatus for implementing a cross-modal retrieval method based on multi-level feature representation alignment, according to an example embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below with reference to specific embodiments (but not limited to) and the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, rather than all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention can be suitable for various scenes, and the related implementation environment can comprise an input and output scene of a single server or an interaction scene of a terminal and the server. When the implementation environment is an input/output scene of a single server, the main bodies of acquiring and storing the image data and the text data are both servers; when the implementation environment is an interaction scenario between a terminal and a server, a schematic diagram of the implementation environment according to the embodiment may be as shown in fig. 1. In the schematic diagram of the implementation environment shown in fig. 1, the implementation environment includes a terminal 101 and a server 102.

The terminal 101 is an electronic device running at least one client, and the client is a client of an Application program, which is also called APP (Application program). The terminal 101 may be a smartphone, a tablet computer, or the like.

The terminal 101 and the server 102 are connected via a wireless or wired network. The terminal 101 is used for transmitting data to the server 102, or the terminal is used for receiving data transmitted by the server 102. In one possible implementation, the terminal 101 may transmit at least one of image data or text data to the server 102.

The server 102 is used for receiving data transmitted by the terminal 101, or the server 102 is used for transmitting data to the terminal 101. The server 102 may analyze and process data transmitted by the terminal 101, so as to match image data and text data with the highest similarity from the database and transmit the image data and the text data to the terminal 101.

Fig. 2 is a flowchart illustrating a method for cross-modal retrieval method based on multi-level feature representation alignment according to an exemplary embodiment, and as shown in fig. 2, the method for cross-modal retrieval method based on multi-level feature representation alignment includes:

step 100: acquiring a training data set, wherein for each group of data pairs in the training data set, the data pairs comprise image data, text data and semantic labels which correspond to the image data and the text data together.

It should be noted that the text data may be text contents corresponding to any language, such as english, chinese, japanese, german, etc.; the image data may be image content corresponding to any color, such as a color image, a grayscale image, and the like.

Step 200: and for each group of data pairs in the training data set, respectively extracting image global features, image local features and image relation features corresponding to the image data in the data pairs, and text global features, text local features and text relation features corresponding to the text data in the data pairs.

In a preferred embodiment, step 200 specifically includes:

step 210: for each group of data pairs in the training data set, extracting the image global characteristics of the image data corresponding to the data pairs by adopting a Convolutional Neural Network (CNN)

WhereinMfor the number of visual objects comprised by the image data,

as a visual target

Wherein

as a visual target

And a visual target

Image relationship features between.

Step 220: for each group of data pairs in the training data set, converting each word in the text data corresponding to the data pair into a word vector using a word embedding model

Wherein

is a word

Hehe word

A textual relationship feature between.

By implementing the step 200, cross-modal multi-level refined representation can be realized.

Step 300: and for a target data pair consisting of any image data and any text data in the training data set, calculating to obtain the image-text comprehensive similarity corresponding to the target data pair according to the image global feature and the text global feature corresponding to the target data pair, the image local feature and the text local feature corresponding to the target data pair, and the image relation feature and the text relation feature corresponding to the target data pair.

In a preferred embodiment, step 300 specifically includes:

step 310: for a target data pair consisting of any image data and any text data in the training data set, based on image global features corresponding to the image data in the target data pair

Global features of text corresponding to text data

。

Wherein image-text global similarity

Is as in formula (1):

) Formula (1)

Step 320: calculating the weight of each visual target included in the image data in the target data pair by adopting a text-guided attention mechanism, and carrying out local image feature on each visual target

Then, a visual guidance attention mechanism is adopted to calculate the weight of each word included in the text data in the target data pair, and the text local characteristics of each word are calculated

From the respective image partial representation

And respective text partial representations

。

Wherein the image-text local similarity

Is as in formula (2),Mfor the visionThe number of the target is,Nnumber of words:

formula (2)

Step 330: calculating to obtain image-text relation similarity corresponding to the target data pair according to the cosine similarity mean value of each image relation feature and each text relation feature in the target data pair

. Wherein, the similarity of image-text relationship

formula (3)

Step 340: according to the global similarity of the target data to the corresponding image-text

Image-text local similarity

And calculating the image-text relation similarity to obtain the image-text comprehensive similarity corresponding to the target data pair

。

Wherein, the image-text comprehensive similarity

The calculation formula of (2) is as formula (4):

formula (4)

Fine-grained cross-modal alignment can be achieved through the implementation of step 300 described above.

Step 400: designing an inter-modal structural constraint loss function and an intra-modal structural constraint loss function based on the corresponding image-text comprehensive similarity of each group of target data, and training a neural network model by adopting the inter-modal structural constraint loss function and the intra-modal structural constraint loss function.

in order to be a hyper-parameter of the model,

for the pair of target data that is matched,

and

for non-matching target data pairs:

formula (5)

for image triplets, in contrast to

，

And

there are more common semantic labels that are present,

for text triplets, in contrast to

，

And

with more common semantic labels:

formula (6)

formula (7)

Wherein

Is a hyper-parameter.

Fig. 4 is a schematic diagram illustrating a loss of structural constraint in a mode according to an embodiment of the present invention.

Through the implementation of the step 400, the training of the information supervision retrieval model by using the cross-modal structure constraint can be realized, so that the network training is performed towards the direction of raising the similarity between the matched target data pairs and reducing the similarity between the unmatched target data pairs, and meanwhile, the trained network can learn images and text representations with more discriminative power.

The method comprises the following steps:

obtaining visual objects in an image via an image visual object detector

And a visual target

Is characterized by

，

And the characteristics of the two target union regions

And fusing the characteristics by adopting a formula (8), and calculating to obtain the relationship characteristics:

formula (8)

Wherein]In order to perform the vector splicing operation,

in order to function the activation of the neuron,

are model parameters.

The method comprises the following steps:

in a text-relational coding network, words are calculated using equation (9)

Hehe word

Characteristic of textual relationship between

：

Formula (9)

Wherein,

represents the function of the activation of the neuron,

are model parameters.

In a preferred embodiment, the calculating of the weight of each visual target included in the image data in the target data pair by using a text-guided attention mechanism includes determining the local features of the image of each visual targetSign for

The method comprises the following steps:

formula (10)

Wherein,

、

is a model parameter;

：

Formula (11)

Wherein,

are model parameters.

In a preferred embodiment, the calculating the weight of each word included in the text data in the target data pair by using a visual guidance attention mechanism includes local text features of each word

The method comprises the following steps:

formula (12)

Wherein,

、

is a model parameter;

local text features for individual words by means of formula (13)

：

Formula (13)

Wherein,

are model parameters.

It should be noted that, after the training of the neural network model is implemented by adopting the above steps 100-400, the similarity between the data of different modes and the data of different modes can be accurately output through the calculation of the neural network model. And (3) using any one mode type in the test data set as a query mode, using the other mode type as a target mode, using each data of the query mode as a query sample, retrieving the data in the target mode, and calculating the similarity between the query sample and the query target according to an image-text comprehensive similarity calculation formula shown in a formula (4). In a possible implementation manner, the neural network model may output the target modal data with the highest similarity as matching data, or the neural network model sorts the similarities of the neural network models from large to small to obtain a related result list of a preset number of target modal data, thereby implementing cross-modal retrieval operation among different modal data.

This example was conducted using the MS COCO cross-modal dataset, which was first proposed in the literature (T. Lin, et al. Microsoft COCO: Common objects in context, ECCV 2014, pp.740-755.) and has become one of the most Common experimental datasets in the cross-modal search field. Each picture in the data set is provided with 5 text labels, wherein 82783 pictures and text labels thereof are used as training sample sets, and 5000 pictures and text labels thereof are randomly selected from the rest samples to be used as test sample sets. In order to better illustrate the beneficial effects of the cross-modal retrieval method based on multi-level feature representation alignment provided by the embodiment of the present invention, the cross-modal retrieval method based on multi-level feature representation alignment provided by the present invention is compared with the following 3 existing cross-modal retrieval methods through experimental tests:

the prior method comprises the following steps: the Order-embedding method described in the literature (I. Vendorov, R. Kiros, S. Fidler, and R. Urtastun, Order-embedding of images and language, ICLR, 2016.).

The prior method II comprises the following steps: a VSE + + method described in the literature (F. Faghri, D. fly, R. tools, and S. Fidler, VSE + +: Improved visual inspection techniques with hard defects, BMVC, 2018.).

The existing method is three: the c-ANet method is described in the documents (J, Yu, W, Zhang, Y, Lu, Z, Qin, et al, reading on the relationship: Enhancing visual representation for visual request and cross-mode representation, IEEE Transactions on Multimedia, 22(12): 3196-.

The accuracy of cross-modal retrieval is evaluated by adopting an R @ n index commonly used in the field of cross-modal retrieval in the experiment, the index represents the percentage of correct samples in n samples returned by retrieval, the higher the index is, the better the retrieval result is, and n in the experiment is respectively 1, 5 and 10.

Watch 1

Compared with the conventional cross-modal retrieval method, the cross-modal retrieval method based on multi-level feature representation alignment provided by the invention has the advantages that the retrieval accuracy on two tasks of retrieving the text data from the image data and retrieving the image data from the text data is obviously improved, so that the effectiveness of refined alignment of the global-local-relation multi-level feature representation of the image text provided by the invention is fully proved. For easy understanding, a schematic diagram of results of text retrieval images by using the embodiment of the present invention is also shown, as shown in fig. 5, where the first column is a text for retrieval, the second column is a matching image given by a data set, and the third to seventh columns are corresponding retrieval results of the top five degrees of similarity.

The following experimental results show that compared with the existing method, the cross-modal retrieval method based on multi-level feature representation alignment can achieve higher retrieval accuracy.

In summary, the present invention provides a cross-modal search method based on multi-level feature representation alignment, the global similarity, the local similarity and the relation similarity between two different modal data of an image and a text are respectively calculated in a cross-modal fine-grained accurate alignment stage and are fused to obtain the image-text comprehensive similarity, in the network training stage, designing corresponding loss functions, excavating cross-modal structure constraint information, learning from parameters of a plurality of angle constraints and a supervision retrieval model, finally obtaining retrieval results of the test query sample according to the image-text comprehensive similarity, therefore, by introducing a fine-grained incidence relation between the image data and the text data in two different modes, the accuracy of cross-mode retrieval is effectively improved, and the method has wide market demands and application prospects in the fields of image-text retrieval, mode identification and the like.

FIG. 6 is a block diagram of an apparatus for implementing a cross-modal retrieval method based on multi-level feature representation alignment, according to an example embodiment. For example, the apparatus 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.

Referring to fig. 6, apparatus 600 may include one or more of the following components: processing component 602, memory 604, power component 606, multimedia component 608, audio component 610, input/output (I/O) interface 612, sensor component 614, and communication component 616.

The processing component 602 generally controls overall operation of the device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 602 may include one or more processors 620 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 602 can include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 can include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.

The memory 604 is configured to store various types of data to support operations at the apparatus 600. Examples of such data include instructions for any application or method operating on device 600, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 604 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

Power supply component 606 provides power to the various components of device 600. The power components 606 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 600.

The multimedia component 608 includes a screen that provides an output interface between the device 600 and the target user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a target user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 608 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 600 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 610 is configured to output and/or input audio signals. For example, audio component 610 includes a Microphone (MIC) configured to receive external audio signals when apparatus 600 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 604 or transmitted via the communication component 616. In some embodiments, audio component 610 further includes a speaker for outputting audio signals.

The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor component 614 includes one or more sensors for providing status assessment of various aspects of the apparatus 600. For example, sensor component 614 may detect an open/closed state of device 600, the relative positioning of components, such as a display and keypad of device 600, the change in position of device 600 or a component of device 600, the presence or absence of contact by a target user with device 600, the orientation or acceleration/deceleration of device 600, and the change in temperature of device 600. The sensor assembly 614 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 616 is configured to facilitate communications between the apparatus 600 and other devices in a wired or wireless manner. The apparatus 600 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 616 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the apparatus 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 604 comprising instructions, executable by the processor 620 of the apparatus 600 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

A non-transitory computer readable storage medium having instructions therein which, when executed by a processor of apparatus 600, enable apparatus 600 to perform a cross-modal retrieval method based on multi-level feature representation alignment, the method comprising:

designing an inter-modal structural constraint loss function and an intra-modal structural constraint loss function based on the corresponding image-text comprehensive similarity of each group of target data, and training a neural network model by adopting the inter-modal structural constraint loss function and the intra-modal structural constraint loss function.

FIG. 7 is a block diagram of an apparatus for implementing a cross-modal retrieval method based on multi-level feature representation alignment, according to an example embodiment. For example, the apparatus 700 may be provided as a server. Referring to fig. 7, apparatus 700 includes a processing component 722 that further includes one or more processors and memory resources, represented by memory 732, for storing instructions, such as applications, that are executable by processing component 722. The application programs stored in memory 732 may include one or more modules that each correspond to a set of instructions. Further, the processing component 722 is configured to execute instructions to perform the above-described launch page generation method.

The apparatus 700 may also include a power component 726 configured to perform power management of the apparatus 700, a wired or wireless network interface 750 configured to connect the apparatus 700 to a network, and an input output (I/O) interface 758. The apparatus 700 may operate based on an operating system stored in memory 732, such as Windows Server, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, or the like.

While the invention has been described in detail in the foregoing by way of general description, and specific embodiments and experiments, it will be apparent to those skilled in the art that modifications and improvements can be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof.

Claims

1. A cross-modal retrieval method based on multi-level feature representation alignment is characterized by comprising the following steps:

2. The method according to claim 1, wherein the step of extracting, for each group of data pairs in the training data set, an image global feature, an image local feature and an image relation feature corresponding to image data in the data pair, and a text global feature, a text local feature and a text relation feature corresponding to text data in the data pair, respectively, comprises:

WhereinMfor the number of visual objects comprised by the image data,

as a visual target

Wherein

as a visual target

And a visual target

Image relationship features between;

Wherein

is a word

Hehe word

A textual relationship feature between.

3. The method according to claim 2, wherein the step of calculating, for a target data pair composed of any image data and any text data in the training data set, an image-text comprehensive similarity corresponding to the target data pair according to an image global feature and a text global feature corresponding to the target data pair, an image local feature and a text local feature corresponding to the target data pair, and an image relationship feature and a text relationship feature corresponding to the target data pair includes:

Global features of text corresponding to text data

(ii) a Wherein image-text global similarity

Is as in formula (1):

) Formula (1)

From the respective image partial representation

Calculating cosine similarity of all visual targets and words according to local representation of each text, and calculating the local similarity of the target data to the corresponding image-text according to the mean value of the cosine similarity

(ii) a Wherein the image-text local similarity

formula (2)

(ii) a Wherein, the similarity of image-text relationship

formula (3)

Image-text local similarity

Similarity of image-text relationship

Is calculated toTo the corresponding image-text comprehensive similarity of the target data pair

(ii) a Wherein, the image-text comprehensive similarity

The calculation formula of (2) is as formula (4):

equation (4).

4. The method according to claim 3, wherein the inter-modal structural constraint loss function is calculated as in equation (5),Bis the number of samples to be tested,

in order to be a hyper-parameter of the model,

for the pair of target data that is matched,

and

for non-matching target data pairs:

formula (5)

for image triplets, in contrast to

，

And

there are more common semantic labels that are present,

for text triplets, in contrast to

，

And

with more common semantic labels:

equation (6).

5. The method of claim 4, wherein the step of training a neural network model using the inter-modal and intra-modal structural constraint loss functions comprises:

formula (7)

Wherein

Is a hyper-parameter.

6. The method according to claim 2, wherein the extracting of the image relation features between the visual targets through the image visual relation coding network

The method comprises the following steps:

obtaining visual objects in an image via an image visual object detector

And a visual target

Is characterized by

，

And the characteristics of the two target union regions

formula (8)

Wherein]In order to perform the vector splicing operation,

in order to function the activation of the neuron,

are model parameters.

7. The method of claim 2, wherein inputting the word vectors into the text-relation coding network extracts text-relation features between words

The method comprises the following steps:

in a text-relational coding network, words are calculated using equation (9)

Hehe word

Characteristic of textual relationship between

：

Formula (9)

Wherein,

representing the neuron activation function as a model parameter.

8. The method of claim 3, wherein the step of removing the substrate comprises removing the substrate from the substrateCalculating the weight of each visual target included in the image data in the target data pair by adopting a text-guided attention mechanism, and calculating the image local characteristics of each visual target

After weighting the corresponding weight, obtaining a new image local representation through feedforward neural network mapping, comprising the following steps:

formula (10)

Wherein,

、

is a model parameter;

：

Formula (11)

Wherein,

are model parameters.

9. The method of claim 3, wherein said directing attention visually is performedThe force mechanism calculates the weight of each word included in the text data in the target data pair and the local text characteristics of each word