CN113792207A - Cross-modal retrieval method based on multi-level feature representation alignment - Google Patents

Cross-modal retrieval method based on multi-level feature representation alignment Download PDF

Info

Publication number
CN113792207A
CN113792207A CN202111149240.4A CN202111149240A CN113792207A CN 113792207 A CN113792207 A CN 113792207A CN 202111149240 A CN202111149240 A CN 202111149240A CN 113792207 A CN113792207 A CN 113792207A
Authority
CN
China
Prior art keywords
text
image
data
target
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111149240.4A
Other languages
Chinese (zh)
Other versions
CN113792207B (en
Inventor
张卫锋
周俊峰
王小江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiaxing University
Original Assignee
Jiaxing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiaxing University filed Critical Jiaxing University
Priority to CN202111149240.4A priority Critical patent/CN113792207B/en
Publication of CN113792207A publication Critical patent/CN113792207A/en
Application granted granted Critical
Publication of CN113792207B publication Critical patent/CN113792207B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-modal retrieval method based on multi-level feature representation alignment, and relates to the technical field of cross-modal retrieval. The invention respectively calculates the global similarity, the local similarity and the relation similarity between the image and the text data in two different modes in the cross-mode fine-grained accurate alignment stage, and fuses to obtain the image-text comprehensive similarity, designs a corresponding loss function in the neural network training stage, excavates the cross-mode structure constraint information, performs parameter learning from a plurality of angle constraints and a supervision retrieval model, and finally obtains the retrieval result of a test query sample according to the image-text comprehensive similarity, thereby effectively improving the accuracy of cross-mode retrieval by introducing the fine-grained incidence relation between the image and the text data in two different modes, and having wide market demands and application prospects in the fields of image-text retrieval, mode identification and the like.

Description

Cross-modal retrieval method based on multi-level feature representation alignment
Technical Field
The invention relates to the technical field of cross-modal retrieval, in particular to a cross-modal retrieval method based on multi-level feature representation alignment.
Background
With the rapid development of new generation internet technologies such as mobile internet, social network and the like, multi-modal data such as text, images, videos and the like are growing explosively. The cross-modal retrieval technology aims to realize cross retrieval among different modal data by mining and utilizing associated information among different modal data, and the core of the cross retrieval technology is to realize similarity measurement among the cross modal data. In recent years, the cross-modal retrieval technology has become a research hotspot at home and abroad, receives wide attention from academic circles and industrial circles, is one of important research fields of cross-modal intelligence, and is an important direction for future development of the information retrieval field.
The cross-modal retrieval simultaneously relates to data of multiple modalities, and the data have a 'heterogeneous gap', namely the data are related to each other in high-level semantics and have heterogeneity on bottom-level features, so that a retrieval algorithm is required to deeply mine related information between different modality data and realize alignment of one modality data to another modality data.
At present, a subspace learning method is a mainstream method of cross-modal retrieval, and the method can be subdivided into a retrieval model based on traditional statistical correlation analysis and a retrieval model based on deep learning. The cross-modal retrieval method based on the traditional statistical correlation analysis maps different modal data to a subspace through a linear mapping matrix, and the correlation among the different modal data is maximized. The cross-modal retrieval method based on deep learning extracts effective representations of different modal data by utilizing the feature extraction capability of a deep neural network, and simultaneously excavates the complex association characteristics between the cross-modal data by utilizing the complex nonlinear mapping capability of the neural network.
In the process of implementing the present invention, the applicant finds that the following technical problems exist in the prior art:
the cross-modal retrieval method provided by the prior art focuses on representation learning, association analysis and alignment of global features and local features of images and texts, but lacks reasoning of relationships between visual targets and alignment of relationship information, and cannot comprehensively and effectively utilize a structural constraint information supervision model contained in training data for training, so that the cross-modal retrieval method has low accuracy in cross-modal retrieval of images and texts.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a cross-modal retrieval method based on multi-level feature representation alignment, which accurately measures the similarity between an image and a text and effectively provides retrieval accuracy by cross-modal multi-level representation correlation, thereby solving the technical problems that the representation of the existing cross-modal retrieval method is not fine enough and the cross-modal correlation is not sufficient, and simultaneously, the training of a retrieval model is supervised by using cross-modal structure constraint information. The technical scheme of the invention is as follows:
according to an aspect of the embodiments of the present invention, there is provided a cross-modal retrieval method based on multi-level feature representation alignment, the method including:
acquiring a training data set, wherein for each group of data pairs in the training data set, the data pairs comprise image data, text data and semantic labels corresponding to the image data and the text data together;
for each group of data pairs in the training data set, respectively extracting image global features, image local features and image relation features corresponding to image data in the data pairs, and text global features, text local features and text relation features corresponding to text data in the data pairs;
for a target data pair consisting of any image data and any text data in the training data set, calculating to obtain image-text comprehensive similarity corresponding to the target data pair according to image global features and text global features corresponding to the target data pair, image local features and text local features corresponding to the target data pair, and image relation features and text relation features corresponding to the target data pair;
and designing an inter-modal structural constraint loss function and an intra-modal structural constraint loss function based on the corresponding image-text comprehensive similarity of each group of target data, and training a model by adopting the inter-modal structural constraint loss function and the intra-modal structural constraint loss function.
In a preferred embodiment, the step of, for each group of data pairs in the training data set, respectively extracting an image global feature, an image local feature, and an image relationship feature corresponding to image data in the data pair, and a text global feature, a text local feature, and a text relationship feature corresponding to text data in the data pair includes:
for each group of data pairs in the training data set, extracting the image global characteristics of the image data corresponding to the data pairs by adopting a Convolutional Neural Network (CNN)
Figure 21599DEST_PATH_IMAGE001
Then, a visual target detector is used to detect the visual targets included in the image data and extract the image local features of each visual target
Figure 834704DEST_PATH_IMAGE002
WhereinMfor the number of visual objects comprised by the image data,
Figure 956243DEST_PATH_IMAGE003
as a visual target
Figure 263728DEST_PATH_IMAGE004
Extracting image relation characteristics among all visual targets through an image visual relation coding network
Figure 108318DEST_PATH_IMAGE005
Wherein
Figure 905373DEST_PATH_IMAGE006
as a visual target
Figure 514209DEST_PATH_IMAGE004
And a visual target
Figure 359805DEST_PATH_IMAGE007
Image relationship features between;
for each group of data pairs in the training data set, converting each word in the text data corresponding to the data pair into a word vector using a word embedding model
Figure 370486DEST_PATH_IMAGE008
WhereinNthe word quantity included in the text data is input into a recurrent neural network in sequence, and the global text feature corresponding to the text data is obtained
Figure 525393DEST_PATH_IMAGE009
Then, each word vector is input to a feedforward neural network to obtain the local text characteristics corresponding to each word
Figure 621525DEST_PATH_IMAGE010
Simultaneously, each word vector is input into a text relation coding network to extract text relation characteristics among words
Figure 270812DEST_PATH_IMAGE011
Wherein
Figure 87065DEST_PATH_IMAGE012
is a word
Figure 225922DEST_PATH_IMAGE004
Hehe word
Figure 481454DEST_PATH_IMAGE007
A textual relationship feature between.
In a preferred embodiment, the step of calculating, for a target data pair composed of any image data and any text data in the training data set, an image-text comprehensive similarity corresponding to the target data pair according to an image global feature and a text global feature corresponding to the target data pair, an image local feature and a text local feature corresponding to the target data pair, and an image relationship feature and a text relationship feature corresponding to the target data pair includes:
for a target data pair consisting of any image data and any text data in the training data set, based on image global features corresponding to the image data in the target data pair
Figure 996749DEST_PATH_IMAGE013
Global features of text corresponding to text data
Figure 168974DEST_PATH_IMAGE009
The cosine distance of the target data is calculated to obtain the image-text global similarity corresponding to the target data
Figure 478732DEST_PATH_IMAGE014
(ii) a Wherein image-text global similarity
Figure 221560DEST_PATH_IMAGE015
Is as in formula (1):
Figure 274967DEST_PATH_IMAGE016
) Formula (1)
Calculating the weight of each visual target included in the image data in the target data pair by adopting a text-guided attention mechanism, and carrying out local image feature on each visual target
Figure 114747DEST_PATH_IMAGE017
After weighting corresponding weight, obtaining new image local representation through feedforward neural network mapping
Figure 283822DEST_PATH_IMAGE018
Then, a visual guidance attention mechanism is adopted to calculate the weight of each word included in the text data in the target data pair, and the text of each word is locally processedFeature(s)
Figure 779526DEST_PATH_IMAGE019
After weighting corresponding weight, obtaining new text local representation through feedforward neural network mapping
Figure 371044DEST_PATH_IMAGE020
From the respective image partial representation
Figure 517861DEST_PATH_IMAGE018
And respective text partial representations
Figure 169422DEST_PATH_IMAGE020
Calculating cosine similarity of all visual targets and words, and calculating the local similarity of the target data to the corresponding image-text according to the mean value of the cosine similarity
Figure 886842DEST_PATH_IMAGE021
(ii) a Wherein the image-text local similarity
Figure 282051DEST_PATH_IMAGE021
The formula (2) is shown in the specification, wherein M is the number of visual objects, and N is the number of words:
Figure 111988DEST_PATH_IMAGE022
formula (2)
Calculating to obtain image-text relation similarity corresponding to the target data pair according to the cosine similarity mean value of each image relation feature and each text relation feature in the target data pair
Figure 934450DEST_PATH_IMAGE023
(ii) a Wherein, the similarity of image-text relationship
Figure 201483DEST_PATH_IMAGE023
Is as in formula (3),Pnumber of relationships representing image data and text data:
Figure 338067DEST_PATH_IMAGE024
formula (3)
According to the global similarity of the target data to the corresponding image-text
Figure 6945DEST_PATH_IMAGE014
Image-text local similarity
Figure 187260DEST_PATH_IMAGE021
Similarity of image-text relationship
Figure 941589DEST_PATH_IMAGE023
Calculating to obtain the image-text comprehensive similarity corresponding to the target data pair
Figure 881863DEST_PATH_IMAGE025
(ii) a Wherein, the image-text comprehensive similarity
Figure 405249DEST_PATH_IMAGE025
The calculation formula of (2) is as formula (4):
Figure 257929DEST_PATH_IMAGE026
equation (4).
In a preferred embodiment, the calculation formula of the inter-modal structure constraint loss function is as in formula (5), wherein,Bis the number of samples to be tested,
Figure 499555DEST_PATH_IMAGE027
in order to be a hyper-parameter of the model,
Figure 977941DEST_PATH_IMAGE028
for the pair of target data that is matched,
Figure 355832DEST_PATH_IMAGE029
and
Figure 143529DEST_PATH_IMAGE030
for non-matching target data pairs:
Figure 606871DEST_PATH_IMAGE031
formula (5)
The calculation formula of the intra-modal structure constraint loss function is shown as formula (6), wherein,
Figure 888948DEST_PATH_IMAGE032
for image triplets, in contrast to
Figure 121346DEST_PATH_IMAGE033
Figure 892993DEST_PATH_IMAGE034
And
Figure 529117DEST_PATH_IMAGE035
there are more common semantic labels that are present,
Figure 411623DEST_PATH_IMAGE036
for text triplets, in contrast to
Figure 701790DEST_PATH_IMAGE037
Figure 644338DEST_PATH_IMAGE038
And
Figure 82272DEST_PATH_IMAGE039
with more common semantic labels:
Figure 955419DEST_PATH_IMAGE040
equation (6).
In a preferred embodiment, the step of training the neural network model by using the inter-modal structural constraint loss function and the intra-modal structural constraint loss function includes:
randomly sampling from the training data set to obtain a matched target data pair, a non-matched target data pair, an image triple and a text triple, respectively calculating an inter-modal structure constraint loss function value according to the inter-modal structure constraint loss function, calculating an intra-modal structure constraint loss function value according to the intra-modal structure constraint loss function, fusing according to a formula (7), and optimizing network parameters by using a back propagation algorithm:
Figure 162410DEST_PATH_IMAGE041
formula (7)
Wherein
Figure 275859DEST_PATH_IMAGE042
Is a hyper-parameter.
In a preferred embodiment, the image relationship features between the visual targets are extracted through an image visual relationship coding network
Figure 138773DEST_PATH_IMAGE005
The method comprises the following steps:
obtaining visual objects in an image via an image visual object detector
Figure 363081DEST_PATH_IMAGE004
And a visual target
Figure 378573DEST_PATH_IMAGE007
Is characterized by
Figure 397344DEST_PATH_IMAGE003
Figure 809871DEST_PATH_IMAGE043
And the characteristics of the two target union regions
Figure 775553DEST_PATH_IMAGE044
The above features are fused by formula (8), and each relation characteristic is obtained by calculationAnd (3) carrying out mark:
Figure 957136DEST_PATH_IMAGE045
formula (8)
Wherein]In order to perform the vector splicing operation,
Figure 412388DEST_PATH_IMAGE046
in order to function the activation of the neuron,
Figure 233582DEST_PATH_IMAGE047
are model parameters.
In a preferred embodiment, the inputting of the word vectors into the text relation coding network extracts the text relation features between the words
Figure 799693DEST_PATH_IMAGE011
The method comprises the following steps:
in a text-relational coding network, words are calculated using equation (9)
Figure 39044DEST_PATH_IMAGE004
Hehe word
Figure 665197DEST_PATH_IMAGE007
Characteristic of textual relationship between
Figure 786737DEST_PATH_IMAGE048
Figure 847884DEST_PATH_IMAGE049
Formula (9)
Wherein,
Figure 4059DEST_PATH_IMAGE046
represents the function of the activation of the neuron,
Figure 801113DEST_PATH_IMAGE050
are model parameters.
In a preferred embodiment, the calculating of the weight of each visual target included in the image data in the target data pair by using a text-guided attention mechanism includes the step of comparing the image local features of the visual targets
Figure 347632DEST_PATH_IMAGE017
After weighting corresponding weight, obtaining new image local representation through feedforward neural network mapping
Figure 255546DEST_PATH_IMAGE018
The method comprises the following steps:
using the text-guided attention mechanism, the weight of each visual object in the image is calculated by equation (10):
Figure 453178DEST_PATH_IMAGE051
formula (10)
Wherein,
Figure 421134DEST_PATH_IMAGE052
Figure 517266DEST_PATH_IMAGE053
is a model parameter;
each visual target is weighted by formula (11) and a new image local representation is obtained through feed-forward neural network mapping
Figure 166553DEST_PATH_IMAGE018
Figure 31741DEST_PATH_IMAGE054
Formula (11)
Wherein,
Figure 859013DEST_PATH_IMAGE055
are model parameters.
In a preferred embodiment, the calculating using a visual guidance attention mechanismThe weight of each word included in the text data in the target data pair is obtained by using the local text characteristics of each word
Figure 380125DEST_PATH_IMAGE019
After weighting corresponding weight, obtaining new text local representation through feedforward neural network mapping
Figure 895420DEST_PATH_IMAGE020
The method comprises the following steps:
using the visual guidance attention mechanism, the weight of each word in the text is calculated by equation (12):
Figure 615114DEST_PATH_IMAGE056
formula (12)
Wherein,
Figure 111823DEST_PATH_IMAGE057
Figure 182547DEST_PATH_IMAGE058
is a model parameter;
local text features for individual words by means of formula (13)
Figure 173637DEST_PATH_IMAGE019
Weighting corresponding weight, and obtaining new text local representation through feedforward neural network mapping
Figure 13417DEST_PATH_IMAGE020
Figure 179563DEST_PATH_IMAGE059
Formula (13)
Wherein,
Figure 737583DEST_PATH_IMAGE060
are model parameters.
In a preferred embodiment, the training data set is obtained by Wikipedia, MS COCO, Pascal Voc.
Compared with the prior art, the cross-modal retrieval method based on multi-level feature representation alignment provided by the invention has the following advantages:
the invention provides a cross-modal retrieval method based on multi-level feature representation alignment, which is characterized in that the global similarity, the local similarity and the relation similarity between two different modal data of an image and a text are respectively calculated and fused to obtain the comprehensive similarity of the image and the text in a cross-modal fine-grained accurate alignment stage, a corresponding loss function is designed in a network training stage, cross-modal structure constraint information is mined, parameter learning from a plurality of angle constraints and a supervision retrieval model is performed, and finally a retrieval result of a test query sample is obtained according to the comprehensive similarity of the image and the text, so that the accuracy of cross-modal retrieval is effectively improved by introducing a fine-grained incidence relation between the two different modal data of the image and the text, and the cross-modal retrieval method has wide market requirements and application prospects in the fields of image-text retrieval, mode identification and the like.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic diagram of an implementation environment provided by one embodiment of the invention.
FIG. 2 is a flowchart illustrating a method for cross-modal retrieval based on multi-level feature representation alignment, according to an example embodiment.
Fig. 3 is a schematic diagram illustrating constraint loss of an inter-modal structure according to an embodiment of the present invention.
Fig. 4 is a diagram illustrating intra-modal structural constraint loss according to an embodiment of the present invention.
Fig. 5 is a diagram illustrating a result of performing a text search on an image according to an embodiment of the present invention.
FIG. 6 is a block diagram of an apparatus for implementing a cross-modal retrieval method based on multi-level feature representation alignment, according to an example embodiment.
FIG. 7 is a block diagram of an apparatus for implementing a cross-modal retrieval method based on multi-level feature representation alignment, according to an example embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below with reference to specific embodiments (but not limited to) and the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, rather than all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention can be suitable for various scenes, and the related implementation environment can comprise an input and output scene of a single server or an interaction scene of a terminal and the server. When the implementation environment is an input/output scene of a single server, the main bodies of acquiring and storing the image data and the text data are both servers; when the implementation environment is an interaction scenario between a terminal and a server, a schematic diagram of the implementation environment according to the embodiment may be as shown in fig. 1. In the schematic diagram of the implementation environment shown in fig. 1, the implementation environment includes a terminal 101 and a server 102.
The terminal 101 is an electronic device running at least one client, and the client is a client of an Application program, which is also called APP (Application program). The terminal 101 may be a smartphone, a tablet computer, or the like.
The terminal 101 and the server 102 are connected via a wireless or wired network. The terminal 101 is used for transmitting data to the server 102, or the terminal is used for receiving data transmitted by the server 102. In one possible implementation, the terminal 101 may transmit at least one of image data or text data to the server 102.
The server 102 is used for receiving data transmitted by the terminal 101, or the server 102 is used for transmitting data to the terminal 101. The server 102 may analyze and process data transmitted by the terminal 101, so as to match image data and text data with the highest similarity from the database and transmit the image data and the text data to the terminal 101.
Fig. 2 is a flowchart illustrating a method for cross-modal retrieval method based on multi-level feature representation alignment according to an exemplary embodiment, and as shown in fig. 2, the method for cross-modal retrieval method based on multi-level feature representation alignment includes:
step 100: acquiring a training data set, wherein for each group of data pairs in the training data set, the data pairs comprise image data, text data and semantic labels which correspond to the image data and the text data together.
It should be noted that the text data may be text contents corresponding to any language, such as english, chinese, japanese, german, etc.; the image data may be image content corresponding to any color, such as a color image, a grayscale image, and the like.
Step 200: and for each group of data pairs in the training data set, respectively extracting image global features, image local features and image relation features corresponding to the image data in the data pairs, and text global features, text local features and text relation features corresponding to the text data in the data pairs.
In a preferred embodiment, step 200 specifically includes:
step 210: for each group of data pairs in the training data set, extracting the image global characteristics of the image data corresponding to the data pairs by adopting a Convolutional Neural Network (CNN)
Figure 266785DEST_PATH_IMAGE001
Then, a visual target detector is used to detect the visual targets included in the image data and extract the image local features of each visual target
Figure 961071DEST_PATH_IMAGE002
WhereinMfor the number of visual objects comprised by the image data,
Figure 799583DEST_PATH_IMAGE003
as a visual target
Figure 579320DEST_PATH_IMAGE004
Extracting image relation characteristics among all visual targets through an image visual relation coding network
Figure 177792DEST_PATH_IMAGE005
Wherein
Figure 992164DEST_PATH_IMAGE006
as a visual target
Figure 814627DEST_PATH_IMAGE004
And a visual target
Figure 770075DEST_PATH_IMAGE007
Image relationship features between.
Step 220: for each group of data pairs in the training data set, converting each word in the text data corresponding to the data pair into a word vector using a word embedding model
Figure 968976DEST_PATH_IMAGE008
WhereinNthe word quantity included in the text data is input into a recurrent neural network in sequence, and the global text feature corresponding to the text data is obtained
Figure 575537DEST_PATH_IMAGE009
Then, each word vector is input to a feedforward neural network to obtain the local text characteristics corresponding to each word
Figure 568901DEST_PATH_IMAGE010
Simultaneously, each word vector is input into a text relation coding network to extract text relation characteristics among words
Figure 510181DEST_PATH_IMAGE011
Wherein
Figure 247193DEST_PATH_IMAGE012
is a word
Figure 770578DEST_PATH_IMAGE004
Hehe word
Figure 872527DEST_PATH_IMAGE007
A textual relationship feature between.
By implementing the step 200, cross-modal multi-level refined representation can be realized.
Step 300: and for a target data pair consisting of any image data and any text data in the training data set, calculating to obtain the image-text comprehensive similarity corresponding to the target data pair according to the image global feature and the text global feature corresponding to the target data pair, the image local feature and the text local feature corresponding to the target data pair, and the image relation feature and the text relation feature corresponding to the target data pair.
In a preferred embodiment, step 300 specifically includes:
step 310: for a target data pair consisting of any image data and any text data in the training data set, based on image global features corresponding to the image data in the target data pair
Figure 114152DEST_PATH_IMAGE013
Global features of text corresponding to text data
Figure 334481DEST_PATH_IMAGE009
The cosine distance of the target data is calculated to obtain the image-text global similarity corresponding to the target data
Figure 977952DEST_PATH_IMAGE014
Wherein image-text global similarity
Figure 313119DEST_PATH_IMAGE015
Is as in formula (1):
Figure 979723DEST_PATH_IMAGE016
) Formula (1)
Step 320: calculating the weight of each visual target included in the image data in the target data pair by adopting a text-guided attention mechanism, and carrying out local image feature on each visual target
Figure 58538DEST_PATH_IMAGE017
After weighting corresponding weight, obtaining new image local representation through feedforward neural network mapping
Figure 477887DEST_PATH_IMAGE018
Then, a visual guidance attention mechanism is adopted to calculate the weight of each word included in the text data in the target data pair, and the text local characteristics of each word are calculated
Figure 249534DEST_PATH_IMAGE019
After weighting corresponding weight, obtaining new text local representation through feedforward neural network mapping
Figure 200172DEST_PATH_IMAGE020
From the respective image partial representation
Figure 285940DEST_PATH_IMAGE018
And respective text partial representations
Figure 638424DEST_PATH_IMAGE020
Calculating cosine similarity of all visual targets and words, and calculating the local similarity of the target data to the corresponding image-text according to the mean value of the cosine similarity
Figure 3808DEST_PATH_IMAGE021
Wherein the image-text local similarity
Figure 441743DEST_PATH_IMAGE021
Is as in formula (2),Mfor the visionThe number of the target is,Nnumber of words:
Figure 65622DEST_PATH_IMAGE022
formula (2)
Step 330: calculating to obtain image-text relation similarity corresponding to the target data pair according to the cosine similarity mean value of each image relation feature and each text relation feature in the target data pair
Figure 538192DEST_PATH_IMAGE023
. Wherein, the similarity of image-text relationship
Figure 651641DEST_PATH_IMAGE023
Is as in formula (3),Pnumber of relationships representing image data and text data:
Figure 763823DEST_PATH_IMAGE024
formula (3)
Step 340: according to the global similarity of the target data to the corresponding image-text
Figure 988131DEST_PATH_IMAGE014
Image-text local similarity
Figure 721731DEST_PATH_IMAGE021
And calculating the image-text relation similarity to obtain the image-text comprehensive similarity corresponding to the target data pair
Figure 6082DEST_PATH_IMAGE025
Wherein, the image-text comprehensive similarity
Figure 369674DEST_PATH_IMAGE025
The calculation formula of (2) is as formula (4):
Figure 397673DEST_PATH_IMAGE026
formula (4)
Fine-grained cross-modal alignment can be achieved through the implementation of step 300 described above.
Step 400: designing an inter-modal structural constraint loss function and an intra-modal structural constraint loss function based on the corresponding image-text comprehensive similarity of each group of target data, and training a neural network model by adopting the inter-modal structural constraint loss function and the intra-modal structural constraint loss function.
In a preferred embodiment, the calculation formula of the inter-modal structure constraint loss function is as in formula (5), wherein,Bis the number of samples to be tested,
Figure 579256DEST_PATH_IMAGE027
in order to be a hyper-parameter of the model,
Figure 441032DEST_PATH_IMAGE028
for the pair of target data that is matched,
Figure 340855DEST_PATH_IMAGE029
and
Figure 93916DEST_PATH_IMAGE030
for non-matching target data pairs:
Figure 395585DEST_PATH_IMAGE031
formula (5)
The calculation formula of the intra-modal structure constraint loss function is shown as formula (6), wherein,
Figure 756159DEST_PATH_IMAGE032
for image triplets, in contrast to
Figure 346540DEST_PATH_IMAGE033
Figure 716342DEST_PATH_IMAGE034
And
Figure 29774DEST_PATH_IMAGE035
there are more common semantic labels that are present,
Figure 826828DEST_PATH_IMAGE036
for text triplets, in contrast to
Figure 638927DEST_PATH_IMAGE037
Figure 812419DEST_PATH_IMAGE038
And
Figure 823100DEST_PATH_IMAGE039
with more common semantic labels:
Figure 712428DEST_PATH_IMAGE040
formula (6)
Fig. 3 is a schematic diagram illustrating constraint loss of an inter-modal structure according to an embodiment of the present invention.
In a preferred embodiment, the step of training the neural network model by using the inter-modal structural constraint loss function and the intra-modal structural constraint loss function includes:
randomly sampling from the training data set to obtain a matched target data pair, a non-matched target data pair, an image triple and a text triple, respectively calculating an inter-modal structure constraint loss function value according to the inter-modal structure constraint loss function, calculating an intra-modal structure constraint loss function value according to the intra-modal structure constraint loss function, fusing according to a formula (7), and optimizing network parameters by using a back propagation algorithm:
Figure 808560DEST_PATH_IMAGE041
formula (7)
Wherein
Figure 457847DEST_PATH_IMAGE042
Is a hyper-parameter.
Fig. 4 is a schematic diagram illustrating a loss of structural constraint in a mode according to an embodiment of the present invention.
Through the implementation of the step 400, the training of the information supervision retrieval model by using the cross-modal structure constraint can be realized, so that the network training is performed towards the direction of raising the similarity between the matched target data pairs and reducing the similarity between the unmatched target data pairs, and meanwhile, the trained network can learn images and text representations with more discriminative power.
In a preferred embodiment, the image relationship features between the visual targets are extracted through an image visual relationship coding network
Figure 323035DEST_PATH_IMAGE005
The method comprises the following steps:
obtaining visual objects in an image via an image visual object detector
Figure 418816DEST_PATH_IMAGE004
And a visual target
Figure 2245DEST_PATH_IMAGE007
Is characterized by
Figure 251960DEST_PATH_IMAGE003
Figure 909338DEST_PATH_IMAGE043
And the characteristics of the two target union regions
Figure 219096DEST_PATH_IMAGE044
And fusing the characteristics by adopting a formula (8), and calculating to obtain the relationship characteristics:
Figure 476771DEST_PATH_IMAGE045
formula (8)
Wherein]In order to perform the vector splicing operation,
Figure 530178DEST_PATH_IMAGE046
in order to function the activation of the neuron,
Figure 369958DEST_PATH_IMAGE047
are model parameters.
In a preferred embodiment, the inputting of the word vectors into the text relation coding network extracts the text relation features between the words
Figure 788301DEST_PATH_IMAGE011
The method comprises the following steps:
in a text-relational coding network, words are calculated using equation (9)
Figure 80742DEST_PATH_IMAGE004
Hehe word
Figure 626255DEST_PATH_IMAGE007
Characteristic of textual relationship between
Figure 320542DEST_PATH_IMAGE048
Figure 175365DEST_PATH_IMAGE049
Formula (9)
Wherein,
Figure 220681DEST_PATH_IMAGE046
represents the function of the activation of the neuron,
Figure 615891DEST_PATH_IMAGE050
are model parameters.
In a preferred embodiment, the calculating of the weight of each visual target included in the image data in the target data pair by using a text-guided attention mechanism includes determining the local features of the image of each visual targetSign for
Figure 86055DEST_PATH_IMAGE017
After weighting corresponding weight, obtaining new image local representation through feedforward neural network mapping
Figure 174097DEST_PATH_IMAGE018
The method comprises the following steps:
using the text-guided attention mechanism, the weight of each visual object in the image is calculated by equation (10):
Figure 378813DEST_PATH_IMAGE051
formula (10)
Wherein,
Figure 577713DEST_PATH_IMAGE052
Figure 666499DEST_PATH_IMAGE053
is a model parameter;
each visual target is weighted by formula (11) and a new image local representation is obtained through feed-forward neural network mapping
Figure 925442DEST_PATH_IMAGE018
Figure 679771DEST_PATH_IMAGE054
Formula (11)
Wherein,
Figure 354466DEST_PATH_IMAGE055
are model parameters.
In a preferred embodiment, the calculating the weight of each word included in the text data in the target data pair by using a visual guidance attention mechanism includes local text features of each word
Figure 877851DEST_PATH_IMAGE019
After weighting corresponding weight, obtaining new text local representation through feedforward neural network mapping
Figure 494646DEST_PATH_IMAGE020
The method comprises the following steps:
using the visual guidance attention mechanism, the weight of each word in the text is calculated by equation (12):
Figure 470693DEST_PATH_IMAGE056
formula (12)
Wherein,
Figure 949079DEST_PATH_IMAGE057
Figure 592550DEST_PATH_IMAGE058
is a model parameter;
local text features for individual words by means of formula (13)
Figure 616131DEST_PATH_IMAGE019
Weighting corresponding weight, and obtaining new text local representation through feedforward neural network mapping
Figure 345053DEST_PATH_IMAGE020
Figure 423867DEST_PATH_IMAGE059
Formula (13)
Wherein,
Figure 859528DEST_PATH_IMAGE060
are model parameters.
In a preferred embodiment, the training data set is obtained by Wikipedia, MS COCO, Pascal Voc.
It should be noted that, after the training of the neural network model is implemented by adopting the above steps 100-400, the similarity between the data of different modes and the data of different modes can be accurately output through the calculation of the neural network model. And (3) using any one mode type in the test data set as a query mode, using the other mode type as a target mode, using each data of the query mode as a query sample, retrieving the data in the target mode, and calculating the similarity between the query sample and the query target according to an image-text comprehensive similarity calculation formula shown in a formula (4). In a possible implementation manner, the neural network model may output the target modal data with the highest similarity as matching data, or the neural network model sorts the similarities of the neural network models from large to small to obtain a related result list of a preset number of target modal data, thereby implementing cross-modal retrieval operation among different modal data.
This example was conducted using the MS COCO cross-modal dataset, which was first proposed in the literature (T. Lin, et al. Microsoft COCO: Common objects in context, ECCV 2014, pp.740-755.) and has become one of the most Common experimental datasets in the cross-modal search field. Each picture in the data set is provided with 5 text labels, wherein 82783 pictures and text labels thereof are used as training sample sets, and 5000 pictures and text labels thereof are randomly selected from the rest samples to be used as test sample sets. In order to better illustrate the beneficial effects of the cross-modal retrieval method based on multi-level feature representation alignment provided by the embodiment of the present invention, the cross-modal retrieval method based on multi-level feature representation alignment provided by the present invention is compared with the following 3 existing cross-modal retrieval methods through experimental tests:
the prior method comprises the following steps: the Order-embedding method described in the literature (I. Vendorov, R. Kiros, S. Fidler, and R. Urtastun, Order-embedding of images and language, ICLR, 2016.).
The prior method II comprises the following steps: a VSE + + method described in the literature (F. Faghri, D. fly, R. tools, and S. Fidler, VSE + +: Improved visual inspection techniques with hard defects, BMVC, 2018.).
The existing method is three: the c-ANet method is described in the documents (J, Yu, W, Zhang, Y, Lu, Z, Qin, et al, reading on the relationship: Enhancing visual representation for visual request and cross-mode representation, IEEE Transactions on Multimedia, 22(12): 3196-.
The accuracy of cross-modal retrieval is evaluated by adopting an R @ n index commonly used in the field of cross-modal retrieval in the experiment, the index represents the percentage of correct samples in n samples returned by retrieval, the higher the index is, the better the retrieval result is, and n in the experiment is respectively 1, 5 and 10.
Figure 365596DEST_PATH_IMAGE061
Watch 1
Compared with the conventional cross-modal retrieval method, the cross-modal retrieval method based on multi-level feature representation alignment provided by the invention has the advantages that the retrieval accuracy on two tasks of retrieving the text data from the image data and retrieving the image data from the text data is obviously improved, so that the effectiveness of refined alignment of the global-local-relation multi-level feature representation of the image text provided by the invention is fully proved. For easy understanding, a schematic diagram of results of text retrieval images by using the embodiment of the present invention is also shown, as shown in fig. 5, where the first column is a text for retrieval, the second column is a matching image given by a data set, and the third to seventh columns are corresponding retrieval results of the top five degrees of similarity.
The following experimental results show that compared with the existing method, the cross-modal retrieval method based on multi-level feature representation alignment can achieve higher retrieval accuracy.
In summary, the present invention provides a cross-modal search method based on multi-level feature representation alignment, the global similarity, the local similarity and the relation similarity between two different modal data of an image and a text are respectively calculated in a cross-modal fine-grained accurate alignment stage and are fused to obtain the image-text comprehensive similarity, in the network training stage, designing corresponding loss functions, excavating cross-modal structure constraint information, learning from parameters of a plurality of angle constraints and a supervision retrieval model, finally obtaining retrieval results of the test query sample according to the image-text comprehensive similarity, therefore, by introducing a fine-grained incidence relation between the image data and the text data in two different modes, the accuracy of cross-mode retrieval is effectively improved, and the method has wide market demands and application prospects in the fields of image-text retrieval, mode identification and the like.
FIG. 6 is a block diagram of an apparatus for implementing a cross-modal retrieval method based on multi-level feature representation alignment, according to an example embodiment. For example, the apparatus 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 6, apparatus 600 may include one or more of the following components: processing component 602, memory 604, power component 606, multimedia component 608, audio component 610, input/output (I/O) interface 612, sensor component 614, and communication component 616.
The processing component 602 generally controls overall operation of the device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 602 may include one or more processors 620 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 602 can include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 can include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operations at the apparatus 600. Examples of such data include instructions for any application or method operating on device 600, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 604 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power supply component 606 provides power to the various components of device 600. The power components 606 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 600.
The multimedia component 608 includes a screen that provides an output interface between the device 600 and the target user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a target user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 608 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 600 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, audio component 610 includes a Microphone (MIC) configured to receive external audio signals when apparatus 600 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 604 or transmitted via the communication component 616. In some embodiments, audio component 610 further includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 614 includes one or more sensors for providing status assessment of various aspects of the apparatus 600. For example, sensor component 614 may detect an open/closed state of device 600, the relative positioning of components, such as a display and keypad of device 600, the change in position of device 600 or a component of device 600, the presence or absence of contact by a target user with device 600, the orientation or acceleration/deceleration of device 600, and the change in temperature of device 600. The sensor assembly 614 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate communications between the apparatus 600 and other devices in a wired or wireless manner. The apparatus 600 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 616 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 604 comprising instructions, executable by the processor 620 of the apparatus 600 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium having instructions therein which, when executed by a processor of apparatus 600, enable apparatus 600 to perform a cross-modal retrieval method based on multi-level feature representation alignment, the method comprising:
acquiring a training data set, wherein for each group of data pairs in the training data set, the data pairs comprise image data, text data and semantic labels corresponding to the image data and the text data together;
for each group of data pairs in the training data set, respectively extracting image global features, image local features and image relation features corresponding to image data in the data pairs, and text global features, text local features and text relation features corresponding to text data in the data pairs;
for a target data pair consisting of any image data and any text data in the training data set, calculating to obtain image-text comprehensive similarity corresponding to the target data pair according to image global features and text global features corresponding to the target data pair, image local features and text local features corresponding to the target data pair, and image relation features and text relation features corresponding to the target data pair;
designing an inter-modal structural constraint loss function and an intra-modal structural constraint loss function based on the corresponding image-text comprehensive similarity of each group of target data, and training a neural network model by adopting the inter-modal structural constraint loss function and the intra-modal structural constraint loss function.
FIG. 7 is a block diagram of an apparatus for implementing a cross-modal retrieval method based on multi-level feature representation alignment, according to an example embodiment. For example, the apparatus 700 may be provided as a server. Referring to fig. 7, apparatus 700 includes a processing component 722 that further includes one or more processors and memory resources, represented by memory 732, for storing instructions, such as applications, that are executable by processing component 722. The application programs stored in memory 732 may include one or more modules that each correspond to a set of instructions. Further, the processing component 722 is configured to execute instructions to perform the above-described launch page generation method.
The apparatus 700 may also include a power component 726 configured to perform power management of the apparatus 700, a wired or wireless network interface 750 configured to connect the apparatus 700 to a network, and an input output (I/O) interface 758. The apparatus 700 may operate based on an operating system stored in memory 732, such as Windows Server, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, or the like.
While the invention has been described in detail in the foregoing by way of general description, and specific embodiments and experiments, it will be apparent to those skilled in the art that modifications and improvements can be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof.

Claims (10)

1. A cross-modal retrieval method based on multi-level feature representation alignment is characterized by comprising the following steps:
acquiring a training data set, wherein for each group of data pairs in the training data set, the data pairs comprise image data, text data and semantic labels corresponding to the image data and the text data together;
for each group of data pairs in the training data set, respectively extracting image global features, image local features and image relation features corresponding to image data in the data pairs, and text global features, text local features and text relation features corresponding to text data in the data pairs;
for a target data pair consisting of any image data and any text data in the training data set, calculating to obtain image-text comprehensive similarity corresponding to the target data pair according to image global features and text global features corresponding to the target data pair, image local features and text local features corresponding to the target data pair, and image relation features and text relation features corresponding to the target data pair;
designing an inter-modal structural constraint loss function and an intra-modal structural constraint loss function based on the corresponding image-text comprehensive similarity of each group of target data, and training a neural network model by adopting the inter-modal structural constraint loss function and the intra-modal structural constraint loss function.
2. The method according to claim 1, wherein the step of extracting, for each group of data pairs in the training data set, an image global feature, an image local feature and an image relation feature corresponding to image data in the data pair, and a text global feature, a text local feature and a text relation feature corresponding to text data in the data pair, respectively, comprises:
for each group of data pairs in the training data set, extracting the image global characteristics of the image data corresponding to the data pairs by adopting a Convolutional Neural Network (CNN)
Figure 453193DEST_PATH_IMAGE001
Then, a visual target detector is used to detect the visual targets included in the image data and extract the image local features of each visual target
Figure 504325DEST_PATH_IMAGE002
WhereinMfor the number of visual objects comprised by the image data,
Figure 429556DEST_PATH_IMAGE003
as a visual target
Figure 653864DEST_PATH_IMAGE004
Extracting image relation characteristics among all visual targets through an image visual relation coding network
Figure 918623DEST_PATH_IMAGE005
Wherein
Figure 202974DEST_PATH_IMAGE006
as a visual target
Figure 287605DEST_PATH_IMAGE004
And a visual target
Figure 315603DEST_PATH_IMAGE007
Image relationship features between;
for each group of data pairs in the training data set, converting each word in the text data corresponding to the data pair into a word vector using a word embedding model
Figure 933404DEST_PATH_IMAGE008
WhereinNthe word quantity included in the text data is input into a recurrent neural network in sequence, and the global text feature corresponding to the text data is obtained
Figure 388656DEST_PATH_IMAGE009
Then, each word vector is input to a feedforward neural network to obtain the local text characteristics corresponding to each word
Figure 226162DEST_PATH_IMAGE010
Simultaneously, each word vector is input into a text relation coding network to extract text relation characteristics among words
Figure 792273DEST_PATH_IMAGE011
Wherein
Figure 766045DEST_PATH_IMAGE012
is a word
Figure 392199DEST_PATH_IMAGE004
Hehe word
Figure 513738DEST_PATH_IMAGE007
A textual relationship feature between.
3. The method according to claim 2, wherein the step of calculating, for a target data pair composed of any image data and any text data in the training data set, an image-text comprehensive similarity corresponding to the target data pair according to an image global feature and a text global feature corresponding to the target data pair, an image local feature and a text local feature corresponding to the target data pair, and an image relationship feature and a text relationship feature corresponding to the target data pair includes:
for a target data pair consisting of any image data and any text data in the training data set, based on image global features corresponding to the image data in the target data pair
Figure 821223DEST_PATH_IMAGE013
Global features of text corresponding to text data
Figure 977398DEST_PATH_IMAGE009
The cosine distance of the target data is calculated to obtain the image-text global similarity corresponding to the target data
Figure 774453DEST_PATH_IMAGE014
(ii) a Wherein image-text global similarity
Figure 837085DEST_PATH_IMAGE015
Is as in formula (1):
Figure 744998DEST_PATH_IMAGE016
) Formula (1)
Calculating the weight of each visual target included in the image data in the target data pair by adopting a text-guided attention mechanism, and carrying out local image feature on each visual target
Figure 693362DEST_PATH_IMAGE017
After weighting corresponding weight, obtaining new image local representation through feedforward neural network mapping
Figure 661318DEST_PATH_IMAGE018
Then, a visual guidance attention mechanism is adopted to calculate the weight of each word included in the text data in the target data pair, and the text local characteristics of each word are calculated
Figure 695134DEST_PATH_IMAGE019
After weighting corresponding weight, obtaining new text local representation through feedforward neural network mapping
Figure 406738DEST_PATH_IMAGE020
From the respective image partial representation
Figure 475188DEST_PATH_IMAGE018
Calculating cosine similarity of all visual targets and words according to local representation of each text, and calculating the local similarity of the target data to the corresponding image-text according to the mean value of the cosine similarity
Figure 614045DEST_PATH_IMAGE021
(ii) a Wherein the image-text local similarity
Figure 931894DEST_PATH_IMAGE021
The formula (2) is shown in the specification, wherein M is the number of visual objects, and N is the number of words:
Figure 883407DEST_PATH_IMAGE022
formula (2)
Calculating to obtain image-text relation similarity corresponding to the target data pair according to the cosine similarity mean value of each image relation feature and each text relation feature in the target data pair
Figure 806364DEST_PATH_IMAGE023
(ii) a Wherein, the similarity of image-text relationship
Figure 53805DEST_PATH_IMAGE023
Is as in formula (3),Pnumber of relationships representing image data and text data:
Figure 858950DEST_PATH_IMAGE024
formula (3)
According to the global similarity of the target data to the corresponding image-text
Figure 912357DEST_PATH_IMAGE014
Image-text local similarity
Figure 689820DEST_PATH_IMAGE021
Similarity of image-text relationship
Figure 170480DEST_PATH_IMAGE023
Is calculated toTo the corresponding image-text comprehensive similarity of the target data pair
Figure 728500DEST_PATH_IMAGE025
(ii) a Wherein, the image-text comprehensive similarity
Figure 756237DEST_PATH_IMAGE025
The calculation formula of (2) is as formula (4):
Figure 716103DEST_PATH_IMAGE026
equation (4).
4. The method according to claim 3, wherein the inter-modal structural constraint loss function is calculated as in equation (5),Bis the number of samples to be tested,
Figure 305347DEST_PATH_IMAGE027
in order to be a hyper-parameter of the model,
Figure 85084DEST_PATH_IMAGE028
for the pair of target data that is matched,
Figure 417977DEST_PATH_IMAGE029
and
Figure 232349DEST_PATH_IMAGE030
for non-matching target data pairs:
Figure 992494DEST_PATH_IMAGE031
formula (5)
The calculation formula of the intra-modal structure constraint loss function is shown as formula (6), wherein,
Figure 259528DEST_PATH_IMAGE032
for image triplets, in contrast to
Figure 894646DEST_PATH_IMAGE033
Figure 563525DEST_PATH_IMAGE034
And
Figure 494572DEST_PATH_IMAGE035
there are more common semantic labels that are present,
Figure 248901DEST_PATH_IMAGE036
for text triplets, in contrast to
Figure 251492DEST_PATH_IMAGE037
Figure 712561DEST_PATH_IMAGE038
And
Figure 876826DEST_PATH_IMAGE039
with more common semantic labels:
Figure 56134DEST_PATH_IMAGE040
equation (6).
5. The method of claim 4, wherein the step of training a neural network model using the inter-modal and intra-modal structural constraint loss functions comprises:
randomly sampling from the training data set to obtain a matched target data pair, a non-matched target data pair, an image triple and a text triple, respectively calculating an inter-modal structure constraint loss function value according to the inter-modal structure constraint loss function, calculating an intra-modal structure constraint loss function value according to the intra-modal structure constraint loss function, fusing according to a formula (7), and optimizing network parameters by using a back propagation algorithm:
Figure 596837DEST_PATH_IMAGE041
formula (7)
Wherein
Figure 416806DEST_PATH_IMAGE042
Is a hyper-parameter.
6. The method according to claim 2, wherein the extracting of the image relation features between the visual targets through the image visual relation coding network
Figure 17552DEST_PATH_IMAGE043
The method comprises the following steps:
obtaining visual objects in an image via an image visual object detector
Figure 418577DEST_PATH_IMAGE004
And a visual target
Figure 762971DEST_PATH_IMAGE007
Is characterized by
Figure 933052DEST_PATH_IMAGE044
Figure 704699DEST_PATH_IMAGE045
And the characteristics of the two target union regions
Figure 593021DEST_PATH_IMAGE046
And fusing the characteristics by adopting a formula (8), and calculating to obtain the relationship characteristics:
Figure 475526DEST_PATH_IMAGE047
formula (8)
Wherein]In order to perform the vector splicing operation,
Figure 264229DEST_PATH_IMAGE048
in order to function the activation of the neuron,
Figure 206777DEST_PATH_IMAGE049
are model parameters.
7. The method of claim 2, wherein inputting the word vectors into the text-relation coding network extracts text-relation features between words
Figure 582394DEST_PATH_IMAGE050
The method comprises the following steps:
in a text-relational coding network, words are calculated using equation (9)
Figure 268591DEST_PATH_IMAGE004
Hehe word
Figure 475581DEST_PATH_IMAGE007
Characteristic of textual relationship between
Figure 526714DEST_PATH_IMAGE051
Figure 451944DEST_PATH_IMAGE052
Formula (9)
Wherein,
Figure 613935DEST_PATH_IMAGE048
representing the neuron activation function as a model parameter.
8. The method of claim 3, wherein the step of removing the substrate comprises removing the substrate from the substrateCalculating the weight of each visual target included in the image data in the target data pair by adopting a text-guided attention mechanism, and calculating the image local characteristics of each visual target
Figure 941012DEST_PATH_IMAGE053
After weighting the corresponding weight, obtaining a new image local representation through feedforward neural network mapping, comprising the following steps:
using the text-guided attention mechanism, the weight of each visual object in the image is calculated by equation (10):
Figure 959783DEST_PATH_IMAGE054
formula (10)
Wherein,
Figure 808528DEST_PATH_IMAGE055
Figure 836527DEST_PATH_IMAGE056
is a model parameter;
each visual target is weighted by formula (11) and a new image local representation is obtained through feed-forward neural network mapping
Figure 18110DEST_PATH_IMAGE057
Figure 411045DEST_PATH_IMAGE058
Formula (11)
Wherein,
Figure 45289DEST_PATH_IMAGE059
are model parameters.
9. The method of claim 3, wherein said directing attention visually is performedThe force mechanism calculates the weight of each word included in the text data in the target data pair and the local text characteristics of each word
Figure 549082DEST_PATH_IMAGE019
After weighting corresponding weight, obtaining new text local representation through feedforward neural network mapping
Figure 850751DEST_PATH_IMAGE020
The method comprises the following steps:
using the visual guidance attention mechanism, the weight of each word in the text is calculated by equation (12):
Figure 476904DEST_PATH_IMAGE060
formula (12)
Wherein,
Figure 536127DEST_PATH_IMAGE061
Figure 905928DEST_PATH_IMAGE062
is a model parameter;
local text features for individual words by means of formula (13)
Figure 62103DEST_PATH_IMAGE019
Weighting corresponding weight, and obtaining new text local representation through feedforward neural network mapping
Figure 295376DEST_PATH_IMAGE020
Figure 904212DEST_PATH_IMAGE063
Formula (13)
Wherein,
Figure 749808DEST_PATH_IMAGE064
are model parameters.
10. The method of claim 1, wherein the training data set is obtained via Wikipedia, MS COCO, Pascal Voc.
CN202111149240.4A 2021-09-29 2021-09-29 Cross-modal retrieval method based on multi-level feature representation alignment Active CN113792207B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111149240.4A CN113792207B (en) 2021-09-29 2021-09-29 Cross-modal retrieval method based on multi-level feature representation alignment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111149240.4A CN113792207B (en) 2021-09-29 2021-09-29 Cross-modal retrieval method based on multi-level feature representation alignment

Publications (2)

Publication Number Publication Date
CN113792207A true CN113792207A (en) 2021-12-14
CN113792207B CN113792207B (en) 2023-11-17

Family

ID=78877521

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111149240.4A Active CN113792207B (en) 2021-09-29 2021-09-29 Cross-modal retrieval method based on multi-level feature representation alignment

Country Status (1)

Country Link
CN (1) CN113792207B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114239730A (en) * 2021-12-20 2022-03-25 华侨大学 Cross-modal retrieval method based on neighbor sorting relation
CN114880441A (en) * 2022-07-06 2022-08-09 北京百度网讯科技有限公司 Visual content generation method, device, system, equipment and medium
CN115129917A (en) * 2022-06-06 2022-09-30 武汉大学 optical-SAR remote sensing image cross-modal retrieval method based on modal common features
CN115712740A (en) * 2023-01-10 2023-02-24 苏州大学 Method and system for multi-modal implication enhanced image text retrieval
US20230162490A1 (en) * 2021-11-19 2023-05-25 Salesforce.Com, Inc. Systems and methods for vision-language distribution alignment
CN115827954B (en) * 2023-02-23 2023-06-06 中国传媒大学 Dynamic weighted cross-modal fusion network retrieval method, system and electronic equipment
CN116402063A (en) * 2023-06-09 2023-07-07 华南师范大学 Multi-modal irony recognition method, apparatus, device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110122A (en) * 2018-06-22 2019-08-09 北京交通大学 Image based on multilayer semanteme depth hash algorithm-text cross-module state retrieval
CN110490946A (en) * 2019-07-15 2019-11-22 同济大学 Text generation image method based on cross-module state similarity and generation confrontation network
CN111026894A (en) * 2019-12-12 2020-04-17 清华大学 Cross-modal image text retrieval method based on credibility self-adaptive matching network
CN112148916A (en) * 2020-09-28 2020-12-29 华中科技大学 Cross-modal retrieval method, device, equipment and medium based on supervision
CN112784092A (en) * 2021-01-28 2021-05-11 电子科技大学 Cross-modal image text retrieval method of hybrid fusion model
CN113157974A (en) * 2021-03-24 2021-07-23 西安维塑智能科技有限公司 Pedestrian retrieval method based on character expression

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110122A (en) * 2018-06-22 2019-08-09 北京交通大学 Image based on multilayer semanteme depth hash algorithm-text cross-module state retrieval
CN110490946A (en) * 2019-07-15 2019-11-22 同济大学 Text generation image method based on cross-module state similarity and generation confrontation network
CN111026894A (en) * 2019-12-12 2020-04-17 清华大学 Cross-modal image text retrieval method based on credibility self-adaptive matching network
CN112148916A (en) * 2020-09-28 2020-12-29 华中科技大学 Cross-modal retrieval method, device, equipment and medium based on supervision
CN112784092A (en) * 2021-01-28 2021-05-11 电子科技大学 Cross-modal image text retrieval method of hybrid fusion model
CN113157974A (en) * 2021-03-24 2021-07-23 西安维塑智能科技有限公司 Pedestrian retrieval method based on character expression

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230162490A1 (en) * 2021-11-19 2023-05-25 Salesforce.Com, Inc. Systems and methods for vision-language distribution alignment
CN114239730A (en) * 2021-12-20 2022-03-25 华侨大学 Cross-modal retrieval method based on neighbor sorting relation
CN115129917A (en) * 2022-06-06 2022-09-30 武汉大学 optical-SAR remote sensing image cross-modal retrieval method based on modal common features
CN115129917B (en) * 2022-06-06 2024-04-09 武汉大学 optical-SAR remote sensing image cross-modal retrieval method based on modal common characteristics
CN114880441A (en) * 2022-07-06 2022-08-09 北京百度网讯科技有限公司 Visual content generation method, device, system, equipment and medium
CN115712740A (en) * 2023-01-10 2023-02-24 苏州大学 Method and system for multi-modal implication enhanced image text retrieval
CN115827954B (en) * 2023-02-23 2023-06-06 中国传媒大学 Dynamic weighted cross-modal fusion network retrieval method, system and electronic equipment
CN116402063A (en) * 2023-06-09 2023-07-07 华南师范大学 Multi-modal irony recognition method, apparatus, device and storage medium
CN116402063B (en) * 2023-06-09 2023-08-15 华南师范大学 Multi-modal irony recognition method, apparatus, device and storage medium

Also Published As

Publication number Publication date
CN113792207B (en) 2023-11-17

Similar Documents

Publication Publication Date Title
CN107491541B (en) Text classification method and device
CN113792207A (en) Cross-modal retrieval method based on multi-level feature representation alignment
CN109522424B (en) Data processing method and device, electronic equipment and storage medium
WO2022011892A1 (en) Network training method and apparatus, target detection method and apparatus, and electronic device
CN110008401B (en) Keyword extraction method, keyword extraction device, and computer-readable storage medium
CN111931844B (en) Image processing method and device, electronic equipment and storage medium
US11856277B2 (en) Method and apparatus for processing video, electronic device, medium and product
CN110781305A (en) Text classification method and device based on classification model and model training method
CN111368541B (en) Named entity identification method and device
CN110175223A (en) A kind of method and device that problem of implementation generates
WO2022166069A1 (en) Deep learning network determination method and apparatus, and electronic device and storage medium
KR20210094445A (en) Method and device for processing information, and storage medium
CN113326768B (en) Training method, image feature extraction method, image recognition method and device
CN113515942A (en) Text processing method and device, computer equipment and storage medium
CN109558599B (en) Conversion method and device and electronic equipment
WO2023115911A1 (en) Object re-identification method and apparatus, electronic device, storage medium, and computer program product
CN116166843B (en) Text video cross-modal retrieval method and device based on fine granularity perception
CN111984749A (en) Method and device for ordering interest points
CN111259967A (en) Image classification and neural network training method, device, equipment and storage medium
CN112559673A (en) Language processing model training method and device, electronic equipment and storage medium
CN112381091B (en) Video content identification method, device, electronic equipment and storage medium
CN111339964B (en) Image processing method and device, electronic equipment and storage medium
CN111984765B (en) Knowledge base question-answering process relation detection method and device
CN116912478A (en) Object detection model construction, image classification method and electronic equipment
CN116484828A (en) Similar case determining method, device, apparatus, medium and program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant