CN114239805A - Cross-modal retrieval neural network, training method and device, electronic equipment and medium - Google Patents

Cross-modal retrieval neural network, training method and device, electronic equipment and medium

Info

Publication number
CN114239805A
CN114239805A
Authority
CN
China
Prior art keywords
cross
modal
image
text
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111535772.1A
Other languages
Chinese (zh)
Inventor
Peng Ying (彭滢)
Wu Jie (吴杰)
Zhu Lei (祝蕾)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Westone Information Industry Inc
Original Assignee
Chengdu Westone Information Industry Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Westone Information Industry Inc filed Critical Chengdu Westone Information Industry Inc
Priority to CN202111535772.1A priority Critical patent/CN114239805A/en
Publication of CN114239805A publication Critical patent/CN114239805A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The disclosure provides a cross-modal retrieval neural network and a training method, apparatus, device and medium. The text feature extraction network structure includes: a word vector embedding layer for converting words in a target text into corresponding word vectors; a fully connected layer, connected to the word vector embedding layer, for expanding the dimensionality of the word vectors to the dimensionality of the target image; a first GCN network layer, connected to the fully connected layer, for extracting local semantic relationship features among the word vectors; and a first biGRU, connected to the first GCN network layer, for extracting global semantic features of the target text. The image feature extraction network structure includes: an image detection network for detecting objects in a target image; a second GCN network layer, connected to the image detection network, for extracting local semantic relationship features among the objects; and a second biGRU, connected to the second GCN network layer, for extracting global semantic features of the target image. The resulting cross-modal retrieval performance is good.

Description

Cross-modal retrieval neural network, training method and device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of deep learning technologies, and in particular, to a cross-modal search neural network, a training method, an apparatus, an electronic device, and a medium.
Background
Currently, with the diversification of data recording modes, users may need to perform cross-modal retrieval, i.e., use one type of data as a query to retrieve related data of another type; for example, a user inputs text to retrieve images. Cross-modal retrieval is usually implemented by means of deep learning, which requires a corresponding neural network: for example, the neural network extracts image features through a deep Residual Network (ResNet), extracts text features using a Long Short-Term Memory network (LSTM), and then performs interaction between the image features and the text features by means of an attention layer or the like, so as to output a corresponding retrieval result. However, existing neural networks extract text features poorly and use a complex network structure to extract image features, so their cross-modal retrieval performance is poor.
In summary, how to improve the cross-modal retrieval performance of the neural network is a problem to be urgently solved by those skilled in the art.
Disclosure of Invention
The purpose of the present disclosure is to provide a cross-modal retrieval neural network that can, to a certain extent, solve the technical problem of how to improve the cross-modal retrieval performance of a neural network. The disclosure also provides a cross-modal retrieval neural network training method, an apparatus, an electronic device and a computer readable storage medium.
According to a first aspect of the embodiments of the present disclosure, a cross-modal search neural network is provided, which includes a text feature extraction network structure and an image feature extraction network structure;
the text feature extraction network structure includes: the word vector embedding layer is used for converting words in the target text into corresponding word vectors; the full connection layer is connected with the word vector embedding layer and is used for expanding the dimensionality of the word vector to the dimensionality of a target image; the first GCN network layer is connected with the full connection layer and used for extracting local semantic relation features among the word vectors; the first biGRU is connected with the first GCN network layer and used for extracting the global semantic features of the target text based on the local semantic relation features;
the image feature extraction network structure includes: an image detection network for detecting an object in the target image; the second GCN network layer is connected with the image detection network and is used for extracting the local semantic relation characteristics among the objects; and the second biGRU is connected with the second GCN network layer and is used for extracting the global semantic features of the target image based on the local semantic relation features.
Preferably, the first GCN network layer includes four GCN networks connected in sequence.
Preferably, the second GCN network layer includes four GCN networks connected in sequence.
Preferably, the number of the GCN network heads is 16.
Preferably, the image detection network is constructed based on a fast RCNN network.
According to a second aspect of the embodiments of the present disclosure, there is provided a cross-modal search neural network training method, including:
acquiring a training sample set and a verification sample set, wherein the training sample set and the verification sample set both comprise cross-modal sample pairs, and the cross-modal sample pairs comprise images and text descriptions of the images;
training a cross-modal retrieval neural network based on the training sample set;
inputting the verification images in the verification sample set into the cross-modal retrieval neural network;
determining a first type of text features generated by the cross-modal retrieval neural network and matched with the image features of the verification image and a second type of text features not matched with the image features;
determining a first parameter influencing the compactness degree of intra-class aggregation of the same label data and a second parameter influencing the dispersion degree of inter-class distances of different label data;
determining a loss value of the cross-modal search neural network based on the image characteristics, the first type of text features, the second type of text features, the first parameter, the second parameter and a cosine distance formula, so as to train the cross-modal search neural network based on the loss value;
the cross-modal retrieval neural network comprises a text feature extraction network structure and an image feature extraction network structure; the text feature extraction network structure includes: the word vector embedding layer is used for converting words in the target text into corresponding word vectors; the full connection layer is connected with the word vector embedding layer and is used for expanding the dimensionality of the word vector to the dimensionality of a target image; the first GCN network layer is connected with the full connection layer and used for extracting local semantic relation features among the word vectors; the first biGRU is connected with the first GCN network layer and used for extracting the global semantic features of the target text based on the local semantic relation features; the image feature extraction network structure includes: an image detection network for detecting an object in the target image; the second GCN network layer is connected with the image detection network and is used for extracting the local semantic relation characteristics among the objects; and the second biGRU is connected with the second GCN network layer and is used for extracting the global semantic features of the target image based on the local semantic relation features.
Preferably, the determining a loss value of the cross-modal search neural network based on the image characteristic, the first type of text feature, the second type of text feature, the first parameter, the second parameter, and a cosine distance formula includes:
determining the loss value of the cross-modal search neural network based on the image characteristics, the first type of text features, the second type of text features, the first parameter, the second parameter and a cosine distance formula through a loss function operation formula;
the loss function operation formula comprises:
$$L_{cosine} = \frac{1}{N}\sum_{i=1}^{N} -\log\frac{e^{\gamma\left(\cos(v_i,\,t_i^{+})-m\right)}}{e^{\gamma\left(\cos(v_i,\,t_i^{+})-m\right)}+\sum_{j=1}^{M} e^{\gamma\,\cos(v_i,\,t_{i,j}^{-})}}$$
wherein $L_{cosine}$ represents the loss value; N represents the target number of the cross-modal sample pairs in the validation sample set; i denotes the number of the cross-modal sample pair; γ represents the first parameter and m represents the second parameter; cos(·,·) denotes the cosine similarity; $v_i$ represents the image feature in the ith cross-modal sample pair; $t_i^{+}$ represents the first type of text feature in the ith cross-modal sample pair; $t_{i,j}^{-}$ represents the jth second type of text feature in the ith cross-modal sample pair; M represents the number of text features corresponding to the text description in the ith cross-modal sample pair.
According to a third aspect of the embodiments of the present disclosure, there is provided a cross-modal search neural network training apparatus, including:
the device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a training sample set and a verification sample set, the training sample set and the verification sample set both comprise cross-modal sample pairs, and the cross-modal sample pairs comprise images and text descriptions of the images;
a first training module for training a cross-modal search neural network based on the training sample set;
a first verification module, configured to input a verification image in the verification sample set into the cross-modal search neural network;
the first determination module is used for determining a first type of text features which are generated by the cross-modal retrieval neural network and matched with the image features of the verification image, and a second type of text features which are not matched with the image features;
the second determining module is used for determining a first parameter influencing the compactness degree of intra-class aggregation of the same label data and a second parameter influencing the dispersion degree of inter-class distances of different label data;
a third determining module, configured to determine a loss value of the cross-modal search neural network based on the image characteristic, the first type of text feature, the second type of text feature, the first parameter, the second parameter, and a cosine distance formula, so as to train the cross-modal search neural network based on the loss value;
the cross-modal retrieval neural network comprises a text feature extraction network structure and an image feature extraction network structure; the text feature extraction network structure includes: the word vector embedding layer is used for converting words in the target text into corresponding word vectors; the full connection layer is connected with the word vector embedding layer and is used for expanding the dimensionality of the word vector to the dimensionality of a target image; the first GCN network layer is connected with the full connection layer and used for extracting local semantic relation features among the word vectors; the first biGRU is connected with the first GCN network layer and used for extracting the global semantic features of the target text based on the local semantic relation features; the image feature extraction network structure includes: an image detection network for detecting an object in the target image; the second GCN network layer is connected with the image detection network and is used for extracting the local semantic relation characteristics among the objects; and the second biGRU is connected with the second GCN network layer and is used for extracting the global semantic features of the target image based on the local semantic relation features.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a memory for storing a computer program;
a processor for executing the computer program in the memory to implement the steps of any of the methods described above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the methods described above.
The invention provides a cross-modal retrieval neural network comprising a text feature extraction network structure and an image feature extraction network structure. The text feature extraction network structure includes: a word vector embedding layer for converting words in the target text into corresponding word vectors; a fully connected layer, connected to the word vector embedding layer, for expanding the dimensionality of the word vectors to the dimensionality of the target image; a first GCN network layer, connected to the fully connected layer, for extracting local semantic relationship features among the word vectors; and a first biGRU, connected to the first GCN network layer, for extracting global semantic features of the target text based on the local semantic relationship features. The image feature extraction network structure includes: an image detection network for detecting objects in a target image; a second GCN network layer, connected to the image detection network, for extracting local semantic relationship features among the objects; and a second biGRU, connected to the second GCN network layer, for extracting global semantic features of the target image based on the local semantic relationship features. The cross-modal retrieval neural network provided by the disclosure has a strong capability of extracting image and text features, and the image features and the text features share the same target dimension, which facilitates their unification and subsequent interaction, so the cross-modal retrieval performance is good. The cross-modal retrieval neural network training method, apparatus, electronic device and computer readable storage medium provided by the disclosure solve the corresponding technical problems.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present disclosure, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a schematic structural diagram illustrating a cross-modal retrieval neural network, according to an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a cross-modal retrieval neural network training method in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating a cross-modal retrieval neural network training apparatus in accordance with an exemplary embodiment;
fig. 4 is a block diagram illustrating an electronic device 900 in accordance with an example embodiment.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
Referring to fig. 1, fig. 1 is a schematic structural diagram illustrating a cross-modal retrieval neural network according to an exemplary embodiment.
The invention relates to a cross-modal retrieval neural network, which comprises a text feature extraction network structure and an image feature extraction network structure;
the text feature extraction network structure comprises: a word vector embedding layer 11 for converting words in the target text into corresponding word vectors; a fully connected layer (FC) 12, connected to the word vector embedding layer 11, for expanding the dimension of the word vectors to the dimension of the target image; a first GCN (Graph Convolutional Network) layer 13, connected to the fully connected layer 12, configured to extract local semantic relationship features between the word vectors; and a first biGRU (bidirectional Gated Recurrent Unit network) 14, connected to the first GCN network layer 13, configured to extract global semantic features of the target text based on the local semantic relationship features;
the image feature extraction network structure includes: an image detection network 21 for detecting objects in a target image; a second GCN network layer 22, connected to the image detection network 21, for extracting local semantic relationship features between the objects; and a second biGRU 23, connected to the second GCN network layer 22, for extracting global semantic features of the target image based on the local semantic relationship features.
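By way of illustration only, the two branches described above could be assembled roughly as follows. This is a PyTorch sketch under stated assumptions: the 300-dimensional word embeddings, 2048-dimensional region features, four-layer GCN stack and mean-pooled biGRU outputs are illustrative choices, and the SimpleGCN module is a generic single-head stand-in rather than the patent's exact GCN layer; region features are assumed to come from an external image detection network.

```python
import torch
import torch.nn as nn

class SimpleGCN(nn.Module):
    # One dense graph-convolution step over a fully connected graph of nodes;
    # a generic stand-in for the GCN layers referenced in the disclosure.
    def __init__(self, dim):
        super().__init__()
        self.w_phi = nn.Linear(dim, dim, bias=False)   # "query" projection
        self.w_psi = nn.Linear(dim, dim, bias=False)   # "key" projection
        self.w_g = nn.Linear(dim, dim, bias=False)     # GCN layer weight
        self.w_r = nn.Linear(dim, dim)                 # residual-structure weight

    def forward(self, nodes):                          # nodes: (batch, n, dim)
        rel = torch.softmax(self.w_phi(nodes) @ self.w_psi(nodes).transpose(1, 2), dim=-1)
        return nodes + self.w_r(rel @ self.w_g(nodes)) # residual update

class TextBranch(nn.Module):
    def __init__(self, vocab_size, word_dim=300, img_dim=2048, out_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)     # word vector embedding layer
        self.fc = nn.Linear(word_dim, img_dim)              # expand to the image feature dimension
        self.gcn = nn.Sequential(*[SimpleGCN(img_dim) for _ in range(4)])  # first GCN network layer
        self.bigru = nn.GRU(img_dim, out_dim // 2, batch_first=True, bidirectional=True)  # first biGRU

    def forward(self, token_ids):                           # token_ids: (batch, seq_len)
        x = self.gcn(self.fc(self.embed(token_ids)))        # local semantic relationship features
        out, _ = self.bigru(x)
        return out.mean(dim=1)                              # global semantic feature of the text

class ImageBranch(nn.Module):
    def __init__(self, img_dim=2048, out_dim=1024):
        super().__init__()
        # region features are assumed to be produced by an external image detection network
        self.gcn = nn.Sequential(*[SimpleGCN(img_dim) for _ in range(4)])  # second GCN network layer
        self.bigru = nn.GRU(img_dim, out_dim // 2, batch_first=True, bidirectional=True)  # second biGRU

    def forward(self, region_feats):                        # region_feats: (batch, n_regions, img_dim)
        x = self.gcn(region_feats)                          # local semantic relations between objects
        out, _ = self.bigru(x)
        return out.mean(dim=1)                              # global semantic feature of the image
```

Both branches end in features of the same output dimension, which is the property the disclosure relies on for unifying and interacting the image and text features.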
It should be noted that other structures in the cross-modal search neural network provided by the present disclosure, such as the feature interaction network, may be determined according to actual needs, and specific parameters of the cross-modal search neural network may be determined according to actual application scenarios, which is not specifically limited in the present disclosure.
The invention provides a cross-modal retrieval neural network comprising a text feature extraction network structure and an image feature extraction network structure. The text feature extraction network structure includes: a word vector embedding layer for converting words in the target text into corresponding word vectors; a fully connected layer, connected to the word vector embedding layer, for expanding the dimensionality of the word vectors to the dimensionality of the target image; a first GCN network layer, connected to the fully connected layer, for extracting local semantic relationship features among the word vectors; and a first biGRU, connected to the first GCN network layer, for extracting global semantic features of the target text based on the local semantic relationship features. The image feature extraction network structure includes: an image detection network for detecting objects in a target image; a second GCN network layer, connected to the image detection network, for extracting local semantic relationship features among the objects; and a second biGRU, connected to the second GCN network layer, for extracting global semantic features of the target image based on the local semantic relationship features. The cross-modal retrieval neural network provided by the disclosure has a strong capability of extracting image and text features, and the image features and the text features share the same target dimension, which facilitates their unification and subsequent interaction, so the cross-modal retrieval performance is good.
In the cross-modal retrieval neural network according to the disclosure, in order to unify the feature dimensions between the image and the text, the first GCN network layer may comprise four sequentially connected GCN networks, and the second GCN network layer may likewise comprise four sequentially connected GCN networks. It is understood that the number of heads of the GCN networks in the first GCN network layer and the second GCN network layer may be 16, and the like, and the image detection network may be constructed based on the Fast RCNN network, and the like.
It should be noted that, in order to further ensure that the feature dimensions of the image and the text are consistent, the structures at the corresponding levels of the first GCN network layer and the second GCN network layer may be identical, and the structures at the corresponding levels of the first biGRU and the second biGRU may also be identical. For example, the GCN relationship inference processes in this disclosure can be expressed as:
$$R(v_k, v_j) = \varphi(v_k)^{\top}\,\psi(v_j) = \left(W_{\varphi}\, v_k\right)^{\top}\left(W_{\psi}\, v_j\right), \qquad V^{*} = W_r\left(R\, V\, W_g\right) + V$$
wherein $R(v_k, v_j)$ represents the relationship between two objects $v_k$ and $v_j$ in the image; $W_{\varphi}$ and $W_{\psi}$ represent learnable weights; V represents the set of object vectors in the image; $W_r$ represents the weights of the residual structure; $W_g$ represents the weights of the GCN layer; and N represents the number of GCN heads over which the relation reasoning is repeated.
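As a hedged sketch of this relation-reasoning step (the original formula is only available as an image in the source, so the dot-product affinity with a softmax, the multi-head split and the residual update below are an interpretation consistent with the symbols W_φ, W_ψ, W_g, W_r and N defined above; the feature dimension of 2048 and 16 heads are example values):

```python
import torch
import torch.nn as nn

class MultiHeadRelationGCN(nn.Module):
    # Multi-head relation reasoning over the detected objects, following the
    # symbol definitions above; an interpretation, not the patent's exact layer.
    def __init__(self, dim=2048, n_heads=16):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.w_phi = nn.Linear(dim, dim, bias=False)   # W_phi
        self.w_psi = nn.Linear(dim, dim, bias=False)   # W_psi
        self.w_g = nn.Linear(dim, dim, bias=False)     # W_g, GCN layer weight
        self.w_r = nn.Linear(dim, dim)                 # W_r, residual-structure weight

    def forward(self, v):                              # v: (batch, n_objects, dim)
        b, n, _ = v.shape
        split = lambda x: x.view(b, n, self.n_heads, self.head_dim).transpose(1, 2)
        q, k, g = split(self.w_phi(v)), split(self.w_psi(v)), split(self.w_g(v))
        rel = torch.softmax(q @ k.transpose(-2, -1), dim=-1)   # R(v_k, v_j) per head
        out = (rel @ g).transpose(1, 2).reshape(b, n, -1)      # aggregate related objects
        return v + self.w_r(out)                               # residual update V* = W_r(R V W_g) + V

# e.g. refined = MultiHeadRelationGCN()(torch.randn(2, 36, 2048))
```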
Referring to fig. 2, fig. 2 is a flowchart of a cross-modal search neural network training method according to an embodiment of the present disclosure.
The cross-modal retrieval neural network training method provided by the embodiment of the disclosure can comprise the following steps:
step S101: the method comprises the steps of obtaining a training sample set and a verification sample set, wherein the training sample set and the verification sample set respectively comprise cross-modal sample pairs, and the cross-modal sample pairs comprise images and text descriptions of the images.
It can be understood that, because the cross-modal retrieval neural network provided by the present disclosure is used for retrieving images or texts, both the acquired training sample set and the verification sample set need to include cross-modal sample pairs, each consisting of an image and a text description of that image. The specific type of pair may be determined according to the application scenario: for example, the picture may be a picture of high-heeled shoes with a leopard-print pattern or of leopard-print women's sports shoes, and the text description may be "leopard print women's shoes", which is not specifically limited in this disclosure.
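Purely for illustration, such a cross-modal sample pair can be held as an (image, text description) record; a minimal PyTorch Dataset along these lines might look as follows, where the field names, image transform and tokenizer are hypothetical and stand in for whatever preprocessing the application scenario requires.

```python
from dataclasses import dataclass
from PIL import Image
from torch.utils.data import Dataset

@dataclass
class CrossModalPair:
    image_path: str       # e.g. a picture of leopard-print women's shoes
    caption: str          # e.g. "leopard print women's shoes"

class CrossModalPairDataset(Dataset):
    # Yields one image together with its text description, as required for
    # both the training sample set and the verification sample set.
    def __init__(self, pairs, image_transform, tokenizer):
        self.pairs, self.transform, self.tokenizer = pairs, image_transform, tokenizer

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        pair = self.pairs[idx]
        image = self.transform(Image.open(pair.image_path).convert("RGB"))
        tokens = self.tokenizer(pair.caption)
        return image, tokens
```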
Step S102: and training the cross-modal retrieval neural network based on the training sample set.
It can be understood that, after the training sample set is obtained, the cross-modal search neural network may be trained based on the training sample set, and the training process may be determined according to an application scenario, which is not specifically limited herein.
Step S103: and inputting the verification images in the verification sample set into the cross-modal retrieval neural network.
It can be understood that after the cross-modal retrieval neural network has been trained on the training sample set, its degree of training can be verified on the verification sample set: the verification images in the verification sample set are input into the cross-modal retrieval neural network so that it generates text features corresponding to each verification image, and comparing these text features with the verification text descriptions in the verification sample set reflects the training effect of the cross-modal retrieval neural network.
Step S104: a first type of text features generated by the cross-modal retrieval neural network that match the image features of the verification image, and a second type of text features that do not match the image features, are determined.
It can be understood that, in the training process of the cross-modal retrieval neural network, a corresponding loss value may be calculated in order to quantify the training effect and guide the training process; the first type of text features that match the image features of the verification image and the second type of text features that do not match are the quantities needed for this loss calculation.
Step S105: a first parameter affecting the degree of compactness of intra-class aggregation of the same label data and a second parameter affecting the degree of dispersion of inter-class distances of different label data are determined.
It can be understood that after determining the first type of text features generated by the cross-modal search neural network and matching with the image features of the image and the second type of text features not matching with the image features, a first parameter affecting the compactness degree of aggregation in the classes of the same label data and a second parameter affecting the dispersion degree of the distance between the classes of different label data are also determined, so that the loss value can be accurately calculated by combining the first parameter and the second parameter.
Step S106: a loss value of the cross-modal retrieval neural network is determined based on the image characteristics, the first type of text features, the second type of text features, the first parameter, the second parameter and a cosine distance formula, and the cross-modal retrieval neural network is trained based on the loss value.
It can be understood that after the first type of text feature, the second type of text feature, the first parameter, and the second parameter are determined, the loss value of the cross-modal search neural network can be determined by integrating the image characteristics, the first type of text feature, the second type of text feature, the first parameter, the second parameter, and the cosine distance formula, so as to train the cross-modal search neural network based on the loss value.
According to the method and the device, in the training process of the cross-modal retrieval neural network, when the loss value of the cross-modal retrieval neural network is calculated, the loss value is calculated based on the image characteristics, the first type of text characteristics, the second type of text characteristics, the first parameter, the second parameter and the cosine distance formula, the number of types of parameters participating in calculation is large, the calculation accuracy of the loss value can be improved, and the training efficiency of the cross-modal retrieval neural network can be improved.
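Putting the above steps together, a schematic training loop might look as follows. This is a PyTorch sketch assuming the image and text branches sketched earlier in this description and a loss function of the form given in the following paragraphs; the batching, device handling and optimizer details are illustrative rather than prescribed by the disclosure.

```python
def train_epoch(image_branch, text_branch, loader, loss_fn, optimizer, device="cpu"):
    # One pass over cross-modal sample pairs: encode both modalities,
    # compute the cosine-distance-based loss, and update the network.
    image_branch.train()
    text_branch.train()
    for region_feats, token_ids in loader:          # one batch of (image, text) pairs
        region_feats = region_feats.to(device)
        token_ids = token_ids.to(device)
        img_feats = image_branch(region_feats)      # global semantic features of the images
        txt_feats = text_branch(token_ids)          # global semantic features of the texts
        loss = loss_fn(img_feats, txt_feats)        # matched pairs sit on the diagonal
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```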
In practical application, in the process of determining the loss value of the cross-modal search neural network based on the image characteristics, the first type of text characteristics, the second type of text characteristics, the first parameter, the second parameter and the cosine distance formula, in order to quickly determine the loss value, the loss value of the cross-modal search neural network can be determined based on the image characteristics, the first type of text characteristics, the second type of text characteristics, the first parameter, the second parameter and the cosine distance formula through a loss function operation formula;
the loss function operation formula comprises:
$$L_{cosine} = \frac{1}{N}\sum_{i=1}^{N} -\log\frac{e^{\gamma\left(\cos(v_i,\,t_i^{+})-m\right)}}{e^{\gamma\left(\cos(v_i,\,t_i^{+})-m\right)}+\sum_{j=1}^{M} e^{\gamma\,\cos(v_i,\,t_{i,j}^{-})}}$$
wherein $L_{cosine}$ represents the loss value; N represents the target number of cross-modal sample pairs in the validation sample set; i denotes the number of the cross-modal sample pair; γ represents the first parameter and m represents the second parameter; cos(·,·) denotes the cosine similarity; $v_i$ represents the image feature in the ith cross-modal sample pair; $t_i^{+}$ represents the first type of text feature in the ith cross-modal sample pair; $t_{i,j}^{-}$ represents the jth second type of text feature in the ith cross-modal sample pair; M represents the number of text features corresponding to the text description in the ith cross-modal sample pair.
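Read this way, the formula can be sketched in PyTorch as a batch-wise cosine-margin loss, where the matched text of each image sits on the diagonal of the similarity matrix and the remaining texts in the batch play the role of the second type of text features. This is one plausible reading of the original formula (which is only available as an image in the source), and the example values of γ and m are illustrative.

```python
import torch
import torch.nn.functional as F

def cosine_margin_loss(img_feats, txt_feats, gamma=30.0, margin=0.2):
    # img_feats, txt_feats: (N, d); row i of txt_feats is the first-type
    # (matching) text feature of image i, the other rows act as the
    # second-type (non-matching) text features. gamma tightens intra-class
    # aggregation, margin spreads the inter-class distances.
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    cos = img @ txt.t()                                   # (N, N) cosine similarities
    logits = gamma * (cos - margin * torch.eye(len(cos), device=cos.device))
    targets = torch.arange(len(cos), device=cos.device)   # the diagonal is the positive pair
    return F.cross_entropy(logits, targets)
```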
In order to facilitate understanding of the retrieval effect of the cross-modal retrieval neural network provided by the disclosure, an ablation experiment was performed on three models, with the DCG@5 score as the evaluation standard. Model 1 uses the cross-modal retrieval neural network and the loss function provided by the disclosure. Model 2 uses the image feature extraction layer proposed by the disclosure, while its text feature extraction layer consists only of the word vector embedding layer and the biGRU, and its loss function is the loss function of the disclosure. Model 3 uses the image feature extraction layer proposed by the disclosure, its text feature extraction layer likewise consists only of the word vector embedding layer and the biGRU, and its loss function is a triplet loss function. The final ablation test score results are shown in Table 1.
TABLE 1. Ablation test score results

Model    Model 1    Model 2    Model 3
Score    0.7039     0.6796     0.5778
As can be seen from the ablation scores in Table 1, Model 1 achieves the highest score and the best retrieval accuracy. This is because the cross-modal retrieval neural network provided by the disclosure models the text feature extraction layer in more detail, so the network can better learn the relationships between different modalities and thus improve the cross-modal retrieval effect; the loss function provided by the disclosure also significantly improves the cross-modal retrieval effect.
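For reference, DCG@5 as used above can be computed with the standard discounted cumulative gain definition (assumed here, since the disclosure does not spell it out): the graded relevance of each of the top five retrieved results is discounted by the logarithm of its rank.

```python
import math

def dcg_at_k(relevances, k=5):
    # relevances: graded relevance of the retrieval results, ordered by the
    # model's ranking, e.g. [1, 0, 1, 0, 0] for binary relevance labels.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

# e.g. dcg_at_k([1, 0, 1, 0, 0])  ->  1.0 + 0.5 = 1.5
```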
Referring to fig. 3, fig. 3 is a schematic structural diagram of a cross-modal search neural network training device according to an embodiment of the present disclosure.
The training apparatus 100 for cross-modal search neural network provided in the embodiments of the present disclosure may include:
a first obtaining module 110, configured to obtain a training sample set and a verification sample set, where the training sample set and the verification sample set both include a cross-modal sample pair, and the cross-modal sample pair includes an image and a text description of the image;
a first training module 120, configured to train a cross-modal search neural network based on a training sample set;
a first verification module 130, configured to input a verification image in a verification sample set into the cross-modal retrieval neural network;
a first determining module 140, configured to determine a first type of text features generated by the cross-modal retrieval neural network that match the image features of the verification image, and a second type of text features that do not match the image features;
a second determining module 150, configured to determine a first parameter that affects a compactness degree of intra-class aggregation of the same tag data and a second parameter that affects a dispersion degree of inter-class distances of different tag data;
the third determining module 160 is configured to determine a loss value of the cross-modal search neural network based on the image characteristic, the first-class text characteristic, the second-class text characteristic, the first parameter, the second parameter, and the cosine distance formula, so as to train the cross-modal search neural network based on the loss value.
In the training apparatus for a cross-modal search neural network provided in the embodiments of the present disclosure, the third determining module may be specifically configured to: determining a loss value of the cross-modal retrieval neural network based on the image characteristics, the first type of text characteristics, the second type of text characteristics, the first parameter, the second parameter and a cosine distance formula through a loss function operation formula;
the loss function operation formula comprises:
$$L_{cosine} = \frac{1}{N}\sum_{i=1}^{N} -\log\frac{e^{\gamma\left(\cos(v_i,\,t_i^{+})-m\right)}}{e^{\gamma\left(\cos(v_i,\,t_i^{+})-m\right)}+\sum_{j=1}^{M} e^{\gamma\,\cos(v_i,\,t_{i,j}^{-})}}$$
wherein $L_{cosine}$ represents the loss value; N represents the target number of cross-modal sample pairs in the validation sample set; i denotes the number of the cross-modal sample pair; γ represents the first parameter and m represents the second parameter; cos(·,·) denotes the cosine similarity; $v_i$ represents the image feature in the ith cross-modal sample pair; $t_i^{+}$ represents the first type of text feature in the ith cross-modal sample pair; $t_{i,j}^{-}$ represents the jth second type of text feature in the ith cross-modal sample pair; M represents the number of text features corresponding to the text description in the ith cross-modal sample pair.
Fig. 4 is a block diagram illustrating an electronic device 900 in accordance with an example embodiment. As shown in fig. 4, the electronic device 900 may include: a processor 901 and a memory 902. The electronic device 900 may also include one or more of a multimedia component 903, an input/output (I/O) interface 904, and a communications component 905.
The processor 901 is configured to control the overall operation of the electronic device 900, so as to complete all or part of the steps in the above cross-modal retrieval neural network training method. The memory 902 is used to store various types of data to support operation of the electronic device 900, such as instructions for any application or method operating on the electronic device 900 and application-related data, such as contact data, transmitted and received messages, pictures, audio, video, and the like. The memory 902 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. The multimedia component 903 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 902 or transmitted through the communication component 905. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 904 provides an interface between the processor 901 and other interface modules, such as a keyboard, mouse, or buttons. These buttons may be virtual buttons or physical buttons. The communication component 905 is used for wired or wireless communication between the electronic device 900 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them; accordingly, the communication component 905 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
In an exemplary embodiment, the electronic Device 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described cross-modal search neural network training method.
In another exemplary embodiment, a computer readable storage medium comprising program instructions is also provided which, when executed by a processor, implement the steps of the cross-modal retrieval neural network training method described above. For example, the computer readable storage medium may be the memory 902 described above, which includes program instructions executable by the processor 901 of the electronic device 900 to perform the cross-modal retrieval neural network training method described above.
For a description of relevant parts in the training apparatus, the electronic device, and the computer-readable storage medium for cross-modal search neural network provided in the embodiments of the present disclosure, reference is made to detailed descriptions of corresponding parts in the training method for cross-modal search neural network provided in the embodiments of the present disclosure, and details are not repeated here. In addition, parts of the above technical solutions provided in the embodiments of the present disclosure that are consistent with the implementation principle of the corresponding technical solutions in the prior art are not described in detail, so as to avoid redundant description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A cross-modal retrieval neural network is characterized by comprising a text feature extraction network structure and an image feature extraction network structure;
the text feature extraction network structure includes: the word vector embedding layer is used for converting words in the target text into corresponding word vectors; the full connection layer is connected with the word vector embedding layer and is used for expanding the dimensionality of the word vector to the dimensionality of a target image; the first GCN network layer is connected with the full connection layer and used for extracting local semantic relation features among the word vectors; the first biGRU is connected with the first GCN network layer and used for extracting the global semantic features of the target text based on the local semantic relation features;
the image feature extraction network structure includes: an image detection network for detecting an object in the target image; the second GCN network layer is connected with the image detection network and is used for extracting the local semantic relation characteristics among the objects; and the second biGRU is connected with the second GCN network layer and is used for extracting the global semantic features of the target image based on the local semantic relation features.
2. The cross-modal search neural network of claim 1, wherein the first GCN network layer comprises four GCN networks connected in series.
3. The cross-modal retrieval neural network of claim 2, wherein the second GCN network layer comprises four GCN networks connected in series.
4. The cross-modal search neural network of claim 3, wherein the number of heads of the GCN network is 16.
5. The cross-modal search neural network of any one of claims 1 to 4, wherein the image detection network is constructed based on the fast RCNN network.
6. A cross-modal search neural network training method is characterized by comprising the following steps:
acquiring a training sample set and a verification sample set, wherein the training sample set and the verification sample set both comprise cross-modal sample pairs, and the cross-modal sample pairs comprise images and text descriptions of the images;
training a cross-modal retrieval neural network based on the training sample set;
inputting the verification images in the verification sample set into the cross-modal retrieval neural network;
determining a first type of text features generated by the cross-modal retrieval neural network and matched with the image features of the verification image and a second type of text features not matched with the image features;
determining a first parameter influencing the compactness degree of intra-class aggregation of the same label data and a second parameter influencing the dispersion degree of inter-class distances of different label data;
determining a loss value of the cross-modal search neural network based on the image characteristics, the first type of text features, the second type of text features, the first parameter, the second parameter and a cosine distance formula, so as to train the cross-modal search neural network based on the loss value;
the cross-modal retrieval neural network comprises a text feature extraction network structure and an image feature extraction network structure; the text feature extraction network structure includes: the word vector embedding layer is used for converting words in the target text into corresponding word vectors; the full connection layer is connected with the word vector embedding layer and is used for expanding the dimensionality of the word vector to the dimensionality of a target image; the first GCN network layer is connected with the full connection layer and used for extracting local semantic relation features among the word vectors; the first biGRU is connected with the first GCN network layer and used for extracting the global semantic features of the target text based on the local semantic relation features; the image feature extraction network structure includes: an image detection network for detecting an object in the target image; the second GCN network layer is connected with the image detection network and is used for extracting the local semantic relation characteristics among the objects; and the second biGRU is connected with the second GCN network layer and is used for extracting the global semantic features of the target image based on the local semantic relation features.
7. The method of claim 6, wherein determining the loss value of the cross-modal search neural network based on the image characteristic, the first type of text feature, the second type of text feature, the first parameter, the second parameter, and a cosine distance formula comprises:
determining the loss value of the cross-modal search neural network based on the image characteristics, the first type of text features, the second type of text features, the first parameter, the second parameter and a cosine distance formula through a loss function operation formula;
the loss function operation formula comprises:
$$L_{cosine} = \frac{1}{N}\sum_{i=1}^{N} -\log\frac{e^{\gamma\left(\cos(v_i,\,t_i^{+})-m\right)}}{e^{\gamma\left(\cos(v_i,\,t_i^{+})-m\right)}+\sum_{j=1}^{M} e^{\gamma\,\cos(v_i,\,t_{i,j}^{-})}}$$
wherein $L_{cosine}$ represents the loss value; N represents the target number of the cross-modal sample pairs in the validation sample set; i denotes the number of the cross-modal sample pair; γ represents the first parameter and m represents the second parameter; cos(·,·) denotes the cosine similarity; $v_i$ represents the image feature in the ith cross-modal sample pair; $t_i^{+}$ represents the first type of text feature in the ith cross-modal sample pair; $t_{i,j}^{-}$ represents the jth second type of text feature in the ith cross-modal sample pair; M represents the number of text features corresponding to the text description in the ith cross-modal sample pair.
8. A cross-modal search neural network training device, comprising:
the device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a training sample set and a verification sample set, the training sample set and the verification sample set both comprise cross-modal sample pairs, and the cross-modal sample pairs comprise images and text descriptions of the images;
a first training module for training a cross-modal search neural network based on the training sample set;
a first verification module, configured to input a verification image in the verification sample set into the cross-modal search neural network;
the first determination module is used for determining a first type of text features which are generated by the cross-modal retrieval neural network and matched with the image features of the verification image, and a second type of text features which are not matched with the image features;
the second determining module is used for determining a first parameter influencing the compactness degree of intra-class aggregation of the same label data and a second parameter influencing the dispersion degree of inter-class distances of different label data;
a third determining module, configured to determine a loss value of the cross-modal search neural network based on the image characteristic, the first type of text feature, the second type of text feature, the first parameter, the second parameter, and a cosine distance formula, so as to train the cross-modal search neural network based on the loss value;
the cross-modal retrieval neural network comprises a text feature extraction network structure and an image feature extraction network structure; the text feature extraction network structure includes: the word vector embedding layer is used for converting words in the target text into corresponding word vectors; the full connection layer is connected with the word vector embedding layer and is used for expanding the dimensionality of the word vector to the dimensionality of a target image; the first GCN network layer is connected with the full connection layer and used for extracting local semantic relation features among the word vectors; the first biGRU is connected with the first GCN network layer and used for extracting the global semantic features of the target text based on the local semantic relation features; the image feature extraction network structure includes: an image detection network for detecting an object in the target image; the second GCN network layer is connected with the image detection network and is used for extracting the local semantic relation characteristics among the objects; and the second biGRU is connected with the second GCN network layer and is used for extracting the global semantic features of the target image based on the local semantic relation features.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 6 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 6 to 7.
CN202111535772.1A 2021-12-15 2021-12-15 Cross-modal retrieval neural network, training method and device, electronic equipment and medium Pending CN114239805A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111535772.1A CN114239805A (en) 2021-12-15 2021-12-15 Cross-modal retrieval neural network, training method and device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111535772.1A CN114239805A (en) 2021-12-15 2021-12-15 Cross-modal retrieval neural network, training method and device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN114239805A true CN114239805A (en) 2022-03-25

Family

ID=80756463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111535772.1A Pending CN114239805A (en) 2021-12-15 2021-12-15 Cross-modal retrieval neural network, training method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN114239805A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114896429A (en) * 2022-07-12 2022-08-12 苏州浪潮智能科技有限公司 Image-text mutual detection method, system, equipment and computer readable storage medium
CN114896429B (en) * 2022-07-12 2022-12-27 苏州浪潮智能科技有限公司 Image-text mutual inspection method, system, equipment and computer readable storage medium
WO2024011814A1 (en) * 2022-07-12 2024-01-18 苏州元脑智能科技有限公司 Image-text mutual retrieval method, system and device, and nonvolatile readable storage medium
CN117112829A (en) * 2023-10-24 2023-11-24 吉林大学 Medical data cross-modal retrieval method and device and related equipment
CN117112829B (en) * 2023-10-24 2024-02-02 吉林大学 Medical data cross-modal retrieval method and device and related equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination