CN111598712A - Training and searching method for data feature generator in social media cross-modal search

Info

Publication number
CN111598712A
Authority
CN
China
Prior art keywords
information
modal
generator
representation
data information
Prior art date
Legal status
Granted
Application number
CN202010418678.7A
Other languages
Chinese (zh)
Other versions
CN111598712B (en)
Inventor
杜军平
周南
崔婉秋
寇菲菲
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202010418678.7A
Publication of CN111598712A
Application granted
Publication of CN111598712B
Legal status: Active
Anticipated expiration

Classifications

    • G06Q 50/01 Social networking (ICT specially adapted for implementation of business processes of specific business sectors)
    • G06F 16/35 Information retrieval of unstructured textual data; Clustering; Classification
    • G06F 16/36 Information retrieval of unstructured textual data; Creation of semantic tools, e.g. ontology or thesauri
    • G06N 3/045 Neural networks; Combinations of networks
    • G06N 3/08 Neural networks; Learning methods
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a training method and a searching method for a data feature generator in social media cross-modal search. The training method comprises: obtaining a training sample set; based on the training sample set, using a generator trained by adversarial learning to obtain the representation feature of each piece of data information; supervising the generator adversarially through a discriminator; alternately fixing the discriminator to tune and optimize the generator, and fixing the generator to tune and optimize the discriminator; and iterating multiple times to obtain the final generator. The searching method comprises: inputting the data information to be searched into the generator to obtain its representation feature; traversing the existing data information of the target modality and obtaining the representation features the generator produces for it; and obtaining, based on similarity matching, the one or more pieces of existing target-modality data information closest to the representation feature of the data information to be searched. The method adapts to the semantic sparsity of data information in social media and achieves accurate search of cross-modal data information.

Description

Training and searching method for data feature generator in social media cross-modal search
Technical Field
The invention relates to the technical field of data search, in particular to a training and searching method for a data feature generator in social media cross-modal search.
Background
The prerequisite for searching cross-modal data content in a social network is mining search features from the social network data, for which two strategies are mainly used: manual analysis and mining of search features, and search feature mining based on machine learning. Social media carries a huge volume of data, and its texts are short and irregular, so they suffer from semantic sparsity; likewise, images in social networks often have low resolution and incomplete composition, causing a semantic sparsity problem similar to that of social network text. Given these characteristics, manual search feature analysis cannot keep up with the huge data volume in a social network, and existing machine learning methods struggle to extract features from semantically sparse texts or images. It is therefore difficult to search between data contents of different modalities.
Disclosure of Invention
In view of this, the embodiment of the present invention provides a training and searching method for a data feature generator in social media cross-modal search, so as to solve the problem that the cross-modal search cannot be performed on data information in social media in the prior art.
The technical scheme of the invention is as follows:
In one aspect, the present invention provides a training method for a data representation feature generator in social media cross-modal search, comprising:
obtaining a training sample set, the training sample set comprising: social media data information of multiple modalities, with the topic to which each piece of data information belongs and its corresponding modality as labels; wherein the data information of multiple modalities comprises: text modality information and image modality information;
obtaining, with a generator, the representation feature of each piece of data information based on the training sample set, the generator comprising: a text modality generator and an image modality generator, which are used for obtaining the original features of the data information in the corresponding modalities, segmenting each original feature into a plurality of corresponding local features, and, based on the local features, obtaining through a self-attention mechanism the representation features of the data information of each modality within the same representation subspace;
supervising the generator adversarially by means of a discriminator, the loss functions employed by the discriminator comprising: a generation loss function obtained by weighted summation of an intra-modal semantic loss function and an inter-modal similarity loss function, and a cross-modal discriminant loss function; wherein the distribution difference between the representation features and the corresponding topic labels is minimized by minimizing the calculated value of the intra-modal semantic loss function, the correlation between the representation features of different-modality data information under the same topic is maximized by minimizing the calculated value of the inter-modal similarity loss function, and the distinction between the representation features of different-modality data information with respect to modality is maximized by minimizing the calculated value of the cross-modal discriminant loss function;
tuning and optimizing the generator by minimizing the difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function; tuning and optimizing the discriminator by maximizing the difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function; and performing multiple iterations to obtain the final generator.
In some embodiments, obtaining the original features of data information of multiple modalities comprises:
obtaining the TF-IDF feature of each piece of text modality information as its original feature and the convolution feature of each piece of image modality information as its original feature, and denoting the original features of the data information as $X = \{x^t_1, x^t_2, \dots, x^t_M, x^v_1, x^v_2, \dots, x^v_N\}$, where $x^t_m$ is the original feature of the m-th piece of text modality information, $x^v_n$ is the original feature of the n-th piece of image modality information, $1 \le m \le M$, $1 \le n \le N$, and M and N are positive integers.
In some embodiments, segmenting each original feature to obtain a plurality of corresponding local features, and obtaining through a self-attention mechanism the representation features of the data information of each modality within the same representation subspace based on the local features, comprises:
dividing the TF-IDF feature of each piece of text modality information and the convolution feature of each piece of image modality information into k blocks each, denoted: $x^t_m = \{b^t_{m,1}, b^t_{m,2}, \dots, b^t_{m,k}\}$ and $x^v_n = \{b^v_{n,1}, b^v_{n,2}, \dots, b^v_{n,k}\}$, where $b^t_{m,k}$ is the k-th block text semantic feature of the m-th piece of text modality information and $b^v_{n,k}$ is the k-th block image semantic feature of the n-th piece of image modality information;

using functions $f_t$ and $g_t$ to convert the segmented text semantic features into features in the representation subspace:

$$f_t(b^t_{m,i}) = w^t_f\, b^t_{m,i}, \qquad g_t(b^t_{m,j}) = w^t_g\, b^t_{m,j}$$

where $w^t_f$ and $w^t_g$ are the parameter vectors of $f_t$ and $g_t$;

the attention parameter between the i-th block and the j-th block text semantic features of the m-th piece of text modality information is:

$$\alpha^t_{i,j} = \frac{\exp\big(f_t(b^t_{m,i})^{\top} g_t(b^t_{m,j})\big)}{\sum_{j=1}^{k}\exp\big(f_t(b^t_{m,i})^{\top} g_t(b^t_{m,j})\big)}$$

the output feature of the i-th block text semantic feature of the m-th piece of text modality information is:

$$o^t_{m,i} = \sum_{j=1}^{k} \alpha^t_{i,j}\, h_t(b^t_{m,j}), \qquad h_t(b^t_{m,j}) = w^t_h\, b^t_{m,j}$$

where $w^t_h$ is the parameter vector of $h_t$;

the representation feature of the m-th piece of text modality information is: $s^t_m = \{o^t_{m,1}, o^t_{m,2}, \dots, o^t_{m,k}\}$;

using functions $f_v$ and $g_v$ to convert the segmented image semantic features into features in the representation subspace:

$$f_v(b^v_{n,i}) = w^v_f\, b^v_{n,i}, \qquad g_v(b^v_{n,j}) = w^v_g\, b^v_{n,j}$$

where $w^v_f$ and $w^v_g$ are the parameter vectors of $f_v$ and $g_v$;

the attention parameter between the i-th block and the j-th block image semantic features of the n-th piece of image modality information is:

$$\alpha^v_{i,j} = \frac{\exp\big(f_v(b^v_{n,i})^{\top} g_v(b^v_{n,j})\big)}{\sum_{j=1}^{k}\exp\big(f_v(b^v_{n,i})^{\top} g_v(b^v_{n,j})\big)}$$

the output feature of the i-th block image semantic feature of the n-th piece of image modality information is:

$$o^v_{n,i} = \sum_{j=1}^{k} \alpha^v_{i,j}\, h_v(b^v_{n,j}), \qquad h_v(b^v_{n,j}) = w^v_h\, b^v_{n,j}$$

where $w^v_h$ is the parameter vector of $h_v$;

the representation feature of the n-th piece of image modality information is: $s^v_n = \{o^v_{n,1}, o^v_{n,2}, \dots, o^v_{n,k}\}$.
In some embodiments, the intra-modal semantic loss function is:

$$L_{label}(\theta_t, \theta_v) = -\frac{1}{M}\sum_{i=1}^{M} y^t_i \cdot \log p\big(s^t_i\big) \;-\; \frac{1}{N}\sum_{j=1}^{N} y^v_j \cdot \log p\big(s^v_j\big)$$

where $y^t_i$ and $y^v_j$ respectively denote the one-hot topic label vectors of the i-th piece of text modality information and the j-th piece of image modality information in the training sample set, with $y^t_i = y^v_j$ under the same topic; $s^t_i = G_t(x^t_i; \theta_t)$ is the representation feature corresponding to the i-th piece of text modality information produced by the text modality generator with parameter set $\theta_t$, and $x^t_i$ is the original feature of the i-th piece of text modality information; $s^v_j = G_v(x^v_j; \theta_v)$ is the representation feature corresponding to the j-th piece of image modality information produced by the image modality generator with parameter set $\theta_v$, and $x^v_j$ is the original feature of the j-th piece of image modality information; M is the number of pieces of text modality information and N the number of pieces of image modality information in the training sample set; the function $p(\cdot)$ processes $s^t_i$ and $s^v_j$ through a fully connected neural network into vectors of a dimension that can be multiplied with $y^t_i$ and/or $y^v_j$.
In some embodiments, the inter-modal similarity loss function is:

$$L_{similarity}(\theta_t, \theta_v) = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N} \big(y^t_i \cdot y^v_j\big)\, \big\| s^t_i - s^v_j \big\|_2$$

where $y^t_i$ and $y^v_j$ respectively denote the one-hot topic label vectors of the i-th piece of text modality information and the j-th piece of image modality information in the training sample set, with $y^t_i = y^v_j$ under the same topic; $s^t_i = G_t(x^t_i; \theta_t)$ is the representation feature corresponding to the i-th piece of text modality information produced by the text modality generator with parameter set $\theta_t$, and $x^t_i$ is the original feature of the i-th piece of text modality information; $s^v_j = G_v(x^v_j; \theta_v)$ is the representation feature corresponding to the j-th piece of image modality information produced by the image modality generator with parameter set $\theta_v$, and $x^v_j$ is the original feature of the j-th piece of image modality information; M is the number of pieces of text modality information and N the number of pieces of image modality information in the training sample set.

The generation loss function is: $L_{generation} = \alpha L_{label} + \beta L_{similarity}$, where α and β are the weight coefficients of the intra-modal semantic loss function and the inter-modal similarity loss function, respectively.
In some embodiments, the cross-modal discriminant loss function is:

$$L_{cross}(\theta_p) = -\frac{1}{E}\sum_{e=1}^{E}\Big( c_e \log D\big(s^t_e; \theta_p\big) + (1 - c_e) \log\big(1 - D(s^v_e; \theta_p)\big) \Big)$$

where $c_e$ is the modality label of the searched target data information in one-hot form; $s^t_e = G_t(x^t_e; \theta_t)$ is the representation feature corresponding to the e-th piece of text modality information produced by the text modality generator with parameter set $\theta_t$, and $x^t_e$ is the original feature of the e-th piece of text modality information; $s^v_e = G_v(x^v_e; \theta_v)$ is the representation feature corresponding to the e-th piece of image modality information produced by the image modality generator with parameter set $\theta_v$, and $x^v_e$ is the original feature of the e-th piece of image modality information; during training, text modality information and image modality information are input in pairs, and E is the number of data pairs; the function $D(\cdot; \theta_p)$, under the control of the parameter set $\theta_p$, maps the representation features of each piece of text modality information and each piece of image modality information into the same representation subspace.
On the other hand, the invention also provides a social media cross-modal data information searching method, which comprises the following steps:
inputting the data information to be searched into a generator to obtain the representation feature of the data information to be searched;
wherein the generator is obtained by adversarial learning based on a training sample set; the training sample set comprises social media data information of multiple modalities, with the topic to which each piece of data information belongs and its corresponding modality as labels, the data information of multiple modalities comprising text modality information and image modality information; the generator comprises a text modality generator and an image modality generator, which are used for obtaining the original features of the data information in the corresponding modalities, segmenting each original feature into a plurality of corresponding local features, and, based on the local features, obtaining through a self-attention mechanism the representation features of the data information of each modality within the same representation subspace; the generator is supervised adversarially by means of a discriminator, the loss functions employed by the discriminator comprising a generation loss function obtained by weighted summation of an intra-modal semantic loss function and an inter-modal similarity loss function, and a cross-modal discriminant loss function, wherein the distribution difference between the representation features and the corresponding topic labels is minimized by minimizing the calculated value of the intra-modal semantic loss function, the correlation between the representation features of different-modality data information under the same topic is maximized by minimizing the calculated value of the inter-modal similarity loss function, and the distinction between the representation features of different-modality data information with respect to modality is maximized by minimizing the calculated value of the cross-modal discriminant loss function; the generator is tuned and optimized by minimizing the difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function, the discriminator is tuned and optimized by maximizing that difference, and multiple iterations yield the final generator;
traversing the existing data information of the target modality and obtaining the representation features the generator produces for it;
and obtaining, based on similarity matching, the one or more pieces of existing target-modality data information most similar to the representation feature of the data information to be searched.
In some embodiments, obtaining, based on similarity matching, the one or more pieces of existing target-modality data information closest to the representation feature of the data information to be searched comprises:
calculating the L2 norm of the cross-modal match as the similarity, based on the representation feature of the data information to be searched and the representation features corresponding to the existing data information of the target modality:

$$sim\big(s^t_i, s^v_j\big) = \big\| s^t_i - s^v_j \big\|_2$$

where $s^t_i = G_t(x^t_i; \hat\theta_t)$ is the representation feature corresponding to the i-th piece of text modality information produced by the text modality generator with optimized parameter set $\hat\theta_t$, and $x^t_i$ is the original feature of the i-th piece of text modality information; $s^v_j = G_v(x^v_j; \hat\theta_v)$ is the representation feature corresponding to the j-th piece of image modality information produced by the image modality generator with optimized parameter set $\hat\theta_v$, and $x^v_j$ is the original feature of the j-th piece of image modality information; one of $s^t_i$ and $s^v_j$ is fixed as the representation feature of the data information to be searched in its modality, and the other runs over the representation features of the existing data in the target modality;

and sorting the existing data information by similarity to obtain the one or more pieces of existing target-modality data information with the highest similarity to the data information to be searched.
In another aspect, the present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method.
In another aspect, the present invention also provides a computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the above-mentioned method.
The beneficial effects of the invention are as follows: the generator maps text modality information and image modality information to representation features under a self-attention mechanism, extracting semantic features of cross-modal data content in social media within the same representation subspace; based on generative adversarial learning, supervision by the discriminator improves how accurately the representation features produced by the generator map to the corresponding topics, both within a modality and across modalities, while differentiating the distributions of the representation features of different-modality data information under the same topic. The method therefore adapts to the semantic sparsity of data information in social media and improves the accuracy of search between cross-modal data information.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to the specific details set forth above, and that these and other objects that can be achieved with the present invention will be more clearly understood from the detailed description that follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. For purposes of illustrating and describing some portions of the present invention, corresponding parts of the drawings may be exaggerated, i.e., may be larger, relative to other components in an exemplary apparatus actually manufactured according to the present invention. In the drawings:
FIG. 1 is a schematic flowchart illustrating a method for training a data feature generator in a cross-modal search of social media according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a logic structure of a training method for a data feature generator in a cross-modal search of social media according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram illustrating iterative optimization of a training method for a data feature generator in a cross-modal search of social media according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a social media cross-modality data information search method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.
It should be emphasized that the term "comprises/comprising" when used herein, is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
It is also noted herein that the term "coupled," if not specifically stated, may refer herein to not only a direct connection, but also an indirect connection in which an intermediate is present.
It should be noted that the "modality" mentioned in the present invention refers to the form of data information, and may include: text information, image information, audio information, or video information. "Topic" refers to the semantically directed content of data information, representing the matters discussed and focused on by each medium in social interaction; for example, a particular news topic contains the pieces of text information, image information, audio information, or video information associated with it.
Because data information in social media is semantically sparse (text modality data is short and irregular, and image modality information has low resolution and incomplete composition), cross-modal search of data information in social media is difficult to realize. In the prior art, cross-modal search in social media either fails to adapt to this semantic sparsity and so cannot achieve high-precision search, or requires an analysis process that is overly complex and hard to implement.
The invention provides a training and searching method for a data feature generator in cross-modal search of social media, which is used for extracting representation features of data information in the social media, realizing the search of the cross-modal data information in a similarity comparison mode, improving the precision of search matching, simplifying the search implementation process and improving the efficiency.
In one aspect, the present invention provides a training method for a data representation feature generator in social media cross-modal search, in which the generator used for extracting representation features of data information in social media is produced by adversarial learning. As shown in FIG. 1, the training method comprises steps S101 to S104. It should be noted that the step numbering does not impose an order: during training, steps S101 to S104 may in some cases be performed synchronously or in a different order.
step S101: obtaining a training sample set, the training sample set comprising: social media data information of multiple modalities, and topics to which the data information belongs and corresponding modalities are used as tags; wherein, the data information of the plurality of modalities comprises: text modality information and image modality information.
Step S102: based on the training sample set, the generator is used to obtain the representation feature of each piece of data information, the generator comprising: a text modality generator and an image modality generator, which are used for obtaining the original features of the data information in the corresponding modalities, segmenting each original feature into a plurality of corresponding local features, and, based on the local features, obtaining through a self-attention mechanism the representation features of the data information of each modality within the same representation subspace.
Step S103: supervising the generator adversarially by means of a discriminator, the loss functions employed by the discriminator comprising: a generation loss function obtained by weighted summation of an intra-modal semantic loss function and an inter-modal similarity loss function, and a cross-modal discriminant loss function; the distribution difference between the representation features and the corresponding topic labels is minimized by minimizing the calculated value of the intra-modal semantic loss function, the correlation between the representation features of different-modality data information under the same topic is maximized by minimizing the calculated value of the inter-modal similarity loss function, and the distinction between the representation features of different-modality data information with respect to modality is maximized by minimizing the calculated value of the cross-modal discriminant loss function.
Step S104: tuning parameters to optimize the generator by minimizing the difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function; tuning parameters to optimize the discriminator by maximizing that difference; and performing multiple iterations to obtain the final generator.
In step S101, data information in social media is taken as the data items, a sample training set is established for adversarial learning, and the topic and modality corresponding to each piece of data information are marked as labels. To reflect the characteristics of social media cross-modal search, the data information comprises at least the two forms of text modality information and image modality information; in other embodiments, to meet higher retrieval requirements, the data information may also comprise audio modality information and/or video information. The topic labels and modality labels corresponding to the data information may be marked in one-hot encoded form; in other embodiments, they may also be marked with other forms of label encoding as the situation requires. In some embodiments, the numbers of text modality information and image modality information in the sample training set are by default equal.
In step S102, as shown in FIG. 2, an adversarial learning approach is adopted, and the generator is used to produce the representation feature of each piece of data information for similarity comparison, thereby realizing cross-modal search. Specifically, because data of different modalities differ greatly in form, features extracted separately from different-modality data information are inconsistent in form, content, meaning, and standard; they do not lie in the same evaluation dimension, cannot be compared directly, and thus cannot be used to retrieve one another. Therefore, to search between data information of different modalities, it is necessary to obtain features of the same evaluation dimension, i.e., features within the same representation subspace, generated from the different-modality data information.
In the present embodiment, the generator first obtains the original features generated directly from each piece of data information; the form and extraction method of an original feature are determined by the modality of the corresponding data information. For example, text modality information may use its TF-IDF (term frequency-inverse document frequency) feature as the corresponding original feature, and image modality information may use a convolution feature (a VGGNet convolutional neural network feature) as the corresponding original feature. Following the principle of the self-attention mechanism, the original feature of each piece of text modality information or image modality information is divided into several local features. For a single piece of text or image modality information, the attention parameter of each local feature relative to the other local features is obtained; the products of the attention parameters and the local features, mapped into a uniform representation subspace, are accumulated to express the attention of each local feature and obtain its corresponding output feature. The feature vector formed by combining the output features of the local features serves as the final representation feature, whose semantics and modality can be reflected in a correlated manner.
Specifically, the training sample set contains data information for a plurality of topics, and the data information spans the two modalities of text and image. Define the training sample set as $C = \{t_1, t_2, \dots, t_M, v_1, v_2, \dots, v_N\}$, where $t_m$ denotes the m-th piece of text modality information and $v_n$ the n-th piece of image modality information; at the same time, the topic and modality corresponding to each piece of text modality information and image modality information are marked as labels in the training sample set. Topic labels and modality labels can be marked as one-hot encoded vectors: Q states are encoded with a Q-bit status register, each state having its own register bit, only one of which is active at a time, representing that state. For example, with 5 topic categories encoded by a 5-bit status register, data information belonging to the first category has the label [1,0,0,0,0]. The modality can be encoded with a 2-bit status register, so that [1,0] marks the text modality and [0,1] marks the image modality.
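As a concrete illustration of this encoding, the following minimal sketch builds such one-hot label vectors; the helper name and the use of NumPy are illustrative assumptions, not part of the invention:

    import numpy as np

    def one_hot(index: int, length: int) -> np.ndarray:
        """Q-bit status register with exactly one active bit (hypothetical helper)."""
        v = np.zeros(length, dtype=np.float32)
        v[index] = 1.0
        return v

    # 5 topic categories -> 5-bit register; the first category is [1,0,0,0,0].
    topic_label = one_hot(0, 5)

    # 2 modalities -> 2-bit register; [1,0] marks text, [0,1] marks image.
    text_modality_label = one_hot(0, 2)
    image_modality_label = one_hot(1, 2)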
In some embodiments, obtaining the original features of data information of multiple modalities comprises:
obtaining the TF-IDF feature of each piece of text modality information as its original feature and the convolution feature of each piece of image modality information as its original feature, and denoting the original features of the data information as $X = \{x^t_1, x^t_2, \dots, x^t_M, x^v_1, x^v_2, \dots, x^v_N\}$, where $x^t_m$ is the original feature of the m-th piece of text modality information, $x^v_n$ is the original feature of the n-th piece of image modality information, $1 \le m \le M$, $1 \le n \le N$, and M and N are positive integers.
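A hedged sketch of one way such original features could be extracted; the choice of scikit-learn for TF-IDF, torchvision's VGG16 for the convolution feature, the vocabulary size, and the layer cut are assumptions for illustration rather than requirements of the embodiment:

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from sklearn.feature_extraction.text import TfidfVectorizer

    def text_raw_features(posts):
        # posts: list of short social-media texts; vocabulary size is assumed
        vec = TfidfVectorizer(max_features=1024)
        return vec.fit_transform(posts).toarray()       # shape (M, 1024)

    vgg = models.vgg16(weights=None)    # pretrained weights would normally be loaded
    conv_trunk = vgg.features           # convolutional part of VGGNet only
    preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor()])

    def image_raw_features(pil_images):
        # pil_images: list of PIL.Image objects
        batch = torch.stack([preprocess(im) for im in pil_images])
        with torch.no_grad():
            fmap = conv_trunk(batch)                    # (N, 512, 7, 7)
        return fmap.flatten(1)                          # (N, 25088)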
In some embodiments, as shown in FIG. 2, each original feature is segmented to obtain a plurality of corresponding local features, and the representation features of the data information of each modality within the same representation subspace are obtained through a self-attention mechanism based on the local features. This comprises S201 to S207, where S202 to S204 generate the representation features of text modality information and S205 to S207 generate the representation features of image modality information:

S201: dividing the TF-IDF feature of each piece of text modality information and the convolution feature of each piece of image modality information into k blocks each, denoted: $x^t_m = \{b^t_{m,1}, b^t_{m,2}, \dots, b^t_{m,k}\}$ and $x^v_n = \{b^v_{n,1}, b^v_{n,2}, \dots, b^v_{n,k}\}$, where $b^t_{m,k}$ is the k-th block text semantic feature of the m-th piece of text modality information and $b^v_{n,k}$ is the k-th block image semantic feature of the n-th piece of image modality information.

S202: using functions $f_t$ and $g_t$ to convert the segmented text semantic features into features in the representation subspace:

$$f_t(b^t_{m,i}) = w^t_f\, b^t_{m,i}, \qquad g_t(b^t_{m,j}) = w^t_g\, b^t_{m,j}$$

where $w^t_f$ and $w^t_g$ are the parameter vectors of $f_t$ and $g_t$.

S203: calculating the attention parameter between the i-th block and the j-th block text semantic features of the m-th piece of text modality information as:

$$\alpha^t_{i,j} = \frac{\exp\big(f_t(b^t_{m,i})^{\top} g_t(b^t_{m,j})\big)}{\sum_{j=1}^{k}\exp\big(f_t(b^t_{m,i})^{\top} g_t(b^t_{m,j})\big)}$$

S204: calculating the output feature of the i-th block text semantic feature of the m-th piece of text modality information as:

$$o^t_{m,i} = \sum_{j=1}^{k} \alpha^t_{i,j}\, h_t(b^t_{m,j}), \qquad h_t(b^t_{m,j}) = w^t_h\, b^t_{m,j}$$

where $w^t_h$ is the parameter vector of $h_t$; the representation feature of the m-th piece of text modality information is output as: $s^t_m = \{o^t_{m,1}, o^t_{m,2}, \dots, o^t_{m,k}\}$.

S205: using functions $f_v$ and $g_v$ to convert the segmented image semantic features into features in the representation subspace:

$$f_v(b^v_{n,i}) = w^v_f\, b^v_{n,i}, \qquad g_v(b^v_{n,j}) = w^v_g\, b^v_{n,j}$$

where $w^v_f$ and $w^v_g$ are the parameter vectors of $f_v$ and $g_v$.

S206: calculating the attention parameter between the i-th block and the j-th block image semantic features of the n-th piece of image modality information as:

$$\alpha^v_{i,j} = \frac{\exp\big(f_v(b^v_{n,i})^{\top} g_v(b^v_{n,j})\big)}{\sum_{j=1}^{k}\exp\big(f_v(b^v_{n,i})^{\top} g_v(b^v_{n,j})\big)}$$

S207: calculating the output feature of the i-th block image semantic feature of the n-th piece of image modality information as:

$$o^v_{n,i} = \sum_{j=1}^{k} \alpha^v_{i,j}\, h_v(b^v_{n,j}), \qquad h_v(b^v_{n,j}) = w^v_h\, b^v_{n,j}$$

where $w^v_h$ is the parameter vector of $h_v$; the representation feature of the n-th piece of image modality information is: $s^v_n = \{o^v_{n,1}, o^v_{n,2}, \dots, o^v_{n,k}\}$.
In this embodiment, the representation features expressing the semantics of text modality information and image modality information are extracted through a self-attention mechanism with unified evaluation dimensions, so that the representation features of data information of different modalities lie in the same representation subspace, enabling cross-modal search of data information. The functions $f_t$, $g_t$ and $h_t$ convert each local feature of the original feature of text modality information into the representation subspace, and the functions $f_v$, $g_v$ and $h_v$ convert each local feature of the original feature of image modality information into the same representation subspace, unifying the evaluation dimensions. The parameter vectors corresponding to $f_t$, $g_t$, $h_t$ and $f_v$, $g_v$, $h_v$ are all iteratively updated to their optimal values under the supervision of the discriminator during adversarial learning.
In step S103, as shown in FIG. 2, optimization of the generator is achieved by having the discriminator and the generator form an adversarial learning system. Specifically, through supervision by the discriminator, the invention aims to adjust the generator so that the representation features generated from social media data information achieve the following effects: 1. the distribution difference between the representation features and the corresponding topic labels is minimized, i.e., the representation feature of each piece of data information is accurately associated with the topic it represents; 2. the relevance to the topic of the representation features of different-modality data information under the same topic is maximized, i.e., representation features of different-modality data information under the same topic converge from the semantic angle; 3. the distinction between the representation features of different-modality data information with respect to modality is strengthened, i.e., representation features of different-modality data information under the same topic tend to differentiate from the modality angle. For the representation features generated by the generator to achieve these three effects, supervised learning through the discriminator is required; specifically, this is accomplished by combining a generation loss function, obtained by weighted summation of the intra-modal semantic loss function and the inter-modal similarity loss function, with the cross-modal discriminant loss function.
In some embodiments, an intra-modal semantic loss function is used to minimize the distribution difference between the representation features and the corresponding topic labels; the intra-modal semantic loss function is:

$$L_{label}(\theta_t, \theta_v) = -\frac{1}{M}\sum_{i=1}^{M} y^t_i \cdot \log p\big(s^t_i\big) \;-\; \frac{1}{N}\sum_{j=1}^{N} y^v_j \cdot \log p\big(s^v_j\big)$$

where $y^t_i$ and $y^v_j$ respectively denote the one-hot topic label vectors of the i-th piece of text modality information and the j-th piece of image modality information in the training sample set, with $y^t_i = y^v_j$ under the same topic; $s^t_i = G_t(x^t_i; \theta_t)$ is the representation feature corresponding to the i-th piece of text modality information produced by the text modality generator with parameter set $\theta_t$, and $x^t_i$ is the original feature of the i-th piece of text modality information; $s^v_j = G_v(x^v_j; \theta_v)$ is the representation feature corresponding to the j-th piece of image modality information produced by the image modality generator with parameter set $\theta_v$, and $x^v_j$ is the original feature of the j-th piece of image modality information; M is the number of pieces of text modality information and N the number of pieces of image modality information in the training sample set; the function $p(\cdot)$ predicts the topic probability distribution of each text or image representation feature, processing the generated representation feature through a fully connected neural network into a vector of a dimension that can be multiplied with $y^t_i$ and/or $y^v_j$.
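A minimal sketch of computing $L_{label}$ as written above; modeling the fully connected network $p(\cdot)$ as a single linear layer followed by a softmax is an illustrative assumption:

    import numpy as np

    def predict_topics(s, W):
        # p(.): flatten a batch of representation features, map to topic probabilities
        logits = s.reshape(s.shape[0], -1) @ W
        logits -= logits.max(axis=1, keepdims=True)
        e = np.exp(logits)
        return e / e.sum(axis=1, keepdims=True)

    def intra_modal_semantic_loss(s_text, y_text, s_img, y_img, W):
        pt = predict_topics(s_text, W)   # (M, num_topics)
        pv = predict_topics(s_img, W)    # (N, num_topics)
        lt = -(y_text * np.log(pt + 1e-9)).sum(axis=1).mean()
        lv = -(y_img * np.log(pv + 1e-9)).sum(axis=1).mean()
        return lt + lv                   # L_label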
In some embodiments, an inter-modal similarity loss function is used to maximize the relevance to the topic of the representation features of different-modality data information under the same topic; the inter-modal similarity loss function is:

$$L_{similarity}(\theta_t, \theta_v) = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N} \big(y^t_i \cdot y^v_j\big)\, \big\| s^t_i - s^v_j \big\|_2$$

where $y^t_i$ and $y^v_j$ respectively denote the one-hot topic label vectors of the i-th piece of text modality information and the j-th piece of image modality information in the training sample set, with $y^t_i = y^v_j$ under the same topic; $s^t_i = G_t(x^t_i; \theta_t)$ is the representation feature corresponding to the i-th piece of text modality information produced by the text modality generator with parameter set $\theta_t$, and $x^t_i$ is the original feature of the i-th piece of text modality information; $s^v_j = G_v(x^v_j; \theta_v)$ is the representation feature corresponding to the j-th piece of image modality information produced by the image modality generator with parameter set $\theta_v$, and $x^v_j$ is the original feature of the j-th piece of image modality information; M is the number of pieces of text modality information and N the number of pieces of image modality information in the training sample set.
In some embodiments, the intra-modal semantic loss function and the inter-modal similarity loss function are summed with weights to obtain the generation loss function: $L_{generation} = \alpha L_{label} + \beta L_{similarity}$, where α and β are the weight coefficients of the intra-modal semantic loss function and the inter-modal similarity loss function, respectively.
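A sketch of $L_{similarity}$ and of the weighted generation loss under the formulation above; using the one-hot dot product $y^t_i \cdot y^v_j$ as the same-topic indicator and flattening each representation feature into a vector are assumptions:

    import numpy as np

    def inter_modal_similarity_loss(s_text, y_text, s_img, y_img):
        st = s_text.reshape(s_text.shape[0], -1)        # (M, d)
        sv = s_img.reshape(s_img.shape[0], -1)          # (N, d)
        same_topic = y_text @ y_img.T                   # (M, N); 1 iff same topic
        dist = np.linalg.norm(st[:, None, :] - sv[None, :, :], axis=2)
        return (same_topic * dist).sum() / (st.shape[0] * sv.shape[0])

    def generation_loss(l_label, l_similarity, alpha=1.0, beta=1.0):
        # the weight coefficients alpha and beta are tunable; values are placeholders
        return alpha * l_label + beta * l_similarity    # L_generation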
In some embodiments, the cross-modal discriminant loss function is used to strengthen the distinction with respect to modality between the representation features of different-modality data information under the same topic; the cross-modal discriminant loss function is:

$$L_{cross}(\theta_p) = -\frac{1}{E}\sum_{e=1}^{E}\Big( c_e \log D\big(s^t_e; \theta_p\big) + (1 - c_e) \log\big(1 - D(s^v_e; \theta_p)\big) \Big)$$

where $c_e$ is the modality label of the searched target data information in one-hot form; $s^t_e = G_t(x^t_e; \theta_t)$ is the representation feature corresponding to the e-th piece of text modality information produced by the text modality generator with parameter set $\theta_t$, and $x^t_e$ is the original feature of the e-th piece of text modality information; $s^v_e = G_v(x^v_e; \theta_v)$ is the representation feature corresponding to the e-th piece of image modality information produced by the image modality generator with parameter set $\theta_v$, and $x^v_e$ is the original feature of the e-th piece of image modality information; during training, text modality information and image modality information are input in pairs, and E is the number of data pairs; the function $D(\cdot; \theta_p)$, under the control of the parameter set $\theta_p$, maps the representation features of each piece of text modality information and each piece of image modality information into the same representation subspace.
In this embodiment, since the search process splits into two cases, searching images by text (the searched target data information is image modality information) and searching text by images (the searched target data information is text modality information), the modalities must be distinguished; this is the role of the cross-modal discriminant loss function.
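A sketch of the cross-modal discriminant loss under the reading above, with the modality label collapsed to a single bit (1 for text, 0 for image); modeling $D(\cdot; \theta_p)$ as a linear map with a sigmoid output is an illustrative assumption:

    import numpy as np

    def discriminate(s, theta_p):
        # D(.; theta_p): probability that a representation feature is text-modality
        return 1.0 / (1.0 + np.exp(-(s.reshape(s.shape[0], -1) @ theta_p)))

    def cross_modal_discriminant_loss(s_text, s_img, theta_p):
        d_t = discriminate(s_text, theta_p)             # should approach 1
        d_v = discriminate(s_img, theta_p)              # should approach 0
        return -(np.log(d_t + 1e-9) + np.log(1.0 - d_v + 1e-9)).mean()  # L_cross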
In step S104, as shown in FIG. 3, the generator is optimized with the discriminator fixed, and then the discriminator is optimized with the generator fixed; multiple iterations yield a better generator, realizing an effective and complete adversarial learning process.
Specifically, in this embodiment, the optimized parameter set $\hat\theta_t$ of the text modality generator and the optimized parameter set $\hat\theta_v$ of the image modality generator are obtained by minimizing the difference between the generation loss function and the cross-modal discriminant loss function, namely:

$$(\hat\theta_t, \hat\theta_v) = \arg\min_{\theta_t, \theta_v} \big( L_{generation} - L_{cross} \big)$$

and the optimized parameter set $\hat\theta_p$ of the discriminator is obtained by maximizing the difference between the generation loss function and the cross-modal discriminant loss function, namely:

$$\hat\theta_p = \arg\max_{\theta_p} \big( L_{generation} - L_{cross} \big)$$
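A schematic PyTorch-style sketch of this alternating optimization; the optimizer choice, learning rates, epoch count, and the hypothetical compute_losses helper (which would wire together the generator, discriminator, and the loss functions above) are assumptions:

    import torch

    def train(generator, discriminator, loader, compute_losses, epochs=10):
        """compute_losses(g, d, batch) -> (L_generation, L_cross); hypothetical helper."""
        opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
        opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
        for _ in range(epochs):
            for batch in loader:
                # fix the discriminator, tune the generator:
                # minimize L_generation - L_cross
                l_gen, l_cross = compute_losses(generator, discriminator, batch)
                opt_g.zero_grad()
                (l_gen - l_cross).backward()
                opt_g.step()
                # fix the generator, tune the discriminator:
                # maximize L_generation - L_cross, i.e. minimize its negative
                l_gen, l_cross = compute_losses(generator, discriminator, batch)
                opt_d.zero_grad()
                (l_cross - l_gen).backward()
                opt_d.step()
        return generator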
On the other hand, the invention also provides a social media cross-modal data information searching method, as shown in FIG. 4, comprising steps S301 to S303:
step S301: and inputting the data information to be searched into a generator to obtain the representation characteristics of the data information to be searched.
Wherein the generator is obtained by adversarial learning based on a training sample set; the training sample set comprises social media data information of multiple modalities, with the topic to which each piece of data information belongs and its corresponding modality as labels, the data information of multiple modalities comprising text modality information and image modality information; the generator comprises a text modality generator and an image modality generator, which are used for obtaining the original features of the data information in the corresponding modalities, segmenting each original feature into a plurality of corresponding local features, and, based on the local features, obtaining through a self-attention mechanism the representation features of the data information of each modality within the same representation subspace; the generator is supervised adversarially by means of a discriminator, the loss functions employed by the discriminator comprising a generation loss function obtained by weighted summation of an intra-modal semantic loss function and an inter-modal similarity loss function, and a cross-modal discriminant loss function, wherein the distribution difference between the representation features and the corresponding topic labels is minimized by minimizing the calculated value of the intra-modal semantic loss function, the correlation between the representation features of different-modality data information under the same topic is maximized by minimizing the calculated value of the inter-modal similarity loss function, and the distinction between the representation features of different-modality data information with respect to modality is maximized by minimizing the calculated value of the cross-modal discriminant loss function; the generator is tuned and optimized by minimizing the difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function, the discriminator is tuned and optimized by maximizing that difference, and multiple iterations yield the final generator.
Step S302: traversing the existing data information of the target modality and obtaining, for each piece of existing data information, the representation feature generated by the same generator.
Step S303: obtaining, based on similarity matching, the one or more pieces of existing target-modality data information most similar to the representation feature of the data information to be searched.
Based on the same inventive concept as steps S101 to S104, in step S301 of this embodiment the representation feature of the data information to be searched is obtained using a generator produced by the above training method for the data representation feature generator in social media cross-modal search. In step S302, the existing data is traversed to obtain the representation feature of each piece of existing data generated by the generator of step S301. In step S303, the one or more pieces of closest existing target-modality data information are obtained by similarity matching.
In some embodiments, step S303, i.e., obtaining based on similarity matching the one or more pieces of existing target-modality data information closest to the representation feature of the data information to be searched, comprises S3031 to S3032:

S3031: based on the representation feature of the data information to be searched and the representation features corresponding to the existing data information of the target modality, calculating the L2 norm of the cross-modal match as the similarity:

$$sim\big(s^t_i, s^v_j\big) = \big\| s^t_i - s^v_j \big\|_2 \tag{14}$$

where $s^t_i = G_t(x^t_i; \hat\theta_t)$ is the representation feature corresponding to the i-th piece of text modality information produced by the text modality generator with optimized parameter set $\hat\theta_t$, and $x^t_i$ is the original feature of the i-th piece of text modality information; $s^v_j = G_v(x^v_j; \hat\theta_v)$ is the representation feature corresponding to the j-th piece of image modality information produced by the image modality generator with optimized parameter set $\hat\theta_v$, and $x^v_j$ is the original feature of the j-th piece of image modality information; one of $s^t_i$ and $s^v_j$ is fixed as the representation feature of the data information to be searched in its modality, and the other runs over the representation features of the existing data in the target modality.
S3032: sorting the existing data information by similarity, and obtaining the one or more pieces of existing target-modality data information with the highest similarity to the data information to be searched.
In the present embodiment, when searching for image modality information based on text modality information, $s^t_i$ is fixed as the representation feature of the data information to be searched; the image modality information among the existing data information is traversed, the similarity is calculated with formula (14), and the image modality information among the existing data information is ranked by similarity to obtain the one or more pieces of image modality information with the highest similarity. Likewise, when searching for text modality information based on image modality information, $s^v_j$ is fixed as the representation feature of the data information to be searched; the text modality information among the existing data information is traversed, the similarity is calculated with formula (14), and the text modality information among the existing data information is ranked by similarity to obtain the one or more pieces of text modality information with the highest similarity. The smaller sim is, the higher the similarity.
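A sketch of this search phase as a ranking routine; the generator call signatures and the top_k parameter are illustrative assumptions:

    import numpy as np

    def cross_modal_search(query_raw, corpus_raw, query_generator,
                           target_generator, top_k=5):
        q = query_generator(query_raw).reshape(-1)      # fixed query feature
        feats = [target_generator(x).reshape(-1) for x in corpus_raw]
        sims = [np.linalg.norm(q - f) for f in feats]   # sim = ||s_q - s_x||_2
        order = np.argsort(sims)                        # ascending: best match first
        return [(int(i), float(sims[i])) for i in order[:top_k]]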
In another aspect, the present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method.
In another aspect, the present invention also provides a computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the above-mentioned method.
In summary, the training and searching method for the data feature generator in social media cross-modal search according to the present invention realizes search between cross-modal data information in social media by means of adversarial learning, with emphasis on cross-modal content search between text modality information and image modality information. The adversarially learned generator reconstructs the original features of different-modality data information in social media on the basis of a self-attention mechanism and maps them into a representation subspace in which they can be compared directly, realizing cross-modal data information search. Further, a joint loss function is established through the discriminator: the intra-modal semantic loss function and the inter-modal similarity loss function guide the generated representation features to follow the semantic distribution of the corresponding modality, and the cross-modal discriminant loss function realizes the discrimination of modality. The method adapts to the semantic sparsity of data information in social media, completes accurate, efficient and stable search of cross-modal data information, and greatly improves efficiency compared with the prior art.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein may be implemented as hardware, software, or combinations of both. Whether this is done in hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
It should also be noted that the exemplary embodiments in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps; that is, the steps may be performed in the order mentioned in the embodiments, in a different order, or simultaneously.
Features described and/or illustrated for one embodiment may be used in the same or a similar way in one or more other embodiments, and/or in combination with or in place of the features of other embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A training method for a data representation feature generator in a social media cross-modal search is characterized by comprising the following steps:
obtaining a training sample set, the training sample set comprising: social media data information of multiple modalities, with the topic to which each piece of data information belongs and its corresponding modality as labels; wherein the data information of the multiple modalities comprises: text modality information and image modality information;
obtaining, with a generator, the representation features of each piece of data information based on the training sample set, the generator comprising: a text modality generator and an image modality generator, configured to obtain the original features of the data information in the corresponding modalities, segment each original feature into a plurality of corresponding local features, and obtain, through a self-attention mechanism based on the local features, the representation features of the data information of each modality in the same representation subspace;
performing adversarial supervision of the generator by means of a discriminator, the loss functions employed by the discriminator comprising: a generation loss function obtained by the weighted summation of an intra-modal semantic loss function and an inter-modal similarity loss function, and a cross-modal discriminant loss function; wherein the distribution difference between the representation features and the corresponding topic labels is minimized by minimizing the calculated value of the intra-modal semantic loss function, the correlation between the representation features of data information of different modalities under the same topic is maximized by minimizing the calculated value of the inter-modal similarity loss function, and the distinction regarding modality between the representation features of data information of different modalities is maximized by minimizing the calculated value of the cross-modal discriminant loss function;
optimizing the parameters of the generator by minimizing the difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function; optimizing the parameters of the discriminator by maximizing the difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function; and performing multiple iterations to obtain the final generator.
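As an illustrative sketch (not part of the claims) of this alternating optimization, assuming PyTorch modules for the generator and discriminator; `generation_loss` and `discriminant_loss` are assumed placeholder callables with simplified signatures, whose concrete forms could follow the sketches given after claims 4 to 6 below:

```python
import torch

def train_round(generator, discriminator, g_opt, d_opt, batch, alpha, beta):
    """One round of the minimax game: the generator minimizes
    L_generation - L_adv; the discriminator maximizes that difference."""
    x_t, x_v, labels = batch

    # Generator step: minimize L_generation - L_adv.
    s_t, s_v = generator(x_t, x_v)              # representation features
    l_gen = generation_loss(s_t, s_v, labels)   # alpha/beta-weighted sum
    l_adv = discriminant_loss(discriminator, s_t, s_v)
    g_opt.zero_grad()
    (l_gen - l_adv).backward()
    g_opt.step()

    # Discriminator step: maximize L_generation - L_adv. Since
    # L_generation does not depend on the discriminator's parameters,
    # this reduces to minimizing L_adv.
    with torch.no_grad():
        s_t, s_v = generator(x_t, x_v)          # frozen generator outputs
    l_adv = discriminant_loss(discriminator, s_t, s_v)
    d_opt.zero_grad()
    l_adv.backward()
    d_opt.step()
```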
2. The method for training a data representation feature generator in social media cross-modal search according to claim 1, wherein obtaining original features of the data information in corresponding modalities comprises:
obtaining the TF-IDF features of the text modality information as its original features and the convolutional features of the image modality information as its original features, and recording the original features of all data information as $X = \{x_t^1, x_t^2, \ldots, x_t^M, x_v^1, x_v^2, \ldots, x_v^N\}$, where $x_t^m$ is the original feature of the m-th piece of text modality information, $x_v^n$ is the original feature of the n-th piece of image modality information, $1 \le m \le M$, $1 \le n \le N$, and M and N are positive integers.
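As an illustrative sketch (not part of the claims) of obtaining the original features; the sample texts, feature dimensions, and the CNN stub are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

texts = ["flood warning issued downtown", "concert photos from last night"]

# TF-IDF features as the original features x_t^1..x_t^M of the text modality.
x_t = TfidfVectorizer().fit_transform(texts).toarray()

# The original features x_v^1..x_v^N of the image modality would be
# convolutional features from a pretrained CNN backbone; stubbed here
# with random vectors for the sake of a runnable example.
x_v = np.random.rand(3, 512)

print(x_t.shape, x_v.shape)  # M text items, N image items
```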
3. The training method of the data representation feature generator in the social media cross-modal search according to claim 2, wherein the step of segmenting each original feature to obtain a plurality of corresponding local features, and obtaining the representation features of the data information of each modality in the same representation subspace through a self-attention mechanism based on the local features comprises the steps of:
dividing the TF-IDF feature of each piece of text modality information and the convolutional feature of each piece of image modality information into k blocks, recorded as $x_t^m = \{b_t^{m,1}, b_t^{m,2}, \ldots, b_t^{m,k}\}$ and $x_v^n = \{b_v^{n,1}, b_v^{n,2}, \ldots, b_v^{n,k}\}$, where $b_t^{m,k}$ is the k-th block of text semantic features of the m-th piece of text modality information and $b_v^{n,k}$ is the k-th block of image semantic features of the n-th piece of image modality information;
using functions $f_t$ and $g_t$ to convert the segmented text semantic features into features of the representation subspace:
$$f_t(b_t^{m,i}) = w_t^f\, b_t^{m,i}, \qquad g_t(b_t^{m,j}) = w_t^g\, b_t^{m,j}$$
wherein $w_t^f$ and $w_t^g$ are the parameter vectors of $f_t$ and $g_t$;
the attention parameter between the i-th block and the j-th block of text semantic features of the m-th piece of text modality information is:
$$\beta_t^{j,i} = \frac{\exp\big(f_t(b_t^{m,i})^{\top} g_t(b_t^{m,j})\big)}{\sum_{j=1}^{k} \exp\big(f_t(b_t^{m,i})^{\top} g_t(b_t^{m,j})\big)}$$
the output feature of the i-th block of text semantic features of the m-th piece of text modality information is:
$$o_t^{m,i} = \sum_{j=1}^{k} \beta_t^{j,i}\, h_t(b_t^{m,j})$$
wherein
$$h_t(b_t^{m,j}) = w_t^h\, b_t^{m,j}$$
and $w_t^h$ is the parameter vector of $h_t$;
the representation feature of the m-th piece of text modality information is: $s_t^m = \{o_t^{m,1}, o_t^{m,2}, \ldots, o_t^{m,k}\}$;
using functions $f_v$ and $g_v$ to convert the segmented image semantic features into features of the representation subspace:
$$f_v(b_v^{n,i}) = w_v^f\, b_v^{n,i}, \qquad g_v(b_v^{n,j}) = w_v^g\, b_v^{n,j}$$
wherein $w_v^f$ and $w_v^g$ are the parameter vectors of $f_v$ and $g_v$;
the attention parameter between the i-th block and the j-th block of image semantic features of the n-th piece of image modality information is:
$$\beta_v^{j,i} = \frac{\exp\big(f_v(b_v^{n,i})^{\top} g_v(b_v^{n,j})\big)}{\sum_{j=1}^{k} \exp\big(f_v(b_v^{n,i})^{\top} g_v(b_v^{n,j})\big)}$$
the output feature of the i-th block of image semantic features of the n-th piece of image modality information is:
$$o_v^{n,i} = \sum_{j=1}^{k} \beta_v^{j,i}\, h_v(b_v^{n,j})$$
wherein
$$h_v(b_v^{n,j}) = w_v^h\, b_v^{n,j}$$
and $w_v^h$ is the parameter vector of $h_v$;
the representation feature of the n-th piece of image modality information is: $s_v^n = \{o_v^{n,1}, o_v^{n,2}, \ldots, o_v^{n,k}\}$.
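As an illustrative NumPy sketch (not part of the claims) of this block-wise self-attention, assuming linear projections for f, g, and h; the matrix shapes are assumptions:

```python
import numpy as np

def self_attention(blocks, w_f, w_g, w_h):
    """blocks: (k, d) local features b^1..b^k of one item;
    w_f, w_g, w_h: (d, d) parameter matrices of f, g, h.
    Returns the representation feature s = {o^1, ..., o^k} as (k, d)."""
    f = blocks @ w_f                      # f(b_i)
    g = blocks @ w_g                      # g(b_j)
    h = blocks @ w_h                      # h(b_j)
    scores = f @ g.T                      # pairwise compatibilities
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    beta = np.exp(scores)
    beta /= beta.sum(axis=1, keepdims=True)           # attention beta_{j,i}
    return beta @ h                       # o_i = sum_j beta_{j,i} h(b_j)

rng = np.random.default_rng(0)
k, d = 8, 64
s_m = self_attention(rng.standard_normal((k, d)),
                     rng.standard_normal((d, d)),
                     rng.standard_normal((d, d)),
                     rng.standard_normal((d, d)))
print(s_m.shape)  # (8, 64): representation feature of one item
```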
4. The method for training a data representation feature generator in social media cross-modal search according to claim 1, wherein the intra-modal semantic loss function is:
$$L_{label} = -\frac{1}{M}\sum_{i=1}^{M} y_t^i \log p\big(G_t(x_t^i;\theta_t)\big) - \frac{1}{N}\sum_{j=1}^{N} y_v^j \log p\big(G_v(x_v^j;\theta_v)\big)$$
wherein $y_t^i$ and $y_v^j$ respectively denote the one-hot topic label vectors of the i-th piece of text modality information and the j-th piece of image modality information under the same topic in the training sample set; $G_t(x_t^i;\theta_t)$ is the representation feature corresponding to the i-th piece of text modality information produced by the text modality generator with parameter set $\theta_t$, and $x_t^i$ is the original feature of the i-th piece of text modality information; $G_v(x_v^j;\theta_v)$ is the representation feature corresponding to the j-th piece of image modality information produced by the image modality generator with parameter set $\theta_v$, and $x_v^j$ is the original feature of the j-th piece of image modality information; M is the number of pieces of text modality information in the training sample set, and N is the number of pieces of image modality information in the training sample set; the function $p(\cdot)$ processes $G_t(x_t^i;\theta_t)$ and $G_v(x_v^j;\theta_v)$ through a fully connected neural network into a dimension that can be multiplied with $y_t^i$ and/or $y_v^j$.
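As an illustrative sketch (not part of the claims), one cross-entropy realization of the intra-modal semantic loss consistent with the description above; `p_t` and `p_v` are assumed to be the outputs of the fully connected projection p(.) for the text and image representation features:

```python
import torch
import torch.nn.functional as F

def intra_modal_loss(p_t, p_v, y_t, y_v):
    """p_t: (M, C) projected text features p(G_t(x_t; theta_t));
    p_v: (N, C) projected image features;
    y_t, y_v: one-hot topic label matrices of matching shape."""
    l_text = -(y_t * F.log_softmax(p_t, dim=1)).sum(dim=1).mean()
    l_image = -(y_v * F.log_softmax(p_v, dim=1)).sum(dim=1).mean()
    return l_text + l_image
```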
5. The method for training a data representation feature generator in a social media cross-modality search according to claim 4, wherein the inter-modality similarity loss function is:
$$L_{similarity} = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N} \big\| G_t(x_t^i;\theta_t) - G_v(x_v^j;\theta_v) \big\|_2$$
wherein the summation runs over text-image pairs under the same topic; $y_t^i$ and $y_v^j$ respectively denote the one-hot topic label vectors of the i-th piece of text modality information and the j-th piece of image modality information under the same topic in the training sample set; $G_t(x_t^i;\theta_t)$ is the representation feature corresponding to the i-th piece of text modality information produced by the text modality generator with parameter set $\theta_t$, and $x_t^i$ is the original feature of the i-th piece of text modality information; $G_v(x_v^j;\theta_v)$ is the representation feature corresponding to the j-th piece of image modality information produced by the image modality generator with parameter set $\theta_v$, and $x_v^j$ is the original feature of the j-th piece of image modality information; M is the number of pieces of text modality information in the training sample set, and N is the number of pieces of image modality information in the training sample set;
the generation loss function is: $L_{generation} = \alpha L_{label} + \beta L_{similarity}$, where $\alpha$ and $\beta$ are the weight coefficients of the intra-modal semantic loss function and the inter-modal similarity loss function, respectively.
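As an illustrative sketch (not part of the claims), the inter-modal similarity loss and the weighted generation loss might be realized as follows; `same_topic` is an assumed boolean mask marking text-image pairs that share a topic:

```python
import torch

def inter_modal_loss(s_t, s_v, same_topic):
    """Mean L2 distance between text and image representation features
    over text-image pairs under the same topic; same_topic is an
    (M, N) boolean mask."""
    dists = torch.cdist(s_t, s_v, p=2)   # (M, N) pairwise L2 norms
    return dists[same_topic].mean()

# L_generation = alpha * L_label + beta * L_similarity, e.g.:
# l_generation = alpha * l_label + beta * inter_modal_loss(s_t, s_v, mask)
```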
6. The method for training a data representation feature generator in social media cross-modal search according to claim 5, wherein the cross-modal discriminant loss function is:
$$L_{adv} = -\frac{1}{E}\sum_{e=1}^{E}\Big( c_e \log D\big(G_t(x_t^e;\theta_t);\theta_p\big) + (1-c_e)\log\big(1 - D\big(G_v(x_v^e;\theta_v);\theta_p\big)\big)\Big)$$
wherein $c_e$ is the one-hot modality label of the searched target data information; $G_t(x_t^e;\theta_t)$ is the representation feature corresponding to the e-th piece of text modality information produced by the text modality generator with parameter set $\theta_t$, and $x_t^e$ is the original feature of the e-th piece of text modality information; $G_v(x_v^e;\theta_v)$ is the representation feature corresponding to the e-th piece of image modality information produced by the image modality generator with parameter set $\theta_v$, and $x_v^e$ is the original feature of the e-th piece of image modality information; during training, text modality information and image modality information are input in pairs, and E is the number of data pairs; the function $D(\cdot;\theta_p)$, under the control of the parameter set $\theta_p$, converts the representation features of each piece of text modality information and each piece of image modality information into the same representation subspace.
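As an illustrative sketch (not part of the claims), a standard binary cross-entropy realization of the cross-modal discriminant loss consistent with the description above, assuming a discriminator `disc` that outputs the probability that a representation feature comes from the text modality:

```python
import torch

def cross_modal_discriminant_loss(disc, s_t, s_v):
    """Binary cross-entropy on the discriminator's modality prediction
    for paired text/image representation features."""
    eps = 1e-8
    d_t = disc(s_t).clamp(eps, 1 - eps)   # target: text modality (1)
    d_v = disc(s_v).clamp(eps, 1 - eps)   # target: image modality (0)
    return -(torch.log(d_t) + torch.log(1 - d_v)).mean()
```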
7. A social media cross-modal data information search method is characterized by comprising the following steps:
inputting data information to be searched into a generator to obtain representation characteristics of the data information to be searched;
wherein the generator is obtained through adversarial learning based on a training sample set; the training sample set comprises: social media data information of multiple modalities, with the topic to which each piece of data information belongs and its corresponding modality as labels; wherein the data information of the multiple modalities comprises: text modality information and image modality information; the generator comprises: a text modality generator and an image modality generator, configured to obtain the original features of the data information in the corresponding modalities, segment each original feature into a plurality of corresponding local features, and obtain, through a self-attention mechanism based on the local features, the representation features of the data information of each modality in the same representation subspace; the generator is adversarially supervised by means of a discriminator, the loss functions employed by the discriminator comprising: a generation loss function obtained by the weighted summation of an intra-modal semantic loss function and an inter-modal similarity loss function, and a cross-modal discriminant loss function; wherein the distribution difference between the representation features and the corresponding topic labels is minimized by minimizing the calculated value of the intra-modal semantic loss function, the correlation between the representation features of data information of different modalities under the same topic is maximized by minimizing the calculated value of the inter-modal similarity loss function, and the distinction regarding modality between the representation features of data information of different modalities is maximized by minimizing the calculated value of the cross-modal discriminant loss function; the parameters of the generator are optimized by minimizing the difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function; the parameters of the discriminator are optimized by maximizing that difference; and the final generator is obtained after multiple iterations;
traversing the existing data information of the target modality and obtaining the representation features of the existing data information generated by the generator;
and obtaining, based on similarity matching, one or more pieces of existing data information of the target modality that are most similar to the representation features of the data information to be searched.
8. The method according to claim 7, wherein the obtaining of the existing data information of one or more target modalities closest to the representation features of the data information to be searched based on similarity matching comprises:
based on the representation features of the data information to be searched and the representation features corresponding to the existing data information of the target modality, calculating an L2 norm of cross-modality matching as a similarity:
$$sim = \big\| G_t(x_t^i;\theta_t) - G_v(x_v^j;\theta_v) \big\|_2$$
wherein $G_t(x_t^i;\theta_t)$ is the representation feature corresponding to the i-th piece of text modality information produced by the text modality generator with parameter set $\theta_t$, and $x_t^i$ is the original feature of the i-th piece of text modality information; $G_v(x_v^j;\theta_v)$ is the representation feature corresponding to the j-th piece of image modality information produced by the image modality generator with parameter set $\theta_v$, and $x_v^j$ is the original feature of the j-th piece of image modality information; one of $G_t(x_t^i;\theta_t)$ and $G_v(x_v^j;\theta_v)$ is fixed as the representation feature of the data information to be searched in its modality, while the other traverses the representation features of each piece of existing data in the target modality;
and sorting the existing data information by similarity to obtain one or more pieces of existing data information of the target modality with the highest similarity to the data information to be searched.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 8 are implemented when the processor executes the program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.