CN111598712B - Training and searching method for data feature generator in social media cross-modal search - Google Patents


Info

Publication number
CN111598712B
Authority
CN
China
Prior art keywords
information
modal
generator
representation
data information
Prior art date
Legal status
Active
Application number
CN202010418678.7A
Other languages
Chinese (zh)
Other versions
CN111598712A (en)
Inventor
杜军平
周南
崔婉秋
寇菲菲
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202010418678.7A priority Critical patent/CN111598712B/en
Publication of CN111598712A publication Critical patent/CN111598712A/en
Application granted granted Critical
Publication of CN111598712B publication Critical patent/CN111598712B/en

Classifications

    • G06Q50/01: Information and communication technology [ICT] specially adapted for social networking
    • G06F16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F16/36: Information retrieval of unstructured textual data; creation of semantic tools, e.g. ontology or thesauri
    • G06N3/045: Neural network architectures, e.g. interconnection topology; combinations of networks
    • G06N3/08: Neural network learning methods
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a training and searching method for a data feature generator in social media cross-modal search. The training method comprises: obtaining a training sample set; based on the training sample set, obtaining the representation features of each piece of data information with a generator trained by adversarial learning; supervising the generator adversarially through a discriminator; alternately fixing the discriminator to adjust parameters and optimize the generator, and fixing the generator to adjust parameters and optimize the discriminator; and iterating multiple times to obtain the final generator. The searching method comprises: inputting the data information to be searched into the generator to obtain its representation features; traversing the existing data information of the target modality and obtaining the representation features generated for it by the generator; and obtaining, based on similarity matching, the existing data information of the one or more target modalities most similar to the representation features of the data information to be searched. The method adapts to the semantic sparsity of data information in social media and realizes accurate search of cross-modal data information.

Description

Training and searching method for data feature generator in social media cross-modal search
Technical Field
The invention relates to the technical field of data search, in particular to a training and searching method for a data feature generator in social media cross-modal search.
Background
The premise of searching cross-modal data content in a social network is mining search features from the social network data, for which two strategies are mainly adopted: manual search-feature analysis and mining, and search-feature mining based on machine learning. Social media carry a huge volume of data, and their texts are short and irregular, giving rise to semantic sparsity; likewise, images in social networks are low-resolution and incompletely composed, causing a semantic-sparsity problem similar to that of social network text. Given these characteristics, manual search-feature analysis cannot keep up with the huge data volume in a social network, and existing machine learning methods struggle to extract features from semantically sparse text or images. It is therefore difficult to search across data content of different modalities.
Disclosure of Invention
In view of this, embodiments of the present invention provide a training and searching method for a data feature generator in social media cross-modal search, to solve the problem in the prior art that data information in social media cannot be searched across modalities.
The technical scheme of the invention is as follows:
in one aspect, the present invention provides a training method for a data representation feature generator in a cross-modal search of social media, including:
obtaining a training sample set, the training sample set comprising: social media data information of a plurality of modalities, with the topic to which each piece of data information belongs and its modality as labels; wherein the data information of the plurality of modalities comprises: text modality information and image modality information;

obtaining, with a generator, the representation features of each piece of data information based on the training sample set, the generator comprising: a text modality generator and an image modality generator, which are used to obtain the original features of the data information in the corresponding modalities, segment each original feature into a plurality of corresponding local features, and obtain, based on the local features and through a self-attention mechanism, the representation features of the data information of each modality in the same representation subspace;

supervising the generator adversarially by means of a discriminator, the loss functions employed by the discriminator comprising: a generation loss function, obtained as the weighted sum of an intra-modal semantic loss function and an inter-modal similarity loss function, and a cross-modal discriminant loss function; wherein the distribution difference between the representation features and the corresponding topic labels is minimized by minimizing the value of the intra-modal semantic loss function, the correlation between the representation features of different-modality data information under the same topic is maximized by minimizing the value of the inter-modal similarity loss function, and the modality distinction between the representation features of different-modality data information is maximized by minimizing the value of the cross-modal discriminant loss function;

adjusting parameters to optimize the generator by minimizing the difference between the value of the generation loss function and the value of the cross-modal discriminant loss function; adjusting parameters to optimize the discriminator by maximizing the difference between the value of the generation loss function and the value of the cross-modal discriminant loss function; and iterating multiple times to obtain the final generator.
In some embodiments, obtaining the original features of the data information of each of the plurality of modalities includes:
obtaining the TF-IDF feature of each piece of text modality information as its original feature and the convolution feature of each piece of image modality information as its original feature, and recording the original features of the data information as X = {x_t^1, x_t^2, …, x_t^M, x_v^1, x_v^2, …, x_v^N}, where x_t^m is the original feature of the m-th piece of text modality information, x_v^n is the original feature of the n-th piece of image modality information, 1 ≤ m ≤ M, 1 ≤ n ≤ N, and M and N are positive integers.
In some embodiments, segmenting each original feature into a plurality of corresponding local features and obtaining, through a self-attention mechanism based on the local features, the representation features of the data information of each modality in the same representation subspace includes:

dividing the TF-IDF feature of each piece of text modality information and the convolution feature of each piece of image modality information into k blocks, recorded as x_t^m = {b_t^{m,1}, b_t^{m,2}, …, b_t^{m,k}} and x_v^n = {b_v^{n,1}, b_v^{n,2}, …, b_v^{n,k}}, where b_t^{m,i} is the i-th block of text semantic features of the m-th piece of text modality information and b_v^{n,i} is the i-th block of image semantic features of the n-th piece of image modality information;

using functions f_t and g_t to convert the segmented text semantic features into the representation subspace:

    f_t(b_t^{m,i}) = w_t^f · b_t^{m,i},    g_t(b_t^{m,j}) = w_t^g · b_t^{m,j}

where w_t^f and w_t^g are the parameter vectors of f_t and g_t;

the attention parameter between the i-th and the j-th block of text semantic features of the m-th piece of text modality information is:

    α_t^m(i,j) = exp( f_t(b_t^{m,i}) · g_t(b_t^{m,j}) ) / Σ_{j'=1}^{k} exp( f_t(b_t^{m,i}) · g_t(b_t^{m,j'}) )

the output feature of the i-th block of text semantic features of the m-th piece of text modality information is:

    o_t^{m,i} = Σ_{j=1}^{k} α_t^m(i,j) · h_t(b_t^{m,j}),    with h_t(b_t^{m,j}) = w_t^h · b_t^{m,j}

where w_t^h is the parameter vector of h_t;

the representation feature of the m-th piece of text modality information is: S_t^m = {o_t^{m,1}, o_t^{m,2}, …, o_t^{m,k}};

using functions f_v and g_v to convert the segmented image semantic features into the representation subspace:

    f_v(b_v^{n,i}) = w_v^f · b_v^{n,i},    g_v(b_v^{n,j}) = w_v^g · b_v^{n,j}

where w_v^f and w_v^g are the parameter vectors of f_v and g_v;

the attention parameter between the i-th and the j-th block of image semantic features of the n-th piece of image modality information is:

    α_v^n(i,j) = exp( f_v(b_v^{n,i}) · g_v(b_v^{n,j}) ) / Σ_{j'=1}^{k} exp( f_v(b_v^{n,i}) · g_v(b_v^{n,j'}) )

the output feature of the i-th block of image semantic features of the n-th piece of image modality information is:

    o_v^{n,i} = Σ_{j=1}^{k} α_v^n(i,j) · h_v(b_v^{n,j}),    with h_v(b_v^{n,j}) = w_v^h · b_v^{n,j}

where w_v^h is the parameter vector of h_v;

the representation feature of the n-th piece of image modality information is: S_v^n = {o_v^{n,1}, o_v^{n,2}, …, o_v^{n,k}}.
In some embodiments, the intra-modal semantic loss function is:

    L_label = -(1/M) Σ_{i=1}^{M} y_t^i · log p(G_t(x_t^i; θ_t)) - (1/N) Σ_{j=1}^{N} y_v^j · log p(G_v(x_v^j; θ_v))

where y_t^i and y_v^j are the one-hot topic label vectors of the i-th piece of text modality information and the j-th piece of image modality information in the training sample set under the same topic; G_t(x_t^i; θ_t) is the representation feature of the i-th piece of text modality information produced by the text modality generator with parameter set θ_t, and x_t^i is its original feature; G_v(x_v^j; θ_v) is the representation feature of the j-th piece of image modality information produced by the image modality generator with parameter set θ_v, and x_v^j is its original feature; M is the number of pieces of text modality information and N the number of pieces of image modality information in the training sample set; and the function p(·) processes G_t(x_t^i; θ_t) and G_v(x_v^j; θ_v) through a fully connected neural network into distributions whose dimension allows multiplication with y_t^i and y_v^j.
In some embodiments, the inter-modal similarity loss function is:

    L_similarity = (1/(M·N)) Σ_{i=1}^{M} Σ_{j=1}^{N} (y_t^i · y_v^j) · || G_t(x_t^i; θ_t) - G_v(x_v^j; θ_v) ||_2

where y_t^i and y_v^j are the one-hot topic label vectors of the i-th piece of text modality information and the j-th piece of image modality information in the training sample set under the same topic; G_t(x_t^i; θ_t) is the representation feature of the i-th piece of text modality information produced by the text modality generator with parameter set θ_t, and x_t^i is its original feature; G_v(x_v^j; θ_v) is the representation feature of the j-th piece of image modality information produced by the image modality generator with parameter set θ_v, and x_v^j is its original feature; M is the number of pieces of text modality information and N the number of pieces of image modality information in the training sample set.

The generation loss function is: L_generation = α · L_label + β · L_similarity, where α and β are the weight coefficients of the intra-modal semantic loss function and the inter-modal similarity loss function, respectively.
In some embodiments, the cross-modal discriminant loss function is:

    L_adv = -(1/E) Σ_{e=1}^{E} c_e · ( log D(G_t(x_t^e; θ_t); θ_p) + log(1 - D(G_v(x_v^e; θ_v); θ_p)) )

where c_e is the one-hot modality label of the searched target data information; G_t(x_t^e; θ_t) is the representation feature of the e-th piece of text modality information produced by the text modality generator with parameter set θ_t, and x_t^e is its original feature; G_v(x_v^e; θ_v) is the representation feature of the e-th piece of image modality information produced by the image modality generator with parameter set θ_v, and x_v^e is its original feature; during training, text modality information and image modality information are input in pairs, and E is the number of data pairs; the discriminator function D(·; θ_p), with parameter set θ_p, operates on the representation features of each piece of text and image modality information within the same representation subspace.
On the other hand, the invention also provides a social media cross-modal data information searching method, which comprises the following steps:
inputting data information to be searched into a generator to obtain representation characteristics of the data information to be searched;
wherein the generator is obtained through adversarial learning based on the training sample set; the training sample set comprises: social media data information of a plurality of modalities, with the topic to which each piece of data information belongs and its modality as labels; the data information of the plurality of modalities comprises text modality information and image modality information; the generator comprises: a text modality generator and an image modality generator, which are used to obtain the original features of the data information in the corresponding modalities, segment each original feature into a plurality of corresponding local features, and obtain, based on the local features and through a self-attention mechanism, the representation features of the data information of each modality in the same representation subspace; the generator is supervised adversarially by a discriminator, the loss functions employed by the discriminator comprising: a generation loss function, obtained as the weighted sum of an intra-modal semantic loss function and an inter-modal similarity loss function, and a cross-modal discriminant loss function; the distribution difference between the representation features and the corresponding topic labels is minimized by minimizing the value of the intra-modal semantic loss function, the correlation between the representation features of different-modality data information under the same topic is maximized by minimizing the value of the inter-modal similarity loss function, and the modality distinction between the representation features of different-modality data information is maximized by minimizing the value of the cross-modal discriminant loss function; parameters are adjusted to optimize the generator by minimizing the difference between the value of the generation loss function and the value of the cross-modal discriminant loss function; parameters are adjusted to optimize the discriminator by maximizing the difference between the value of the generation loss function and the value of the cross-modal discriminant loss function; and the final generator is obtained after multiple iterations;
traversing the existing data information of the target modality, and obtaining the representation features of the existing data information generated by the generator;

and obtaining, based on similarity matching, the existing data information of the one or more target modalities most similar to the representation features of the data information to be searched.
In some embodiments, obtaining the existing data information of one or more target modalities closest to the representation features of the data information to be searched based on similarity matching includes:
calculating, based on the representation feature of the data information to be searched and the representation features corresponding to the existing data information of the target modality, the L2 norm of the cross-modal match as the similarity:

    sim = || G_t(x_t^i; θ_t) - G_v(x_v^j; θ_v) ||_2

where G_t(x_t^i; θ_t) is the representation feature of the i-th piece of text modality information produced by the text modality generator with parameter set θ_t, and x_t^i is its original feature; G_v(x_v^j; θ_v) is the representation feature of the j-th piece of image modality information produced by the image modality generator with parameter set θ_v, and x_v^j is its original feature; one of G_t(x_t^i; θ_t) and G_v(x_v^j; θ_v) is fixed as the representation feature of the data information to be searched in its modality, while the other ranges over the representation features of the existing data of the target modality;

and ranking the existing data information by similarity, and obtaining the existing data information of the one or more target modalities with the highest similarity to the data information to be searched.
In another aspect, the present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method.
In another aspect, the present invention also provides a computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the above-mentioned method.
The beneficial effects are as follows: the generator maps the representation features of text modality information and image modality information under a self-attention mechanism, extracting the semantic features of cross-modal data content in social media within the same representation subspace; based on generative adversarial learning, the supervision of the discriminator improves the accuracy with which the representation features produced by the generator map to their corresponding topics, both within a modality and across modalities, while differentiating the distributions of the representation features of different-modality data information under the same topic. The method thereby adapts to the semantic sparsity of data information in social media and improves the accuracy of search across cross-modal data information.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to what has been particularly described hereinabove, and that the above and other objects that can be achieved with the present invention will be more clearly understood from the following detailed description.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principle of the invention. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. For purposes of illustrating and describing some portions of the present invention, corresponding parts may be exaggerated in the drawings, i.e., may be larger relative to other components in an exemplary device actually made according to the present invention. In the drawings:
fig. 1 is a schematic flowchart of a training method for a data feature generator in a social media cross-modal search according to an embodiment of the present invention;
fig. 2 is a schematic logical structure diagram of a training method for a data feature generator in a social media cross-modal search according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram illustrating iterative optimization of a training method for a data feature generator in a cross-modal search of social media according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a social media cross-modality data information search method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.
It should be emphasized that the term "comprises/comprising" when used herein, is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
It is also noted that, unless otherwise specified, the term "coupled" is used herein to refer not only to a direct connection, but also to an indirect connection with an intermediate.
It should be noted that the "modality" mentioned in the present invention refers to the form of data information, and may include: text information, image information, audio information, or video information. "Topic" refers to the semantically oriented content of data information, representing the matters discussed and focused on in social interaction; for example, a particular news topic contains the pieces of text information, image information, audio information, or video information associated with it.
Because data information semantics in the social media are sparse, text modal information data is short and irregular, image modal information resolution is low, and composition is incomplete, the work of searching data information in the social media in a cross-modal mode is difficult to realize. In the prior art, the search of cross-modal data information in social media is difficult to adapt to the characteristic of sparse semantics to realize high-precision search, or the search analysis process is too complex and the realization difficulty is high.
The invention provides a training and searching method for a data feature generator in cross-modal search of social media, which is used for extracting representation features of data information in the social media, realizing the search of the cross-modal data information in a similarity comparison mode, improving the precision of search matching, simplifying the search implementation process and improving the efficiency.
In one aspect, the present invention provides a training method for a data representation feature generator in social media cross-modal search, wherein the generator for extracting the representation features of data information in social media is produced by adversarial learning. As shown in fig. 1, the training method includes steps S101 to S104; it should be noted that this numbering does not limit the order of the steps, and it should be understood that, in the training process, steps S101 to S104 may in some cases be performed synchronously or in a different order:
step S101: obtaining a training sample set, the training sample set comprising: social media data information of multiple modalities, and topics to which the data information belongs and corresponding modalities are used as tags; wherein, the data information of the plurality of modalities comprises: text modality information and image modality information.
Step S102: based on the training sample set, the generator is adopted to obtain the representation characteristics of each data information, and the generator comprises: the character modal generator and the image modal generator are used for acquiring original features of data information under corresponding modalities, dividing each original feature to acquire a plurality of corresponding local features, and acquiring representation features of the data information under each modality in the same representation subspace through a self-attention mechanism based on the local features.
Step S103: supervising the countermeasure generator by means of a discriminator, the penalty function employed by the discriminator comprising: a generating loss function obtained by weighting and summing the intra-modal semantic loss function and the inter-modal similarity loss function, and a cross-modal discriminant loss function; the distribution difference between the representation features and the corresponding topic labels is minimized by minimizing the calculated value of a semantic loss function in the modes, the correlation between the representation features of different modal data information under the same topic is maximized by minimizing the calculated value of a similarity loss function between the modes, and the difference of the representation features of different modal data information about the modes is maximized by minimizing the calculated value of a cross-mode discriminant loss function.
Step S104: adjusting a parameter optimization generator by minimizing the difference between the calculated value of the generated loss function and the calculated value of the cross-mode discriminant loss function; adjusting a parameter and optimizing a discriminator by maximizing the difference between a calculated value of a generated loss function and a calculated value of a cross-mode discrimination loss function; and carrying out multiple iterations to obtain a final generator.
In step S101, data information in social media is taken as the data items, a sample training set is established for adversarial learning, and the topic and modality corresponding to each piece of data information are marked as labels. To embody the characteristics of social media cross-modal search, the data information comprises at least the two forms of text modality information and image modality information; in other embodiments, to meet higher retrieval requirements, the data information may also comprise audio modality information and/or video information. The topic labels and modality labels corresponding to the data information may be marked in one-hot encoded form; in other embodiments, they may also be marked with other forms of label encoding as the situation requires. In some embodiments, the numbers of pieces of text modality information and image modality information in the sample training set are by default equal.

In step S102, as shown in fig. 2, an adversarial learning approach is adopted, and the generator is used to obtain the representation features of each piece of data information for similarity comparison, thereby realizing cross-modal search. Specifically, because data forms differ greatly between modalities, features extracted separately from data information of different modalities are inconsistent in form, content, meaning and standard; they do not lie in the same evaluation dimension, cannot be compared directly, and therefore cannot be used for mutual retrieval. To search between data information of different modalities, it is necessary to obtain features of the same evaluation dimension generated from the different modalities, i.e., features within the same representation subspace.

In this embodiment, the generator first obtains the original features generated directly from each piece of data information; the form and extraction method of the original features are determined by the modality of the corresponding data information. For example, text modality information may use TF-IDF (term frequency-inverse document frequency) features as its original features, and image modality information may use convolution features (VGGNet convolutional neural network features) as its original features. Following the principle of the self-attention mechanism, the original feature of each piece of text or image modality information is divided into a plurality of local features. For a single piece of text or image modality information, the attention parameters of each local feature relative to the other local features are obtained, and their products are accumulated in a unified representation subspace, so that the attention of each local feature can be expressed and its corresponding output feature obtained. The feature vector formed by combining the output features of the local features serves as the final representation feature, whose semantics and modality are reflected in an associated manner.
Specifically, the training sample set contains data information of a plurality of topics, and the modalities of the data information include two types, text and image. Define the training sample set as C = {t_1, t_2, …, t_M, v_1, v_2, …, v_N}, where t_m denotes the m-th piece of text modality information and v_n the n-th piece of image modality information; at the same time, the topic and modality corresponding to each piece of text and image modality information are marked as labels in the training sample set. The topic labels and modality labels may be marked as one-hot encoded vectors: Q states are encoded with a Q-bit state register, each state has its own register bit, and exactly one bit is active at a time to express one state. For example, with 5 topic categories encoded by a 5-bit state register, data information belonging to the first category has the label [1,0,0,0,0]. The modality can be encoded with a 2-bit state register, so that [1,0] marks the text modality and [0,1] marks the image modality.
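As a concrete illustration of the one-hot labeling just described, consider the following minimal Python sketch (the helper name is illustrative, not from the patent):

    import numpy as np

    def one_hot(state_index: int, num_states: int) -> np.ndarray:
        # Q-bit state register with exactly one active bit
        v = np.zeros(num_states, dtype=np.float32)
        v[state_index] = 1.0
        return v

    topic_label = one_hot(0, 5)   # first of five topics -> [1, 0, 0, 0, 0]
    text_label = one_hot(0, 2)    # [1, 0] marks the text modality
    image_label = one_hot(1, 2)   # [0, 1] marks the image modality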
In some embodiments, obtaining raw features of data information of multiple modalities includes:
obtaining the TF-IDF feature of each piece of text modality information as its original feature and the convolution feature of each piece of image modality information as its original feature, and recording the original features of the data information as X = {x_t^1, x_t^2, …, x_t^M, x_v^1, x_v^2, …, x_v^N}, where x_t^m is the original feature of the m-th piece of text modality information, x_v^n is the original feature of the n-th piece of image modality information, 1 ≤ m ≤ M, 1 ≤ n ≤ N, and M and N are positive integers.
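A minimal sketch of this raw-feature extraction follows; the sklearn TF-IDF settings and the choice of VGG19's fully connected activations are assumptions for illustration (the text specifies only TF-IDF features for text and VGGNet convolution features for images):

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image
    from sklearn.feature_extraction.text import TfidfVectorizer

    def text_raw_features(corpus):
        # one TF-IDF vector x_t per social media text
        vectorizer = TfidfVectorizer(max_features=4096)
        return vectorizer.fit_transform(corpus).toarray()

    def image_raw_features(paths):
        # one VGG feature vector x_v per image (4096-d, an assumed layer choice)
        vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).eval()
        extractor = torch.nn.Sequential(vgg.features, vgg.avgpool,
                                        torch.nn.Flatten(), *vgg.classifier[:5])
        prep = T.Compose([T.Resize((224, 224)), T.ToTensor()])
        with torch.no_grad():
            batch = torch.stack([prep(Image.open(p).convert("RGB")) for p in paths])
            return extractor(batch)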
In some embodiments, as shown in fig. 2, each original feature is segmented to obtain a plurality of corresponding local features, and the representation features of the data information of each modality in the same representation subspace are obtained through a self-attention mechanism based on the local features, comprising steps S201 to S207, where S202 to S204 generate the representation features of text modality information and S205 to S207 generate the representation features of image modality information:

S201: dividing the TF-IDF feature of each piece of text modality information and the convolution feature of each piece of image modality information into k blocks, recorded as x_t^m = {b_t^{m,1}, b_t^{m,2}, …, b_t^{m,k}} and x_v^n = {b_v^{n,1}, b_v^{n,2}, …, b_v^{n,k}}, where b_t^{m,i} is the i-th block of text semantic features of the m-th piece of text modality information and b_v^{n,i} is the i-th block of image semantic features of the n-th piece of image modality information.

S202: using functions f_t and g_t to convert the segmented text semantic features into the representation subspace:

    f_t(b_t^{m,i}) = w_t^f · b_t^{m,i},    g_t(b_t^{m,j}) = w_t^g · b_t^{m,j}

where w_t^f and w_t^g are the parameter vectors of f_t and g_t.

S203: calculating the attention parameter between the i-th and the j-th block of text semantic features of the m-th piece of text modality information:

    α_t^m(i,j) = exp( f_t(b_t^{m,i}) · g_t(b_t^{m,j}) ) / Σ_{j'=1}^{k} exp( f_t(b_t^{m,i}) · g_t(b_t^{m,j'}) )

S204: calculating the output feature of the i-th block of text semantic features of the m-th piece of text modality information:

    o_t^{m,i} = Σ_{j=1}^{k} α_t^m(i,j) · h_t(b_t^{m,j}),    with h_t(b_t^{m,j}) = w_t^h · b_t^{m,j}

where w_t^h is the parameter vector of h_t; the representation feature of the m-th piece of text modality information is output as: S_t^m = {o_t^{m,1}, o_t^{m,2}, …, o_t^{m,k}}.

S205: using functions f_v and g_v to convert the segmented image semantic features into the representation subspace:

    f_v(b_v^{n,i}) = w_v^f · b_v^{n,i},    g_v(b_v^{n,j}) = w_v^g · b_v^{n,j}

where w_v^f and w_v^g are the parameter vectors of f_v and g_v.

S206: calculating the attention parameter between the i-th and the j-th block of image semantic features of the n-th piece of image modality information:

    α_v^n(i,j) = exp( f_v(b_v^{n,i}) · g_v(b_v^{n,j}) ) / Σ_{j'=1}^{k} exp( f_v(b_v^{n,i}) · g_v(b_v^{n,j'}) )

S207: calculating the output feature of the i-th block of image semantic features of the n-th piece of image modality information:

    o_v^{n,i} = Σ_{j=1}^{k} α_v^n(i,j) · h_v(b_v^{n,j}),    with h_v(b_v^{n,j}) = w_v^h · b_v^{n,j}

where w_v^h is the parameter vector of h_v; the representation feature of the n-th piece of image modality information is: S_v^n = {o_v^{n,1}, o_v^{n,2}, …, o_v^{n,k}}.
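Steps S201 to S207 can be summarized in a minimal PyTorch sketch; it assumes that f, g and h are learned linear maps and that the attention parameters are softmax-normalized, with illustrative dimensions rather than the patent's exact configuration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ModalityGenerator(nn.Module):
        """Splits a raw feature into k local blocks and applies self-attention."""
        def __init__(self, raw_dim: int, k: int, hidden_dim: int):
            super().__init__()
            assert raw_dim % k == 0
            self.k, self.block_dim = k, raw_dim // k
            self.f = nn.Linear(self.block_dim, hidden_dim, bias=False)  # f(.)
            self.g = nn.Linear(self.block_dim, hidden_dim, bias=False)  # g(.)
            self.h = nn.Linear(self.block_dim, hidden_dim, bias=False)  # h(.)

        def forward(self, x):                        # x: (batch, raw_dim)
            b = x.view(-1, self.k, self.block_dim)   # k local semantic features
            # attention parameter between block i and block j (softmax over j)
            attn = F.softmax(self.f(b) @ self.g(b).transpose(1, 2), dim=-1)
            o = attn @ self.h(b)                     # output feature of each block
            return o.flatten(1)                      # S = {o_1, ..., o_k}

One generator instance per modality (one for text, one for images) then yields representation features of equal dimension in the shared representation subspace.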
In this embodiment, the representation features expressing the semantics of the text modality information and the image modality information are extracted through a self-attention mechanism and their evaluation dimensions are unified, so that the representation features of data information of different modalities lie in the same representation subspace, enabling the search of cross-modal data information. The functions f_t, g_t and h_t convert each local feature of the original feature of text modality information into the representation subspace, and the functions f_v, g_v and h_v convert each local feature of the original feature of image modality information into the same representation subspace, unifying the evaluation dimensions. The parameter vectors corresponding to f_t, g_t, h_t and f_v, g_v, h_v are iteratively updated to their optimal values under the supervision of the discriminator during adversarial learning.
In step S103, as shown in fig. 2, optimization of the generator is achieved by having the discriminator and the generator form an adversarial learning system. Specifically, through the supervision of the discriminator, the present invention adjusts the generator so that the representation features generated from the social media data information achieve the following effects: 1. the distribution difference between the representation features and the corresponding topic labels is minimized, i.e., the representation feature of each piece of data information is accurately associated with and represents its topic; 2. the topic relevance between the representation features of different-modality data information under the same topic is maximized, i.e., those representation features converge in the semantic dimension; 3. the modality distinction between the representation features of different-modality data information is strengthened, i.e., the representation features of different modalities under the same topic remain differentiated in the modality dimension. To make the representation features generated by the generator achieve these three effects, supervised learning is performed by the discriminator, specifically through the joint action of the generation loss function, obtained as the weighted sum of the intra-modal semantic loss function and the inter-modal similarity loss function, and the cross-modal discriminant loss function.
In some embodiments, an intra-modal semantic loss function is used to minimize the distribution difference between the representation features and the corresponding topic labels; the intra-modal semantic loss function is:

    L_label = -(1/M) Σ_{i=1}^{M} y_t^i · log p(G_t(x_t^i; θ_t)) - (1/N) Σ_{j=1}^{N} y_v^j · log p(G_v(x_v^j; θ_v))

where y_t^i and y_v^j are the one-hot topic label vectors of the i-th piece of text modality information and the j-th piece of image modality information in the training sample set under the same topic; G_t(x_t^i; θ_t) is the representation feature of the i-th piece of text modality information produced by the text modality generator with parameter set θ_t, and x_t^i is its original feature; G_v(x_v^j; θ_v) is the representation feature of the j-th piece of image modality information produced by the image modality generator with parameter set θ_v, and x_v^j is its original feature; M is the number of pieces of text modality information and N the number of pieces of image modality information in the training sample set; the function p(·) predicts the topic probability distribution of each text or image representation feature, processing the generated representation feature through a fully connected neural network into a probability distribution whose dimension allows multiplication with y_t^i and y_v^j.
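A minimal sketch of the fully connected function p(·) just described; the hidden width is an assumption:

    import torch.nn as nn

    def make_topic_head(representation_dim: int, num_topics: int) -> nn.Module:
        # maps a representation feature to a topic probability distribution
        # whose dimension matches the one-hot label y
        return nn.Sequential(
            nn.Linear(representation_dim, 256), nn.ReLU(),
            nn.Linear(256, num_topics), nn.Softmax(dim=-1),
        )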
In some embodiments, an inter-modal similarity loss function is used to maximize the topic relevance between the representation features of different-modality data information under the same topic; the inter-modal similarity loss function is:

    L_similarity = (1/(M·N)) Σ_{i=1}^{M} Σ_{j=1}^{N} (y_t^i · y_v^j) · || G_t(x_t^i; θ_t) - G_v(x_v^j; θ_v) ||_2

where y_t^i and y_v^j are the one-hot topic label vectors of the i-th piece of text modality information and the j-th piece of image modality information in the training sample set under the same topic; G_t(x_t^i; θ_t) is the representation feature of the i-th piece of text modality information produced by the text modality generator with parameter set θ_t, and x_t^i is its original feature; G_v(x_v^j; θ_v) is the representation feature of the j-th piece of image modality information produced by the image modality generator with parameter set θ_v, and x_v^j is its original feature; M is the number of pieces of text modality information and N the number of pieces of image modality information in the training sample set.
In some embodiments, the intra-modal semantic loss function and the inter-modal similarity loss function are summed in a weighted manner to obtain the generation loss function:

    L_generation = α · L_label + β · L_similarity

where α and β are the weight coefficients of the intra-modal semantic loss function and the inter-modal similarity loss function, respectively. In this embodiment, the effect of the supervision in adversarial learning is adjusted by setting the weight coefficients.
In some embodiments, the cross-modal discriminant loss function is used to strengthen the modality distinction between the representation features of different-modality data information under the same topic; the cross-modal discriminant loss function is:

    L_adv = -(1/E) Σ_{e=1}^{E} c_e · ( log D(G_t(x_t^e; θ_t); θ_p) + log(1 - D(G_v(x_v^e; θ_v); θ_p)) )

where c_e is the one-hot modality label of the searched target data information; G_t(x_t^e; θ_t) is the representation feature of the e-th piece of text modality information produced by the text modality generator with parameter set θ_t, and x_t^e is its original feature; G_v(x_v^e; θ_v) is the representation feature of the e-th piece of image modality information produced by the image modality generator with parameter set θ_v, and x_v^e is its original feature; during training, text modality information and image modality information are input in pairs, and E is the number of data pairs; the discriminator function D(·; θ_p), with parameter set θ_p, operates on the representation features of each piece of text and image modality information within the same representation subspace to predict their modalities.
In this embodiment, since the search process divides into two cases, searching images with text (the searched target data information is image modality information) and searching text with images (the searched target data information is text modality information), the modalities must be distinguished; this is the role of the cross-modal discriminant loss function.
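Taken together, the three losses can be sketched as follows; the functional forms mirror the reconstructed equations above and are illustrative rather than the patent's exact implementation:

    import torch

    def generation_loss(p_t, p_v, y_t, y_v, s_t, s_v, alpha, beta):
        # intra-modal semantic loss: predicted topic distributions vs. one-hot labels
        l_label = -(y_t * torch.log(p_t + 1e-8)).sum(1).mean() \
                  - (y_v * torch.log(p_v + 1e-8)).sum(1).mean()
        # inter-modal similarity loss: pull same-topic cross-modal features together
        same_topic = y_t @ y_v.T             # 1 where the topic labels agree
        l_similarity = (same_topic * torch.cdist(s_t, s_v)).mean()
        return alpha * l_label + beta * l_similarity

    def discriminant_loss(d_t, d_v):
        # cross-modal discriminant loss; d_t and d_v are the discriminator's
        # predicted probabilities that a representation feature is text-modality
        return -(torch.log(d_t + 1e-8).mean() + torch.log(1 - d_v + 1e-8).mean())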
In step S104, as shown in fig. 3, the generator is optimized with the discriminator fixed, and the discriminator is then optimized with the generator fixed; multiple iterations are performed to obtain a better generator, realizing an effective and complete adversarial learning process.
Specifically, in this embodiment, the optimized parameter set θ_t of the text modality generator and the optimized parameter set θ_v of the image modality generator are obtained by minimizing the difference between the generation loss function and the cross-modal discriminant loss function, namely:

    (θ_t*, θ_v*) = argmin over (θ_t, θ_v) of ( L_generation - L_adv )

where θ_t* and θ_v* are the optimized θ_t and θ_v.

The optimized parameter set θ_p of the discriminator is obtained by maximizing the difference between the generation loss function and the cross-modal discriminant loss function, namely:

    θ_p* = argmax over θ_p of ( L_generation - L_adv )

where θ_p* is the optimized θ_p.
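A minimal sketch of this alternating scheme, reusing the modules and loss helpers sketched above (gen_t, gen_v, topic_head, a modality discriminator disc); the optimizers, weights and update order within one iteration are assumptions:

    import torch

    def adversarial_step(gen_t, gen_v, topic_head, disc, batch,
                         opt_g, opt_d, alpha=1.0, beta=1.0):
        x_t, y_t, x_v, y_v = batch   # paired text/image raw features and labels

        # (1) fix the discriminator, optimize the generators:
        #     minimize L_generation - L_adv
        s_t, s_v = gen_t(x_t), gen_v(x_v)
        l_gen = generation_loss(topic_head(s_t), topic_head(s_v),
                                y_t, y_v, s_t, s_v, alpha, beta)
        l_adv = discriminant_loss(disc(s_t), disc(s_v))
        opt_g.zero_grad()
        (l_gen - l_adv).backward()
        opt_g.step()

        # (2) fix the generators, optimize the discriminator: maximizing
        #     L_generation - L_adv over theta_p reduces to minimizing L_adv,
        #     since L_generation does not depend on theta_p
        s_t, s_v = gen_t(x_t).detach(), gen_v(x_v).detach()
        l_adv = discriminant_loss(disc(s_t), disc(s_v))
        opt_d.zero_grad()
        l_adv.backward()
        opt_d.step()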
On the other hand, the invention also provides a social media cross-modal data information searching method, as shown in fig. 4, including steps S301 to S303:
step S301: and inputting the data information to be searched into a generator to obtain the representation characteristics of the data information to be searched.
Wherein the generator is obtained by counterlearning based on the training samplers; the training sample set includes: social media data information of multiple modalities, and topics to which the data information belongs and corresponding modalities are used as tags; wherein, the data information of the plurality of modalities comprises: text modality information and image modality information; the generator comprises: the system comprises a character modal generator and an image modal generator, wherein the character modal generator and the image modal generator are used for acquiring original features of data information under corresponding modalities, dividing each original feature to acquire a plurality of corresponding local features, and acquiring representation features of the data information under each modality in the same representation subspace through a self-attention mechanism based on the local features; supervising the countermeasure generator by means of a discriminator, the penalty function employed by the discriminator comprising: a generating loss function obtained by weighting and summing the intra-modal semantic loss function and the inter-modal similarity loss function, and a cross-modal discriminant loss function; the distribution difference between the representation features and the corresponding topic labels is minimized by minimizing the calculation value of a semantic loss function in the modes, the correlation between the representation features of different mode data information under the same topic is maximized by minimizing the calculation value of a similarity loss function between the modes, and the difference of the representation features of the different mode data information about the modes is maximized by minimizing the calculation value of a cross-mode discriminant loss function; adjusting a parameter optimization generator by minimizing the difference between the calculated value of the generated loss function and the calculated value of the cross-mode discriminant loss function; adjusting a parameter and optimizing a discriminator by maximizing the difference between a calculated value of a generated loss function and a calculated value of a cross-mode discrimination loss function; and carrying out multiple iterations to obtain a final generator.
Step S302: and traversing the existing data information of the target modality, and acquiring the representation characteristics generated by the same generator of each existing data information.
Step S303: and acquiring the existing data information of one or more target modes which are most similar to the representation characteristics of the data information to be searched based on similarity matching.
Based on the same inventive concept as steps S101 to S104, in step S301 of this embodiment, the generator produced by the above training method for the data representation feature generator in social media cross-modal search is used to obtain the representation features of the data information to be searched. In step S302, the existing data are traversed to obtain the representation features generated for each piece of existing data by the generator of step S301. In step S303, the one or more closest pieces of existing data information of the target modality are obtained by similarity-matching search.
In some embodiments, step S303, i.e., obtaining the existing data information of the one or more target modalities most similar to the representation features of the data information to be searched based on similarity matching, includes S3031 to S3032:
S3031: calculating, based on the representation feature of the data information to be searched and the representation features corresponding to the existing data information of the target modality, the L2 norm of the cross-modal match as the similarity:

    sim = || G_t(x_t^i; θ_t) - G_v(x_v^j; θ_v) ||_2        (14)

where G_t(x_t^i; θ_t) is the representation feature of the i-th piece of text modality information produced by the text modality generator with parameter set θ_t, and x_t^i is its original feature; G_v(x_v^j; θ_v) is the representation feature of the j-th piece of image modality information produced by the image modality generator with parameter set θ_v, and x_v^j is its original feature; one of G_t(x_t^i; θ_t) and G_v(x_v^j; θ_v) is fixed as the representation feature of the data information to be searched in its modality, while the other ranges over the representation features of the existing data of the target modality.

S3032: ranking the existing data information by similarity, and obtaining the existing data information of the one or more target modalities with the highest similarity to the data information to be searched.

In this embodiment, when searching for image modality information based on text modality information, G_t(x_t^i; θ_t) is fixed as the representation feature of the data information to be searched, the image modality information among the existing data information is traversed, the similarity is calculated by formula (14), and the image modality information in the existing data information is ranked by similarity to obtain the one or more pieces of image modality information with the highest similarity. Similarly, when searching for text modality information based on image modality information, G_v(x_v^j; θ_v) is fixed as the representation feature of the data information to be searched, the text modality information among the existing data information is traversed, the similarity is calculated by formula (14), and the text modality information is ranked by similarity to obtain the one or more pieces of text modality information with the highest similarity. A smaller sim indicates a higher similarity.
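A minimal sketch of steps S3031 to S3032, assuming the representation features are PyTorch tensors; names are illustrative:

    import torch

    def cross_modal_search(query_feat, target_feats, top_k=5):
        # query_feat: (d,) representation feature of the data to be searched;
        # target_feats: (N, d) generator representations of the existing
        # target-modality data; returns indices of the top_k most similar items
        sim = torch.linalg.norm(target_feats - query_feat, dim=1)  # formula (14)
        return torch.argsort(sim)[:top_k]   # ascending: smaller sim, more similar

For a text query searching images, query_feat is the fixed text representation G_t(x_t; θ_t) and target_feats stacks G_v(x_v; θ_v) over all existing images; the roles are swapped for an image query searching text.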
In another aspect, the present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method.
In another aspect, the present invention also provides a computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the above-mentioned method.
In summary, the training and searching method for the data feature generator in social media cross-modal search according to the present invention realizes search between cross-modal data information in social media through adversarial learning, with emphasis on cross-modal content search between text modality information and image modality information. The adversarially learned generator reconstructs the original features of different-modality data information in social media based on a self-attention mechanism and maps them into a directly comparable representation subspace, realizing the search of cross-modal data information. Further, a joint loss function is established through the discriminator: the intra-modal semantic loss function and the inter-modal similarity loss function guide the generated representation features to follow the semantic distribution of the corresponding modality, and the cross-modal discriminant loss function realizes the discrimination of modalities. The method adapts to the semantic sparsity of data information in social media, completes accurate, efficient and stable search of cross-modal data information, and greatly improves efficiency compared with the prior art.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein may be implemented as hardware, software, or a combination of both. Whether they are implemented in hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, the implementation may be, for example, an electronic circuit, an Application-Specific Integrated Circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium or transmitted over a transmission medium or communication link by a data signal carried in a carrier wave. A "machine-readable medium" may include any medium that can store or transfer information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, optical fiber media, radio-frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the Internet or an intranet.
It should also be noted that the exemplary embodiments noted in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed at the same time.
Features described and/or illustrated with respect to one embodiment may be used in the same or a similar way in one or more other embodiments, and/or in combination with or in place of the features of other embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A training method for a data representation feature generator in a social media cross-modal search is characterized by comprising the following steps:
obtaining a training sample set, the training sample set comprising: social media data information of multiple modalities, with the topic to which each piece of data information belongs and its corresponding modality as labels; wherein the data information of the multiple modalities comprises: text modality information and image modality information;
obtaining, with a generator, the representation features of each piece of data information based on the training sample set, the generator comprising: a text modality generator and an image modality generator, which are used for acquiring the original features of the data information in the corresponding modalities, segmenting each original feature to obtain a plurality of corresponding local features, and acquiring, based on the local features, the representation features of the data information of each modality in the same representation subspace through a self-attention mechanism;
adversarially supervising the generator by means of a discriminator, the discriminator employing loss functions comprising: a generation loss function obtained by the weighted summation of an intra-modal semantic loss function and an inter-modal similarity loss function, and a cross-modal discriminant loss function; wherein the distribution difference between the representation features and the corresponding topic labels is minimized by minimizing the calculated value of the intra-modal semantic loss function, the correlation between the representation features of different modal data information under the same topic is maximized by minimizing the calculated value of the inter-modal similarity loss function, and the distinction regarding modality between the representation features of different modal data information is maximized by minimizing the calculated value of the cross-modal discriminant loss function;
adjusting parameters to optimize the generator by minimizing the difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function; adjusting parameters to optimize the discriminator by maximizing the difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function; and iterating multiple times to obtain the final generator;
the method for acquiring the original features of the data information in the corresponding modalities comprises: obtaining the TF-IDF features of the text modality information as the original features of the text modality information, obtaining the convolution features of the image modality information as the original features of the image modality information, and recording the original features of all data information as $X = \{x_t^1, x_t^2, \ldots, x_t^M, x_v^1, x_v^2, \ldots, x_v^N\}$, where $x_t^m$ is the original feature of the m-th piece of text modality information, $x_v^n$ is the original feature of the n-th piece of image modality information, $1 \le m \le M$, $1 \le n \le N$, and M and N are positive integers;
the segmenting of each original feature to obtain a plurality of corresponding local features, and the acquiring, based on the local features, of the representation features of the data information of each modality in the same representation subspace through a self-attention mechanism, comprise the following steps:
dividing the TF-IDF features of the text modality information and the convolution features of the image modality information into k blocks each, recorded as: $x_t^m = \{b_t^{m,1}, b_t^{m,2}, \ldots, b_t^{m,k}\}$ and $x_v^n = \{b_v^{n,1}, b_v^{n,2}, \ldots, b_v^{n,k}\}$, where $b_t^{m,k}$ is the k-th block of text semantic features of the m-th piece of text modality information and $b_v^{n,k}$ is the k-th block of image semantic features of the n-th piece of image modality information;
using functions $f_t$ and $g_t$ to convert the segmented text semantic features into features of the representation subspace:

$$f_t(b_t^{m,i}) = W_f^t\, b_t^{m,i}, \qquad g_t(b_t^{m,j}) = W_g^t\, b_t^{m,j}$$

wherein $W_f^t$ and $W_g^t$ are the parameter vectors of $f_t$ and $g_t$;
the attention parameter between the i-th block and the j-th block of text semantic features of the m-th piece of text modality information is:

$$\beta_{i,j}^{t,m} = \frac{\exp\big(f_t(b_t^{m,i})^{\top}\, g_t(b_t^{m,j})\big)}{\sum_{j=1}^{k} \exp\big(f_t(b_t^{m,i})^{\top}\, g_t(b_t^{m,j})\big)}$$
the output feature of the i-th block of text semantic features of the m-th piece of text modality information is:

$$o_t^{m,i} = \sum_{j=1}^{k} \beta_{i,j}^{t,m}\, h_t(b_t^{m,j}), \qquad h_t(b_t^{m,j}) = w_h^t\, b_t^{m,j}$$

wherein $w_h^t$ is the parameter vector of $h_t$;
the representation feature of the m-th piece of text modality information is: $s_t^m = \{o_t^{m,1}, o_t^{m,2}, \ldots, o_t^{m,k}\}$;
using functions $f_v$ and $g_v$ to convert the segmented image semantic features into features of the representation subspace:

$$f_v(b_v^{n,i}) = W_f^v\, b_v^{n,i}, \qquad g_v(b_v^{n,j}) = W_g^v\, b_v^{n,j}$$

wherein $W_f^v$ and $W_g^v$ are the parameter vectors of $f_v$ and $g_v$;
the attention parameter between the i-th block and the j-th block of image semantic features of the n-th piece of image modality information is:

$$\beta_{i,j}^{v,n} = \frac{\exp\big(f_v(b_v^{n,i})^{\top}\, g_v(b_v^{n,j})\big)}{\sum_{j=1}^{k} \exp\big(f_v(b_v^{n,i})^{\top}\, g_v(b_v^{n,j})\big)}$$
the output feature of the i-th block of image semantic features of the n-th piece of image modality information is:

$$o_v^{n,i} = \sum_{j=1}^{k} \beta_{i,j}^{v,n}\, h_v(b_v^{n,j}), \qquad h_v(b_v^{n,j}) = w_h^v\, b_v^{n,j}$$

wherein $w_h^v$ is the parameter vector of $h_v$;
the representation feature of the n-th piece of image modality information is: $s_v^n = \{o_v^{n,1}, o_v^{n,2}, \ldots, o_v^{n,k}\}$.
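For illustration, the per-block self-attention of claim 1 can be sketched in numpy as below; it assumes $f$, $g$, and $h$ are linear maps, and the parameter matrices here are random placeholders rather than learned values:

```python
# Hedged numpy sketch of the block-wise self-attention: attention weights are
# a softmax over f(b_i)·g(b_j), and each output o_i is a weighted sum of h(b_j).
import numpy as np

def block_self_attention(blocks, W_f, W_g, W_h):
    """blocks: (k, d) array of local features b^1..b^k of one data item.
    Returns the k output features o^1..o^k that form its representation."""
    F, G, H = blocks @ W_f.T, blocks @ W_g.T, blocks @ W_h.T
    scores = F @ G.T                                    # scores[i, j] = f(b_i)·g(b_j)
    beta = np.exp(scores - scores.max(axis=1, keepdims=True))
    beta /= beta.sum(axis=1, keepdims=True)             # softmax over j for each i
    return beta @ H                                     # o_i = sum_j beta[i, j] * h(b_j)

k, d = 8, 64                                            # k blocks of dimension d (assumed)
rng = np.random.default_rng(0)
s_m = block_self_attention(rng.standard_normal((k, d)),
                           *(rng.standard_normal((d, d)) for _ in range(3)))
```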
2. The method for training a data representation feature generator in social media cross-modal search according to claim 1, wherein the intra-modal semantic loss function is:
$$L_{label} = -\frac{1}{M}\sum_{i=1}^{M} y_t^i \cdot \log p\big(G_t(x_t^i;\theta_t)\big) \;-\; \frac{1}{N}\sum_{j=1}^{N} y_v^j \cdot \log p\big(G_v(x_v^j;\theta_v)\big)$$

wherein $y_t^i$ and $y_v^j$ respectively denote the one-hot topic label vectors of the i-th piece of text modality information and the j-th piece of image modality information in the training sample set, and $y_t^i = y_v^j$ under the same topic; $G_t(x_t^i;\theta_t)$ is the representation feature corresponding to the i-th piece of text modality information when the parameter set of the text modality generator is $\theta_t$, and $x_t^i$ is the original feature of the i-th piece of text modality information; $G_v(x_v^j;\theta_v)$ is the representation feature corresponding to the j-th piece of image modality information when the parameter set of the image modality generator is $\theta_v$, and $x_v^j$ is the original feature of the j-th piece of image modality information; M is the number of pieces of text modality information in the training sample set, and N is the number of pieces of image modality information in the training sample set; the function $p(\cdot)$ processes $G_t(x_t^i;\theta_t)$ and $G_v(x_v^j;\theta_v)$ through a fully connected neural network so that its output matches the dimension of $y_t^i$ and $y_v^j$ for multiplication.
3. The method for training a data representation feature generator in social media cross-modal search according to claim 2, wherein the inter-modal similarity loss function is:
$$L_{similarity} = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N} \mathbb{1}\big(y_t^i = y_v^j\big)\,\big\| G_t(x_t^i;\theta_t) - G_v(x_v^j;\theta_v) \big\|_2$$

wherein $y_t^i$ and $y_v^j$ respectively denote the one-hot topic label vectors of the i-th piece of text modality information and the j-th piece of image modality information in the training sample set, and $y_t^i = y_v^j$ under the same topic; $G_t(x_t^i;\theta_t)$ is the representation feature corresponding to the i-th piece of text modality information when the parameter set of the text modality generator is $\theta_t$, and $x_t^i$ is the original feature of the i-th piece of text modality information; $G_v(x_v^j;\theta_v)$ is the representation feature corresponding to the j-th piece of image modality information when the parameter set of the image modality generator is $\theta_v$, and $x_v^j$ is the original feature of the j-th piece of image modality information; M is the number of pieces of text modality information in the training sample set, and N is the number of pieces of image modality information in the training sample set;
the generation loss function is: $L_{generation} = \alpha L_{label} + \beta L_{similarity}$, where $\alpha$ and $\beta$ are respectively the weight coefficients of the intra-modal semantic loss function and the inter-modal similarity loss function.
4. The method for training a data representation feature generator in a cross-modal search of social media according to claim 3, wherein the cross-modal discriminant loss function is:
$$L_{discriminant} = -\frac{1}{E}\sum_{e=1}^{E} c_e \cdot \Big( \log D\big(G_t(x_t^e;\theta_t);\theta_p\big) + \log\big(1 - D(G_v(x_v^e;\theta_v);\theta_p)\big) \Big)$$

wherein $c_e$ is the one-hot modality label of the searched target data information; $G_t(x_t^e;\theta_t)$ is the representation feature corresponding to the e-th piece of text modality information when the parameter set of the text modality generator is $\theta_t$, and $x_t^e$ is the original feature of the e-th piece of text modality information; $G_v(x_v^e;\theta_v)$ is the representation feature corresponding to the e-th piece of image modality information when the parameter set of the image modality generator is $\theta_v$, and $x_v^e$ is the original feature of the e-th piece of image modality information; during training, text modality information and image modality information are input in pairs, and E is the number of data pairs; the function $D(\cdot;\theta_p)$, under the control of the parameter set $\theta_p$, converts the representation features of each piece of text modality information and each piece of image modality information into the same representation subspace.
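Taken together with claim 1, the losses of claims 2 to 4 assemble into the following alternating objective (a compact restatement; the granted formulas themselves appear only as images in the source):

$$\theta_t^{*}, \theta_v^{*} = \arg\min_{\theta_t,\, \theta_v} \big( L_{generation} - L_{discriminant} \big), \qquad \theta_p^{*} = \arg\max_{\theta_p} \big( L_{generation} - L_{discriminant} \big)$$

with $L_{generation} = \alpha L_{label} + \beta L_{similarity}$ as defined in claim 3.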
5. A social media cross-modal data information search method is characterized by comprising the following steps:
inputting the data information to be searched into a generator to obtain the representation features of the data information to be searched;
wherein the generator is trained based on a training sample set by the training method for a data representation feature generator in a social media cross-modal search according to any one of claims 1 to 4; the training sample set comprises: social media data information of multiple modalities, with the topic to which each piece of data information belongs and its corresponding modality as labels; wherein the data information of the multiple modalities comprises: text modality information and image modality information; the generator comprises: a text modality generator and an image modality generator, which are used for acquiring the original features of the data information in the corresponding modalities, segmenting each original feature to obtain a plurality of corresponding local features, and acquiring, based on the local features, the representation features of the data information of each modality in the same representation subspace through a self-attention mechanism; the generator is adversarially supervised by means of a discriminator, the discriminator employing loss functions comprising: a generation loss function obtained by the weighted summation of an intra-modal semantic loss function and an inter-modal similarity loss function, and a cross-modal discriminant loss function; wherein the distribution difference between the representation features and the corresponding topic labels is minimized by minimizing the calculated value of the intra-modal semantic loss function, the correlation between the representation features of different modal data information under the same topic is maximized by minimizing the calculated value of the inter-modal similarity loss function, and the distinction regarding modality between the representation features of different modal data information is maximized by minimizing the calculated value of the cross-modal discriminant loss function; parameters are adjusted to optimize the generator by minimizing the difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function; parameters are adjusted to optimize the discriminator by maximizing the difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function; and the final generator is obtained after multiple iterations;
traversing the existing data information of the target modality, and acquiring the representation features of the existing data information generated by the generator;
and acquiring, based on similarity matching, the one or more pieces of existing data information in the target modality that are closest to the representation features of the data information to be searched.
6. The method for searching social media cross-modal data information according to claim 5, wherein the acquiring, based on similarity matching, of the one or more pieces of existing data information in the target modality that are closest to the representation features of the data information to be searched comprises:
based on the representation features of the data information to be searched and the representation features corresponding to the existing data information of the target modality, calculating an L2 norm of cross-modality matching as a similarity:
$$sim = \big\| G_t(x_t^i;\theta_t) - G_v(x_v^j;\theta_v) \big\|_2$$

wherein $G_t(x_t^i;\theta_t)$ is the representation feature corresponding to the i-th piece of text modality information when the parameter set of the text modality generator is $\theta_t$, and $x_t^i$ is the original feature of the i-th piece of text modality information; $G_v(x_v^j;\theta_v)$ is the representation feature corresponding to the j-th piece of image modality information when the parameter set of the image modality generator is $\theta_v$, and $x_v^j$ is the original feature of the j-th piece of image modality information; $G_t(x_t^i;\theta_t)$ or $G_v(x_v^j;\theta_v)$ is fixed: one is the representation feature of the data information to be searched in the corresponding modality, and the other is the representation feature of each piece of existing data in the target modality;
and ranking the existing data information based on the similarity, and acquiring the one or more pieces of existing data information in the target modality with the highest similarity to the data information to be searched.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 6 are implemented when the processor executes the program.
8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN202010418678.7A 2020-05-18 2020-05-18 Training and searching method for data feature generator in social media cross-modal search Active CN111598712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010418678.7A CN111598712B (en) 2020-05-18 2020-05-18 Training and searching method for data feature generator in social media cross-modal search


Publications (2)

Publication Number Publication Date
CN111598712A CN111598712A (en) 2020-08-28
CN111598712B (en) 2023-04-18

Family

ID=72192242

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010418678.7A Active CN111598712B (en) 2020-05-18 2020-05-18 Training and searching method for data feature generator in social media cross-modal search

Country Status (1)

Country Link
CN (1) CN111598712B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215837B (en) * 2020-10-26 2023-01-06 北京邮电大学 Multi-attribute image semantic analysis method and device
CN113420166A (en) * 2021-03-26 2021-09-21 阿里巴巴新加坡控股有限公司 Commodity mounting, retrieving, recommending and training processing method and device and electronic equipment
CN114091662B (en) * 2021-11-26 2024-05-14 广东伊莱特生活电器有限公司 Text image generation method and device and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299341A (en) * 2018-10-29 2019-02-01 山东师范大学 One kind confrontation cross-module state search method dictionary-based learning and system
CN110059157A (en) * 2019-03-18 2019-07-26 华南师范大学 A kind of picture and text cross-module state search method, system, device and storage medium
CN110222140A (en) * 2019-04-22 2019-09-10 中国科学院信息工程研究所 A kind of cross-module state search method based on confrontation study and asymmetric Hash

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11222415B2 (en) * 2018-04-26 2022-01-11 The Regents Of The University Of California Systems and methods for deep learning microscopy



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant