CN111598712A - Training and searching method for data feature generator in social media cross-modal search - Google Patents

Training and searching method for data feature generator in social media cross-modal search Download PDF

Info

Publication number
CN111598712A
CN111598712A CN202010418678.7A CN202010418678A CN111598712A CN 111598712 A CN111598712 A CN 111598712A CN 202010418678 A CN202010418678 A CN 202010418678A CN 111598712 A CN111598712 A CN 111598712A
Authority
CN
China
Prior art keywords
information
modal
generator
representation
data information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010418678.7A
Other languages
Chinese (zh)
Other versions
CN111598712B (en
Inventor
杜军平
周南
崔婉秋
寇菲菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202010418678.7A priority Critical patent/CN111598712B/en
Publication of CN111598712A publication Critical patent/CN111598712A/en
Application granted granted Critical
Publication of CN111598712B publication Critical patent/CN111598712B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)

Abstract

The invention provides a training and searching method for a data feature generator in cross-modal search of social media, which comprises the following steps: the method comprises the steps of obtaining a training sample set, obtaining representation characteristics of each data information by adopting a generator for countermeasure learning based on the training sample set, monitoring a countermeasure generator through a discriminator, adjusting a parameter and optimizing the generator through a fixed discriminator and adjusting a parameter and optimizing the discriminator through the fixed generator, and iterating for multiple times to obtain a final generator. The searching method comprises the following steps: inputting the data information to be searched into a generator to obtain the representation characteristics of the data information to be searched; traversing the existing data information of the target mode, and acquiring the representation characteristics of the existing data information generated by the generator; and acquiring the existing data information of one or more target modes which are most similar to the representation characteristics of the data information to be searched based on similarity matching. The method can adapt to the characteristic of semantic sparsity of data information in social media, and realizes accurate search of cross-modal data information.

Description

Training and searching method for data feature generator in social media cross-modal search
Technical Field
The invention relates to the technical field of data search, in particular to a training and searching method for a data feature generator in social media cross-modal search.
Background
The premise of searching the cross-modal data content of the social network is to perform search feature mining on the social network data, and two strategies are mainly adopted: the method comprises the steps of searching characteristic analysis and mining based on manual work and searching characteristic mining based on a machine learning method. The social media has a huge data volume, and the text is short and irregular, so that the text has the problem of semantic sparsity; meanwhile, the situation that the image pixels in the social network are low and the composition is incomplete causes the problem of semantic sparsity similar to the social network text. Based on the characteristics, manual search feature analysis cannot adapt to huge data volume in a social network, and the existing machine learning is difficult to realize feature extraction of texts or images with sparse semantics. Therefore, it is difficult to perform a search between different modality data contents.
Disclosure of Invention
In view of this, the embodiment of the present invention provides a training and searching method for a data feature generator in social media cross-modal search, so as to solve the problem that the cross-modal search cannot be performed on data information in social media in the prior art.
The technical scheme of the invention is as follows:
in one aspect, the present invention provides a training method for a data representation feature generator in a social media cross-modal search, including:
obtaining a training sample set, the training sample set comprising: social media data information of multiple modalities, and topics to which the data information belongs and corresponding modalities are used as tags; wherein, the data information of the plurality of modalities comprises: text modality information and image modality information;
obtaining, with a generator, representative features of each data information based on the set of training samples, the generator including: the character modality generator and the image modality generator are used for acquiring original features of the data information in corresponding modalities, dividing each original feature to acquire a plurality of corresponding local features, and acquiring the representation features of the data information in the same representation subspace in each modality through a self-attention mechanism based on the local features;
supervising combating the generator by means of an arbiter, the arbiter employing a loss function comprising: a generating loss function obtained by weighting and summing the intra-modal semantic loss function and the inter-modal similarity loss function, and a cross-modal discriminant loss function; wherein the distribution difference between the representation features and corresponding topic labels is minimized by minimizing a calculated value of the intra-modal semantic loss function, the correlation between the representation features of different modal data information under the same topic is maximized by minimizing a calculated value of the inter-modal similarity loss function, and the distinction about modalities between the representation features of different modal data information is maximized by minimizing a calculated value of the cross-modal discriminant loss function;
tuning and optimizing the generator by minimizing a difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function; adjusting parameters to optimize the discriminator by maximizing a difference between the calculated value of the generating loss function and the calculated value of the cross-modal discriminant loss function; and carrying out multiple iterations to obtain a final generator.
In some embodiments, obtaining raw features of data information of multiple modalities includes:
obtaining TF-IDF characteristics of the character modal information as original characteristics of the character modal information, obtaining convolution characteristics of the image modal information as original characteristics of the image modal information, and recording the original characteristics X ═ X of each data informationt 1,xt 2,…,xt m,xv 1,xv 2,…,xv n},xt mIs the original feature of the m-th text mode information, xv nThe original features of the nth image mode information are M is more than or equal to 1 and less than or equal to M, N is more than or equal to 1 and less than or equal to N, and M and N are positive integers.
In some embodiments, segmenting each original feature to obtain a corresponding plurality of local features, and acquiring the representation features of the data information of each modality in the same representation subspace through a self-attention mechanism based on the local features includes:
dividing the TF-IDF characteristics of the character modal information and the convolution characteristics of the image modal information into k blocks respectively, and recording the k blocks as follows: x is the number oft m={bt m,1,bt m,2,…,bt m,k},xv n={bv n,1,bv n,2,…,bv n,k},bt m,kSemantic features of the kth block of text as mth text mode information, bv n,kThe k block of image semantic features of the n image modality information;
using function ftAnd gtConverting the segmented text semantic features into features representing subspaces:
Figure BDA0002496063800000021
wherein wt fAnd wt gIs ftAnd gtThe parameter vector of (2);
attention parameters between the ith block text semantic feature and the jth block text semantic feature of the mth word modal information are as follows:
Figure BDA0002496063800000022
the output characteristic expression of the i block text semantic characteristic of the m character modal information is as follows:
Figure BDA0002496063800000023
wherein the content of the first and second substances,
Figure BDA0002496063800000024
wt his htThe parameter vector of (2);
the representation characteristics of the mth text mode information are as follows: st m={ot m,1,ot m,2,...,ot m,k};
Using function fvAnd gvConverting the segmented image semantic features into features representing subspaces:
Figure BDA0002496063800000031
wherein wv fAnd wv gIs fvAnd gvThe parameter vector of (2);
the attention parameter between the image semantic feature of the ith block and the image semantic feature of the jth block of the nth image modality information is as follows:
Figure BDA0002496063800000032
the output characteristic expression of the ith block of image semantic characteristics of the nth image modality information is as follows:
Figure BDA0002496063800000033
wherein the content of the first and second substances,
Figure BDA0002496063800000034
wv his hvThe parameter vector of (2);
the representation characteristic of the nth image modality information is: sv n={ov n,1,ov n,2,...,ov n,k}。
In some embodiments, the intra-modal semantic loss function is:
Figure BDA0002496063800000035
wherein, yt iAnd yv jIndividual watchShowing that topic label vectors of the ith character modal information and the jth image modal information in one-hot form in the training sample set are under the same topic
Figure BDA0002496063800000036
Figure BDA0002496063800000037
The parameter set for the text modality generator is thetatThe representation characteristics corresponding to the ith character mode information,
Figure BDA0002496063800000038
the original characteristics of the ith character modal information are obtained;
Figure BDA0002496063800000039
a set of parameters for the image modality generator is combined to θvThe j-th image modality information corresponds to the representation characteristics,
Figure BDA00024960638000000310
original features of jth image modality information; m is the number of the character mode information in the training sample set, and N is the number of the image mode information in the training sample set; function(s)
Figure BDA00024960638000000311
For handles
Figure BDA00024960638000000312
And
Figure BDA00024960638000000313
processed by fully connected neural network to be able to communicate with yt iAnd/or yv jThe dimension of the multiplication.
In some embodiments, the inter-modal similarity loss function is:
Figure BDA00024960638000000314
wherein, yt iAnd yv jRespectively representing the ith character modal information and the jth image modal information one-hot form topic label vectors in the training sample set under the same topic
Figure BDA00024960638000000315
Figure BDA00024960638000000316
The parameter set for the text modality generator is thetatThe representation characteristics corresponding to the ith character mode information,
Figure BDA00024960638000000317
the original characteristics of the ith character modal information are obtained;
Figure BDA00024960638000000318
a set of parameters for the image modality generator is combined to θvThe j-th image modality information corresponds to the representation characteristics,
Figure BDA00024960638000000319
original features of jth image modality information; m is the number of the character mode information in the training sample set, and N is the number of the image mode information in the training sample set;
the generation loss function is: l isgeneration=αLlabel+βLsimilarityα and β are weight coefficients of the intra-modality semantic loss function and the inter-modality similarity loss function, respectively.
In some embodiments, the cross-modal discriminant loss function is:
Figure BDA0002496063800000041
wherein, ceThe modal label is in a form of searched target data information one-hot;
Figure BDA0002496063800000042
the parameter set for the text modality generator is thetatThe representation characteristics corresponding to the e-th character mode information,
Figure BDA0002496063800000043
the original characteristics of the e-th character modal information are obtained;
Figure BDA0002496063800000044
a set of parameters for the image modality generator is combined to θvThe representation characteristic corresponding to the e-th image modality information,
Figure BDA0002496063800000045
original features of the e-th image modality information; in the training process, character modal information and image modal information are input in pairs, and E is the number of data pairs; function(s)
Figure BDA0002496063800000046
In the parameter set thetapAnd converting the representation characteristics of each character mode information and each image mode information into the same representation subspace under the control.
On the other hand, the invention also provides a social media cross-modal data information searching method, which comprises the following steps:
inputting data information to be searched into a generator to obtain representation characteristics of the data information to be searched;
wherein the generator is derived by counterlearning based on a training sampler; the training sample set includes: social media data information of multiple modalities, and topics to which the data information belongs and corresponding modalities are used as tags; wherein, the data information of the plurality of modalities comprises: text modality information and image modality information; the generator includes: the character modality generator and the image modality generator are used for acquiring original features of the data information in corresponding modalities, dividing each original feature to acquire a plurality of corresponding local features, and acquiring the representation features of the data information in the same representation subspace in each modality through a self-attention mechanism based on the local features; supervising combating the generator by means of an arbiter, the arbiter employing a loss function comprising: a generating loss function obtained by weighting and summing the intra-modal semantic loss function and the inter-modal similarity loss function, and a cross-modal discriminant loss function; wherein the distribution difference between the representation features and corresponding topic labels is minimized by minimizing a calculated value of the intra-modal semantic loss function, the correlation between the representation features of different modal data information under the same topic is maximized by minimizing a calculated value of the inter-modal similarity loss function, and the distinction about modalities between the representation features of different modal data information is maximized by minimizing a calculated value of the cross-modal discriminant loss function; tuning and optimizing the generator by minimizing a difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function; adjusting parameters to optimize the discriminator by maximizing a difference between the calculated value of the generating loss function and the calculated value of the cross-modal discriminant loss function; iterating for multiple times to obtain a final generator;
traversing the existing data information of the target mode, and acquiring the representation characteristics of the existing data information generated by the generator;
and acquiring the existing data information of one or more target modes which are most similar to the representation characteristics of the data information to be searched based on similarity matching.
In some embodiments, obtaining the existing data information of one or more target modalities closest to the representation features of the data information to be searched based on similarity matching includes:
based on the representation features of the data information to be searched and the representation features corresponding to the existing data information of the target modality, calculating an L2 norm of cross-modality matching as a similarity:
Figure BDA0002496063800000051
wherein the content of the first and second substances,
Figure BDA0002496063800000052
parameter sets for the text modality generator are
Figure BDA0002496063800000053
The representation characteristics corresponding to the ith character mode information,
Figure BDA0002496063800000054
the original characteristics of the ith character modal information are obtained;
Figure BDA0002496063800000055
a set of parameters for the image modality generator is combined as
Figure BDA0002496063800000056
The j-th image modality information corresponds to the representation characteristics,
Figure BDA0002496063800000057
original features of jth image modality information; fixing
Figure BDA0002496063800000058
Or
Figure BDA0002496063800000059
One is the representation characteristics of the data information to be searched in the corresponding mode, and the other is the representation characteristics of each existing data in the target mode;
and sequencing the existing data information based on the similarity, and acquiring the existing data information of one or more target modes with the highest similarity with the data information to be searched.
In another aspect, the present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method.
In another aspect, the present invention also provides a computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the above-mentioned method.
The method has the advantages that the generator is used for respectively mapping the representation characteristics of the text modal information and the image modal information under the self-attention mechanism, and the semantic characteristics of the cross-modal data content in the social media under the same representation subspace are extracted; based on generation antagonism learning, the accuracy of mapping corresponding topics of the representation features generated by the generator between the same modal data information and between different modal data information is improved by using the supervision of the discriminator, and meanwhile, the distribution of the representation features of different modal data information under the same topic is differentiated. Therefore, the method adapts to the characteristic of semantic sparsity of data information in social media and improves the accuracy of searching between cross-modal data information.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to the specific details set forth above, and that these and other objects that can be achieved with the present invention will be more clearly understood from the detailed description that follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. For purposes of illustrating and describing some portions of the present invention, corresponding parts of the drawings may be exaggerated, i.e., may be larger, relative to other components in an exemplary apparatus actually manufactured according to the present invention. In the drawings:
FIG. 1 is a schematic flowchart illustrating a method for training a data feature generator in a cross-modal search of social media according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a logic structure of a training method for a data feature generator in a cross-modal search of social media according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram illustrating iterative optimization of a training method for a data feature generator in a cross-modal search of social media according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a social media cross-modality data information search method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.
It should be emphasized that the term "comprises/comprising" when used herein, is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
It is also noted herein that the term "coupled," if not specifically stated, may refer herein to not only a direct connection, but also an indirect connection in which an intermediate is present.
It should be noted that, the "modality" mentioned in the present invention refers to the form of data information, and may include: text information, image information, audio information, or video information. "topic" refers to the semantically directed content of data information to represent matters discussed and focused on per media in social interaction, such as a particular news topic, which contains pieces of data information, image information, audio information, or video information associated therewith.
Because data information semantics in the social media are sparse, text modal information data is short and irregular, image modal information resolution is low, and composition is incomplete, the work of searching data information in the social media in a cross-modal mode is difficult to realize. In the prior art, the search of cross-modal data information in social media is difficult to adapt to the characteristic of sparse semantics to realize high-precision search, or the search analysis process is too complex and the realization difficulty is high.
The invention provides a training and searching method for a data feature generator in cross-modal search of social media, which is used for extracting representation features of data information in the social media, realizing the search of the cross-modal data information in a similarity comparison mode, improving the precision of search matching, simplifying the search implementation process and improving the efficiency.
In one aspect, the present invention provides a training method for a data representation feature generator in cross-modal search of social media, wherein the generator for extracting data information representation features in social media is generated based on counterstudy training, as shown in fig. 1, the training method includes steps S101 to S104, it should be noted that the steps of the training method do not limit the sequence, and it should be understood that, in the training process, the steps S101 to S104 may synchronize or change the implementation sequence in some cases:
step S101: obtaining a training sample set, the training sample set comprising: social media data information of multiple modalities, and topics to which the data information belongs and corresponding modalities are used as tags; wherein, the data information of the plurality of modalities comprises: text modality information and image modality information.
Step S102: based on the training sample set, the generator is adopted to obtain the representation characteristics of each data information, and the generator comprises: the character modal generator and the image modal generator are used for acquiring original features of data information under corresponding modalities, dividing each original feature to acquire a plurality of corresponding local features, and acquiring representation features of the data information under each modality in the same representation subspace through a self-attention mechanism based on the local features.
Step S103: supervising the countermeasure generator by means of a discriminator, the penalty function employed by the discriminator comprising: a generating loss function obtained by weighting and summing the intra-modal semantic loss function and the inter-modal similarity loss function, and a cross-modal discriminant loss function; the distribution difference between the representation features and the corresponding topic labels is minimized by minimizing the calculation value of a semantic loss function in the modes, the correlation between the representation features of different mode data information under the same topic is maximized by minimizing the calculation value of a similarity loss function between the modes, and the difference of the representation features of the different mode data information about the modes is maximized by minimizing the calculation value of a cross-mode discriminant loss function.
Step S104: adjusting a parameter optimization generator by minimizing the difference between the calculated value of the generated loss function and the calculated value of the cross-mode discriminant loss function; adjusting a parameter and optimizing a discriminator by maximizing the difference between a calculated value of a generated loss function and a calculated value of a cross-mode discrimination loss function; and carrying out multiple iterations to obtain a final generator.
In step S101, data information in social media is used as a data item, a sample training set is established for counterstudy, and topics and modalities corresponding to the data information are labeled as tags. In order to embody the characteristics of the social media cross-modal search, the data information at least comprises two forms of text modal information and image modal information, and in other embodiments, in order to adapt to higher retrieval requirements, the data information also comprises audio modal information and/or video information. The topic tags and the modal tags corresponding to the data information may be marked in a one-hot encoding form, and in other embodiments, the topic tags and the modal tags may also be marked in other forms of tag encoding according to specific situations. In some embodiments, the number of text modality information and image modality information in the default sample training set is consistent.
In step S102, as shown in fig. 2, a counterstudy method is adopted, and the generator is used to collect the representation features of each data information for similarity comparison, so as to implement cross-modal search. Specifically, because of the great difference in data form between different modality data information, features generated by single extraction of different modality data information are not consistent in form, content, meaning and standard, are not in the same evaluation dimension, cannot be directly compared, and even cannot be used for mutual retrieval. Therefore, in order to search between different modality data information, it is necessary to acquire features of the same evaluation dimension, i.e., features within the same representation subspace, generated by different modality data information.
In the present embodiment, the generator first obtains the original features directly generated from each data message, and the form and the collection method of the original features are determined according to the modality of the corresponding data message. For example, the text modal information may adopt TF-IDF (term frequency-inverse document frequency) characteristics as its corresponding original characteristics; the image modality information may employ a convolution feature (VGGNet convolution neural network feature) as its corresponding raw feature. Based on the principle of a self-attention mechanism, the original features of each piece of text mode information or image mode information are divided into a plurality of local features. For single text mode information or image mode information, attention parameters of each local feature relative to other local features are obtained, and the attention parameters are accumulated after being subjected to product in a uniform expression subspace, so that the attention of the local features can be expressed, and corresponding output features of the local features are obtained. The feature vector formed by combining the output features of the local features is used as a final representation feature, and the semantics and the mode of the feature vector can be reflected in a correlated manner.
Specifically, the training sample set has data information of a plurality of topics, and the modality of the data information includes two types, namely characters and images. Define the training sample set as C ═ t1,t2,…,tm,v1,v2,…,vn},tmRepresenting the m-th text modality information, vnRepresenting nth image modality information; and simultaneously, marking topics and modes corresponding to each character mode information and each graph mode information as labels in a training sample set. Topic tags and modal tags can be labeled in a one-hot (one-hot coded) coding vector form, Q states are coded by using Q-bit state registers, each state has an independent register bit, only one of the register bits is effective, and the tableA state is reached. For example, 5 topic categories, encoded with a 5-bit status register, have tags of [1,0,0 ] when the data information belongs to the first category]. The mode can be encoded by using a 2-bit status register, so that [1,0 ]]Mark text Modal, [0,1]The image modality is labeled.
In some embodiments, obtaining raw features of data information of multiple modalities includes:
obtaining TF-IDF characteristics of character modal information as original characteristics of the character modal information, obtaining convolution characteristics of image modal information as original characteristics of the image modal information, and recording the original characteristics X ═ X of each data informationt 1,xt 2,…,xt m,xv 1,xv 2,…,xv n},xt mIs the original feature of the m-th text mode information, xv nThe original features of the nth image mode information are M is more than or equal to 1 and less than or equal to M, N is more than or equal to 1 and less than or equal to N, and M and N are positive integers.
In some embodiments, as shown in fig. 2, each original feature is segmented to obtain a plurality of corresponding local features, and the representation features of the data information of each modality in the same representation subspace are obtained through an attention-driven mechanism based on the local features, including S201 to S208, where S202 to S204 are generation processes of the text modality information representation features, and S205 to S207 are generation processes of the image modality information representation features:
s201: dividing TF-IDF characteristics of the character modal information and convolution characteristics of the image modal information into k blocks respectively, and recording the k blocks as follows: x is the number oft m={bt m,1,bt m,2,…,bt m,k},xv n={bv n,1,bv n,2,…,bv n,k},bt m,kSemantic features of the kth block of text as mth text mode information, bv n,kAnd the semantic features of the kth block of the nth image modality information.
S202: using function ftAnd gtSemantic features of segmented textConversion to a feature representing a subspace:
Figure BDA0002496063800000091
Figure BDA0002496063800000092
wherein wt fAnd wt gIs ftAnd gtThe parameter vector of (2).
S203: calculating attention parameters between the ith block text semantic feature and the jth block text semantic feature of the mth word modal information as follows:
Figure BDA0002496063800000093
s204: calculating an output characteristic expression of the ith block text semantic characteristic of the mth character modal information as follows:
Figure BDA0002496063800000094
wherein the content of the first and second substances,
Figure BDA0002496063800000095
wt his htThe parameter vector of (2);
outputting the representation characteristics of the mth text mode information as follows: st m={ot m,1,ot m,2,...,ot m,k}。
S205: using function fvAnd gvConverting the segmented image semantic features into features representing subspaces:
Figure BDA0002496063800000096
Figure BDA0002496063800000097
wherein, wv fAnd wv gIs fvAnd gvThe parameter vector of (2);
s206: calculating attention parameters between the ith block image semantic feature and the jth block image semantic feature of the nth image modality information as follows:
Figure BDA0002496063800000098
s207: calculating the output characteristic expression of the ith block of image semantic characteristics of the nth image modality information as follows:
Figure BDA0002496063800000101
wherein the content of the first and second substances,
Figure BDA0002496063800000102
wv his hvThe parameter vector of (2);
the representation characteristics of the nth image modality information are as follows: sv n={ov n,1,ov n,2,...,ov n,k}。
In the embodiment, the representation features for representing the semantics of the text modal information and the image modal information are extracted through a self-attention mechanism, and the evaluation dimensions are unified, so that the representation features of the data information among different modalities are in the same representation subspace, and the search of the cross-modality data information is realized. Wherein the function ft、gtAnd htIs a function f for converting each local feature of the original features of the text modal information into a representation subspacev、gvAnd hvThe method is used for converting each local feature of the original features of the image mode information into the same expression subspace so as to realize the unification of evaluation dimensions. Function ft、gtAnd htAnd a function fv、gvAnd hvCorresponding parameter vectors are all in oppositionIn the learning process, the optimal value is obtained by iterative updating under the supervision of the discriminator.
In step S103, as shown in fig. 2, optimization of the generator is achieved by providing a system in which the arbiter and the generator form a counterlearning. In particular, the invention aims to adjust the generator to make the representation characteristics generated based on the social media data information have the following effects through supervision of the discriminator: 1. minimizing a difference in distribution between the representative features and the corresponding topic labels even if the representative features of the respective data information are accurately associated to represent the topics thereof; 2. the relevance of the representation features of different modal data information of the same topic about the topic is maximized, even if the representation features of different modal data information of the same topic are converged in a semantic angle; 3. the difference of representation features of different modal data information about the modal is strengthened, and even the representation features of different modal data information under the same topic tend to be differentiated in the modal angle. In order to enable the representation features generated by the generator to achieve the 3 effects, supervised learning is required to be performed through a discriminator, and specifically, a generation loss function obtained by weighted summation of an intra-modal semantic loss function and an inter-modal similarity loss function and a cross-modal discriminant loss function are combined to achieve the purpose.
In some embodiments, an intra-modal semantic loss function is used to minimize the distribution difference between the representation features and the corresponding topic tags, the intra-modal semantic loss function being:
Figure BDA0002496063800000103
wherein, yt iAnd yv jRespectively representing the topic label vectors of the ith character modal information and the jth image modal information in the training sample set in the one-hot form under the same topic
Figure BDA0002496063800000104
Figure BDA0002496063800000105
As arguments for text modality generatorsNumber set of thetatThe representation characteristics corresponding to the ith character mode information,
Figure BDA0002496063800000106
the original characteristics of the ith character modal information are obtained;
Figure BDA0002496063800000107
the set of parameters for the image modality generator is combined to θvThe j-th image modality information corresponds to the representation characteristics,
Figure BDA0002496063800000108
original features of jth image modality information; m is the number of the character mode information in the training sample set, and N is the number of the image mode information in the training sample set; is a function of
Figure BDA00024960638000001112
For predicting topic probability distribution of each text or image representing features, processing the generated representation features into a probability distribution capable of being connected with y through a fully-connected neural networkt iAnd/or yv jThe dimension of the multiplication.
In some embodiments, an inter-modal similarity loss function is used to maximize the relevance of a representation feature with respect to a topic between different modal data information for the same topic, the inter-modal similarity loss function being:
Figure BDA0002496063800000111
wherein, yt iAnd yv jRespectively representing the topic label vectors of the ith character modal information and the jth image modal information in the training sample set in the one-hot form under the same topic
Figure BDA0002496063800000112
Figure BDA0002496063800000113
As arguments for text modality generatorsNumber set of thetatThe representation characteristics corresponding to the ith character mode information,
Figure BDA0002496063800000114
the original characteristics of the ith character modal information are obtained;
Figure BDA0002496063800000115
the set of parameters for the image modality generator is combined to θvThe j-th image modality information corresponds to the representation characteristics,
Figure BDA0002496063800000116
original features of jth image modality information; m is the number of the character mode information in the training sample set, and N is the number of the image mode information in the training sample set.
In some embodiments, the intra-modality semantic loss function and the inter-modality similarity loss function are summed in a weighted manner to obtain a resulting loss function. The resulting loss function is: l isgeneration=αLlabel+βLsimilarityα and β are weight coefficients corresponding to the intra-modal semantic loss function and the inter-modal similarity loss function, respectively.
In some embodiments, the cross-modal discriminant loss function is used to enhance the distinction of the representation features with respect to the modalities between different modality data information on the same topic, and the cross-modal discriminant loss function is:
Figure BDA0002496063800000117
wherein, ceThe modal label is in a form of searched target data information one-hot;
Figure BDA0002496063800000118
the set of parameters for the text modality generator is thetatThe representation characteristics corresponding to the e-th character mode information,
Figure BDA0002496063800000119
the original characteristics of the e-th character modal information are obtained;
Figure BDA00024960638000001110
the set of parameters for the image modality generator is combined to θvThe representation characteristic corresponding to the e-th image modality information,
Figure BDA00024960638000001111
original features of the e-th image modality information; in the training process, character modal information and image modal information are input in pairs, and E is the number of data pairs; function(s)
Figure BDA00024960638000001113
In the parameter set thetapAnd converting the representation characteristics of each character mode information and each image mode information into the same representation subspace under the control.
In this embodiment, since the search process is divided into two cases, namely, a text search image (the searched target data information is image modality information) and an image search text (the searched target data information is text modality information), it is necessary to distinguish the modalities, that is, to discriminate the action of the loss function across the modalities.
In step S104, as shown in fig. 3, the generator is optimized by the fixed arbiter, and then the generator is optimized by the fixed arbiter, and multiple iterations are performed to obtain a better generator, so as to implement an effective and complete antagonistic learning process.
Specifically, in this embodiment, based on the difference between the minimum generation loss function and the cross-modal discriminant loss function, the optimized parameter set θ of the text modal generator is obtainedtAnd a parameter set theta of an image modality generatorvNamely:
Figure BDA0002496063800000121
wherein the content of the first and second substances,
Figure BDA0002496063800000122
and
Figure BDA0002496063800000123
to the optimized thetatAnd thetav
Obtaining a parameter set theta of an optimization discriminator based on the difference between the maximum generation loss function and the cross-mode discrimination loss functionpNamely:
Figure BDA0002496063800000124
wherein the content of the first and second substances,
Figure BDA0002496063800000125
to the optimized thetap
On the other hand, the invention also provides a social media cross-modal data information searching method, as shown in fig. 4, including steps S301 to S303:
step S301: and inputting the data information to be searched into a generator to obtain the representation characteristics of the data information to be searched.
Wherein the generator is obtained by counterlearning based on the training samplers; the training sample set includes: social media data information of multiple modalities, and topics to which the data information belongs and corresponding modalities are used as tags; wherein, the data information of the plurality of modalities comprises: text modality information and image modality information; the generator comprises: the system comprises a character modal generator and an image modal generator, wherein the character modal generator and the image modal generator are used for acquiring original features of data information under corresponding modalities, dividing each original feature to acquire a plurality of corresponding local features, and acquiring representation features of the data information under each modality in the same representation subspace through a self-attention mechanism based on the local features; supervising the countermeasure generator by means of a discriminator, the penalty function employed by the discriminator comprising: a generating loss function obtained by weighting and summing the intra-modal semantic loss function and the inter-modal similarity loss function, and a cross-modal discriminant loss function; the distribution difference between the representation features and the corresponding topic labels is minimized by minimizing the calculation value of a semantic loss function in the modes, the correlation between the representation features of different mode data information under the same topic is maximized by minimizing the calculation value of a similarity loss function between the modes, and the difference of the representation features of the different mode data information about the modes is maximized by minimizing the calculation value of a cross-mode discriminant loss function; adjusting a parameter optimization generator by minimizing the difference between the calculated value of the generated loss function and the calculated value of the cross-mode discriminant loss function; adjusting a parameter and optimizing a discriminator by maximizing the difference between a calculated value of a generated loss function and a calculated value of a cross-mode discrimination loss function; and carrying out multiple iterations to obtain a final generator.
Step S302: and traversing the existing data information of the target modality, and acquiring the representation characteristics generated by the same generator of each existing data information.
Step S303: and acquiring the existing data information of one or more target modes which are most similar to the representation characteristics of the data information to be searched based on similarity matching.
Based on the same inventive concept as steps S101 to S104, in step S301 of this embodiment, a generator generated by the training method of the data representation feature generator in the social media cross-modality search is used to collect the representation features of the data information to be searched. In step S302, the existing data is traversed to obtain the representation features of each existing data generated by the generator in step S301. In step S303, one or more pieces of recent existing data information of the target modality are obtained by the proximity matching search.
In some embodiments, in step S303, that is, acquiring existing data information of one or more target modalities that are closest to the representation features of the data information to be searched based on similarity matching, includes S3031 to S3032: :
s3031: based on the representation features of the data information to be searched and the representation features corresponding to the existing data information of the target modality, calculating an L2 norm of cross-modality matching as a similarity:
Figure BDA0002496063800000131
wherein the content of the first and second substances,
Figure BDA0002496063800000132
parameter sets for text modality generators
Figure BDA0002496063800000133
The representation characteristics corresponding to the ith character mode information,
Figure BDA0002496063800000134
the original characteristics of the ith character modal information are obtained;
Figure BDA0002496063800000135
a set of parameters for an image modality generator is
Figure BDA0002496063800000136
The j-th image modality information corresponds to the representation characteristics,
Figure BDA0002496063800000137
original features of jth image modality information; fixing
Figure BDA0002496063800000138
Or
Figure BDA0002496063800000139
One is the representation characteristics of the data information to be searched in the corresponding modality, and the other is the representation characteristics of each existing data in the target modality.
S3032: and sequencing the existing data information based on the similarity, and acquiring the existing data information of one or more target modes with the highest similarity with the data information to be searched.
In the present embodiment, when searching for image modality information based on text modality information, it is fixed
Figure BDA00024960638000001310
Traversing the image mode information in the existing data information for the representation characteristics of the data information to be searched, and calculating the similarity based on the calculation formula (14), and based on the similarityAnd arranging the image modality information in the existing data information to obtain one or more pieces of image modality information with the highest similarity. Similarly, when searching for text modality information based on image modality information, it is fixed
Figure BDA00024960638000001311
Traversing character modal information in the existing data information for representing characteristics of the data information to be searched, calculating the similarity based on a calculation formula (14), and arranging the character modal information in the existing data information based on the similarity to obtain one or more pieces of character modal information with the highest similarity. Wherein, the smaller the sim, the higher the similarity.
In another aspect, the present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method.
In another aspect, the present invention also provides a computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the above-mentioned method.
In summary, the training and searching method for the data feature generator in the social media cross-modal search according to the present invention realizes the search between the social media cross-modal data information by the counterstudy method, and emphasizes the cross-modal content search between the text modal information and the image modal information. The generator for counterlearning reconstructs original features of different modal data information in the social media based on a self-attention mechanism, and the original features are mapped into a representation subspace which can be directly compared, so that the search of cross-modal data information is realized. Further, a joint loss function is established through the discriminator, and the generated representation features are guided to be representation features following the corresponding modal semantic distribution by utilizing the intra-modal semantic loss function and the inter-modal similarity loss function. A loss function is discriminated across modes to achieve discrimination of modes. The method can adapt to the characteristic of sparse data information semantics in social media, complete accurate, efficient and stable search of cross-modal data information, and greatly improve the efficiency compared with the prior art.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein may be implemented as hardware, software, or combinations of both. Whether this is done in hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
It should also be noted that the exemplary embodiments mentioned in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.
Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments in the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A training method for a data representation feature generator in a social media cross-modal search is characterized by comprising the following steps:
obtaining a training sample set, the training sample set comprising: social media data information of multiple modalities, and topics to which the data information belongs and corresponding modalities are used as tags; wherein, the data information of the plurality of modalities comprises: text modality information and image modality information;
obtaining, with a generator, representative features of each data information based on the set of training samples, the generator including: the character modality generator and the image modality generator are used for acquiring original features of the data information in corresponding modalities, dividing each original feature to acquire a plurality of corresponding local features, and acquiring the representation features of the data information in the same representation subspace in each modality through a self-attention mechanism based on the local features;
supervising combating the generator by means of an arbiter, the arbiter employing a loss function comprising: a generating loss function obtained by weighting and summing the intra-modal semantic loss function and the inter-modal similarity loss function, and a cross-modal discriminant loss function; wherein the distribution difference between the representation features and corresponding topic labels is minimized by minimizing a calculated value of the intra-modal semantic loss function, the correlation between the representation features of different modal data information under the same topic is maximized by minimizing a calculated value of the inter-modal similarity loss function, and the distinction about modalities between the representation features of different modal data information is maximized by minimizing a calculated value of the cross-modal discriminant loss function;
tuning and optimizing the generator by minimizing a difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function; adjusting parameters to optimize the discriminator by maximizing a difference between the calculated value of the generating loss function and the calculated value of the cross-modal discriminant loss function; and carrying out multiple iterations to obtain a final generator.
2. The method for training a data representation feature generator in social media cross-modal search according to claim 1, wherein obtaining original features of the data information in corresponding modalities comprises:
obtaining TF-IDF characteristics of the character modal information as original characteristics of the character modal information, obtaining convolution characteristics of the image modal information as original characteristics of the image modal information, and recording the original characteristics X ═ X of each data informationt 1,xt 2,…,xt m,xv 1,xv 2,…,xv n},xt mIs the original feature of the m-th text mode information, xv nThe original features of the nth image mode information are M is more than or equal to 1 and less than or equal to M, N is more than or equal to 1 and less than or equal to N, and M and N are positive integers.
3. The training method of the data representation feature generator in the social media cross-modal search according to claim 2, wherein the step of segmenting each original feature to obtain a plurality of corresponding local features, and obtaining the representation features of the data information of each modality in the same representation subspace through a self-attention mechanism based on the local features comprises the steps of:
dividing the TF-IDF characteristics of the character modal information and the convolution characteristics of the image modal information into k blocks respectively, and recording the k blocks as follows: x is the number oft m={bt m,1,bt m,2,…,bt m,k},xv n={bv n,1,bv n,2,…,bv n,k},bt m,kSemantic features of the kth block of text as mth text mode information, bv n,kIs n thThe kth block of image semantic features of the image modality information;
using function ftAnd gtConverting the segmented text semantic features into features representing subspaces:
Figure FDA0002496063790000021
wherein wt fAnd wt gIs ftAnd gtThe parameter vector of (2);
attention parameters between the ith block text semantic feature and the jth block text semantic feature of the mth word modal information are as follows:
Figure FDA0002496063790000022
the output characteristic expression of the i block text semantic characteristic of the m character modal information is as follows:
Figure FDA0002496063790000023
wherein the content of the first and second substances,
Figure FDA0002496063790000024
wt his htThe parameter vector of (2);
the representation characteristics of the mth text mode information are as follows: st m={ot m,1,ot m,2,...,ot m,k};
Using function fvAnd gvConverting the segmented image semantic features into features representing subspaces:
Figure FDA0002496063790000025
wherein wv fAnd wv gIs fvAnd gvThe parameter vector of (2);
the attention parameter between the image semantic feature of the ith block and the image semantic feature of the jth block of the nth image modality information is as follows:
Figure FDA0002496063790000026
the output characteristic expression of the ith block of image semantic characteristics of the nth image modality information is as follows:
Figure FDA0002496063790000027
wherein the content of the first and second substances,
Figure FDA0002496063790000028
wv his hvThe parameter vector of (2);
the representation characteristic of the nth image modality information is: sv n={ov n,1,ov n,2,...,ov n,k}。
4. The method for training a data representation feature generator in social media cross-modal search according to claim 1, wherein the intra-modal semantic loss function is:
Figure FDA0002496063790000029
wherein, yt iAnd yv jRespectively representing the ith character modal information and the jth image modal information one-hot form topic label vectors in the training sample set under the same topic
Figure FDA0002496063790000031
Figure FDA0002496063790000032
The parameter set for the text modality generator is thetatThe representation characteristics corresponding to the ith character mode information,
Figure FDA0002496063790000033
the original characteristics of the ith character modal information are obtained;
Figure FDA0002496063790000034
a set of parameters for the image modality generator is combined to θvThe j-th image modality information corresponds to the representation characteristics,
Figure FDA0002496063790000035
original features of jth image modality information; m is the number of the character mode information in the training sample set, and N is the number of the image mode information in the training sample set; function(s)
Figure FDA0002496063790000036
For handles
Figure FDA0002496063790000037
And
Figure FDA0002496063790000038
processed by fully connected neural network to be able to communicate with yt iAnd/or yv jThe dimension of the multiplication.
5. The method for training a data representation feature generator in a social media cross-modality search according to claim 4, wherein the inter-modality similarity loss function is:
Figure FDA0002496063790000039
wherein, yt iAnd yv jRespectively representing the ith character modal information and the jth image modal information one-hot form topic label vectors in the training sample set under the same topic
Figure FDA00024960637900000310
Figure FDA00024960637900000311
The parameter set for the text modality generator is thetatThe representation characteristics corresponding to the ith character mode information,
Figure FDA00024960637900000312
the original characteristics of the ith character modal information are obtained;
Figure FDA00024960637900000313
a set of parameters for the image modality generator is combined to θvThe j-th image modality information corresponds to the representation characteristics,
Figure FDA00024960637900000314
original features of jth image modality information; m is the number of the character mode information in the training sample set, and N is the number of the image mode information in the training sample set;
the generation loss function is: l isgeneration=αLlabel+βLsimilarityα and β are weight coefficients of the intra-modality semantic loss function and the inter-modality similarity loss function, respectively.
6. The method for training a data representation feature generator in social media cross-modal search according to claim 5, wherein the cross-modal discriminant loss function is:
Figure FDA00024960637900000315
wherein, ceThe modal label is in a form of searched target data information one-hot;
Figure FDA00024960637900000316
the parameter set for the text modality generator is thetatIs first ofe expression characteristics corresponding to the text mode information,
Figure FDA00024960637900000317
the original characteristics of the e-th character modal information are obtained;
Figure FDA00024960637900000318
a set of parameters for the image modality generator is combined to θvThe representation characteristic corresponding to the e-th image modality information,
Figure FDA00024960637900000319
original features of the e-th image modality information; in the training process, character modal information and image modal information are input in pairs, and E is the number of data pairs; function(s)
Figure FDA00024960637900000320
In the parameter set thetapAnd converting the representation characteristics of each character mode information and each image mode information into the same representation subspace under the control.
7. A social media cross-modal data information search method is characterized by comprising the following steps:
inputting data information to be searched into a generator to obtain representation characteristics of the data information to be searched;
wherein the generator is derived by counterlearning based on a training sampler; the training sample set includes: social media data information of multiple modalities, and topics to which the data information belongs and corresponding modalities are used as tags; wherein, the data information of the plurality of modalities comprises: text modality information and image modality information; the generator includes: the character modality generator and the image modality generator are used for acquiring original features of the data information in corresponding modalities, dividing each original feature to acquire a plurality of corresponding local features, and acquiring the representation features of the data information in the same representation subspace in each modality through a self-attention mechanism based on the local features; supervising combating the generator by means of an arbiter, the arbiter employing a loss function comprising: a generating loss function obtained by weighting and summing the intra-modal semantic loss function and the inter-modal similarity loss function, and a cross-modal discriminant loss function; wherein the distribution difference between the representation features and corresponding topic labels is minimized by minimizing a calculated value of the intra-modal semantic loss function, the correlation between the representation features of different modal data information under the same topic is maximized by minimizing a calculated value of the inter-modal similarity loss function, and the distinction about modalities between the representation features of different modal data information is maximized by minimizing a calculated value of the cross-modal discriminant loss function; tuning and optimizing the generator by minimizing a difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function; adjusting parameters to optimize the discriminator by maximizing a difference between the calculated value of the generating loss function and the calculated value of the cross-modal discriminant loss function; iterating for multiple times to obtain a final generator;
traversing the existing data information of the target mode, and acquiring the representation characteristics of the existing data information generated by the generator;
and acquiring the existing data information of one or more target modes which are most similar to the representation characteristics of the data information to be searched based on similarity matching.
8. The method according to claim 7, wherein the obtaining of the existing data information of one or more target modalities closest to the representation features of the data information to be searched based on similarity matching comprises:
based on the representation features of the data information to be searched and the representation features corresponding to the existing data information of the target modality, calculating an L2 norm of cross-modality matching as a similarity:
Figure FDA0002496063790000041
wherein the content of the first and second substances,
Figure FDA0002496063790000042
parameter sets for the text modality generator are
Figure FDA0002496063790000043
The representation characteristics corresponding to the ith character mode information,
Figure FDA0002496063790000051
the original characteristics of the ith character modal information are obtained;
Figure FDA0002496063790000052
a set of parameters for the image modality generator is combined as
Figure FDA0002496063790000053
The j-th image modality information corresponds to the representation characteristics,
Figure FDA0002496063790000054
original features of jth image modality information; fixing
Figure FDA0002496063790000055
Or
Figure FDA0002496063790000056
One is the representation characteristics of the data information to be searched in the corresponding mode, and the other is the representation characteristics of each existing data in the target mode;
and sequencing the existing data information based on the similarity, and acquiring the existing data information of one or more target modes with the highest similarity with the data information to be searched.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 8 are implemented when the processor executes the program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202010418678.7A 2020-05-18 2020-05-18 Training and searching method for data feature generator in social media cross-modal search Active CN111598712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010418678.7A CN111598712B (en) 2020-05-18 2020-05-18 Training and searching method for data feature generator in social media cross-modal search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010418678.7A CN111598712B (en) 2020-05-18 2020-05-18 Training and searching method for data feature generator in social media cross-modal search

Publications (2)

Publication Number Publication Date
CN111598712A true CN111598712A (en) 2020-08-28
CN111598712B CN111598712B (en) 2023-04-18

Family

ID=72192242

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010418678.7A Active CN111598712B (en) 2020-05-18 2020-05-18 Training and searching method for data feature generator in social media cross-modal search

Country Status (1)

Country Link
CN (1) CN111598712B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215837A (en) * 2020-10-26 2021-01-12 北京邮电大学 Multi-attribute image semantic analysis method and device
CN113420166A (en) * 2021-03-26 2021-09-21 阿里巴巴新加坡控股有限公司 Commodity mounting, retrieving, recommending and training processing method and device and electronic equipment
CN114091662A (en) * 2021-11-26 2022-02-25 广东伊莱特电器有限公司 Text image generation method and device and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299341A (en) * 2018-10-29 2019-02-01 山东师范大学 One kind confrontation cross-module state search method dictionary-based learning and system
CN110059157A (en) * 2019-03-18 2019-07-26 华南师范大学 A kind of picture and text cross-module state search method, system, device and storage medium
CN110222140A (en) * 2019-04-22 2019-09-10 中国科学院信息工程研究所 A kind of cross-module state search method based on confrontation study and asymmetric Hash
US20190333199A1 (en) * 2018-04-26 2019-10-31 The Regents Of The University Of California Systems and methods for deep learning microscopy

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190333199A1 (en) * 2018-04-26 2019-10-31 The Regents Of The University Of California Systems and methods for deep learning microscopy
CN109299341A (en) * 2018-10-29 2019-02-01 山东师范大学 One kind confrontation cross-module state search method dictionary-based learning and system
CN110059157A (en) * 2019-03-18 2019-07-26 华南师范大学 A kind of picture and text cross-module state search method, system, device and storage medium
CN110222140A (en) * 2019-04-22 2019-09-10 中国科学院信息工程研究所 A kind of cross-module state search method based on confrontation study and asymmetric Hash

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215837A (en) * 2020-10-26 2021-01-12 北京邮电大学 Multi-attribute image semantic analysis method and device
CN113420166A (en) * 2021-03-26 2021-09-21 阿里巴巴新加坡控股有限公司 Commodity mounting, retrieving, recommending and training processing method and device and electronic equipment
CN114091662A (en) * 2021-11-26 2022-02-25 广东伊莱特电器有限公司 Text image generation method and device and electronic equipment
CN114091662B (en) * 2021-11-26 2024-05-14 广东伊莱特生活电器有限公司 Text image generation method and device and electronic equipment

Also Published As

Publication number Publication date
CN111598712B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN109543084B (en) Method for establishing detection model of hidden sensitive text facing network social media
Li et al. Weakly supervised deep matrix factorization for social image understanding
CN112800776B (en) Bidirectional GRU relation extraction data processing method, system, terminal and medium
CN111598712B (en) Training and searching method for data feature generator in social media cross-modal search
CN104899253B (en) Towards the society image across modality images-label degree of correlation learning method
JP4514082B2 (en) Method and apparatus for building a text classifier and text classifier
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
US9183173B2 (en) Learning element weighting for similarity measures
US7711673B1 (en) Automatic charset detection using SIM algorithm with charset grouping
CN109831460B (en) Web attack detection method based on collaborative training
CN111914156A (en) Cross-modal retrieval method and system for self-adaptive label perception graph convolution network
CN113239214A (en) Cross-modal retrieval method, system and equipment based on supervised contrast
CN113657425A (en) Multi-label image classification method based on multi-scale and cross-modal attention mechanism
US8560466B2 (en) Method and arrangement for automatic charset detection
CN111475603A (en) Enterprise identifier identification method and device, computer equipment and storage medium
CN114510939A (en) Entity relationship extraction method and device, electronic equipment and storage medium
CN112163114B (en) Image retrieval method based on feature fusion
CN114528827A (en) Text-oriented confrontation sample generation method, system, equipment and terminal
Lee et al. Effective evolutionary multilabel feature selection under a budget constraint
CN116956289B (en) Method for dynamically adjusting potential blacklist and blacklist
CN112445862A (en) Internet of things equipment data set construction method and device, electronic equipment and storage medium
CN112487263A (en) Information processing method, system, equipment and computer readable storage medium
CN116385946A (en) Video-oriented target fragment positioning method, system, storage medium and equipment
US11907307B1 (en) Method and system for event prediction via causal map generation and visualization
Mady et al. Enhancing performance of biomedical named entity recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant