CN111598712B - Training and searching method for data feature generator in social media cross-modal search - Google Patents

Training and searching method for data feature generator in social media cross-modal search Download PDF

Info

Publication number
CN111598712B
CN111598712B CN202010418678.7A CN202010418678A CN111598712B CN 111598712 B CN111598712 B CN 111598712B CN 202010418678 A CN202010418678 A CN 202010418678A CN 111598712 B CN111598712 B CN 111598712B
Authority
CN
China
Prior art keywords
information
modal
generator
representation
data information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010418678.7A
Other languages
Chinese (zh)
Other versions
CN111598712A (en
Inventor
杜军平
周南
崔婉秋
寇菲菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202010418678.7A priority Critical patent/CN111598712B/en
Publication of CN111598712A publication Critical patent/CN111598712A/en
Application granted granted Critical
Publication of CN111598712B publication Critical patent/CN111598712B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)

Abstract

The invention provides a training and searching method for a data feature generator in cross-modal search of social media, which comprises the following steps: the method comprises the steps of obtaining a training sample set, obtaining representation characteristics of each data information by adopting a generator for countermeasure learning based on the training sample set, monitoring a countermeasure generator through a discriminator, adjusting a parameter and optimizing the generator through a fixed discriminator and adjusting a parameter and optimizing the discriminator through the fixed generator, and iterating for multiple times to obtain a final generator. The searching method comprises the following steps: inputting the data information to be searched into a generator to obtain the representation characteristics of the data information to be searched; traversing the existing data information of the target mode, and acquiring the representation characteristics of the existing data information generated by a generator; and acquiring the existing data information of one or more target modes which are most similar to the representation characteristics of the data information to be searched based on similarity matching. The method can adapt to the characteristic of semantic sparsity of data information in social media, and realizes accurate search of cross-modal data information.

Description

Training and searching method for data feature generator in social media cross-modal search
Technical Field
The invention relates to the technical field of data search, in particular to a training and searching method for a data feature generator in social media cross-modal search.
Background
The premise of searching the cross-modal data content of the social network is to perform search feature mining on the social network data, and two strategies are mainly adopted: the method comprises the steps of searching characteristic analysis and mining based on manual work and searching characteristic mining based on a machine learning method. The social media has a huge data volume, and the text is short and irregular, so that the text has the problem of semantic sparsity; meanwhile, the situation that the image pixels in the social network are low and the composition is incomplete causes the problem of semantic sparsity similar to the social network text. Based on the characteristics, manual search feature analysis cannot adapt to huge data volume in a social network, and the existing machine learning is difficult to realize feature extraction of texts or images with sparse semantics. Therefore, it is difficult to perform a search between different modality data contents.
Disclosure of Invention
In view of this, the embodiment of the present invention provides a training and searching method for a data feature generator in social media cross-modal search, so as to solve the problem that the cross-modal search cannot be performed on data information in social media in the prior art.
The technical scheme of the invention is as follows:
in one aspect, the present invention provides a training method for a data representation feature generator in a cross-modal search of social media, including:
obtaining a training sample set, the training sample set comprising: social media data information of multiple modals, and topics to which the data information belongs and corresponding modals are used as tags; wherein, the data information of the plurality of modalities comprises: text mode information and image mode information;
obtaining, with a generator, representative features of each data information based on the set of training samples, the generator including: the character modality generator and the image modality generator are used for acquiring original features of the data information in corresponding modalities, segmenting each original feature to acquire a plurality of corresponding local features, and acquiring the representation features of the data information in the same representation subspace in each modality through a self-attention mechanism based on the local features;
supervising combating the generator by means of an arbiter, the arbiter employing a loss function comprising: a generating loss function obtained by weighting and summing the intra-modal semantic loss function and the inter-modal similarity loss function, and a cross-modal discriminant loss function; wherein the distribution difference between the representation features and corresponding topic tags is minimized by minimizing a calculated value of the intra-modal semantic loss function, the correlation between the representation features of different modal data information under the same topic is maximized by minimizing a calculated value of the inter-modal similarity loss function, and the distinction in terms of modality between the representation features of different modal data information is maximized by minimizing a calculated value of the cross-modal discriminant loss function;
adjusting parameters to optimize the generator by minimizing a difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function; adjusting parameters to optimize the discriminator by maximizing a difference between the calculated value of the generating loss function and the calculated value of the cross-modal discriminant loss function; and carrying out multiple iterations to obtain a final generator.
In some embodiments, obtaining the original features of the data information of each of the plurality of modalities includes:
obtaining TF-IDF characteristics of the character modal information as original characteristics of the character modal information, obtaining convolution characteristics of the image modal information as original characteristics of the image modal information, and recording the original characteristics X = { X } of each data information t 1 ,x t 2 ,…,x t m ,x v 1 ,x v 2 ,…,x v n },x t m Is the original feature of the m-th text mode information, x v n The original features of the nth image mode information are M is more than or equal to 1 and less than or equal to M, N is more than or equal to 1 and less than or equal to N, and M and N are positive integers.
In some embodiments, segmenting each original feature to obtain a corresponding plurality of local features, and acquiring the representation features of the data information of each modality in the same representation subspace through a self-attention mechanism based on the local features includes:
dividing the TF-IDF characteristics of the character modal information and the convolution characteristics of the image modal information into k blocks respectively, and recording the k blocks as follows: x is the number of t m ={b t m,1 ,b t m,2 ,…,b t m,k },x v n ={b v n,1 ,b v n,2 ,…,b v n,k },b t m,k Semantic features of the kth block of text as mth text mode information, b v n,k The k block of image semantic features of the n image modality information;
using function f t And g t Converting the segmented text semantic features into features representing subspaces:
Figure BDA0002496063800000021
Wherein w t f And w t g Is f t And g t The parameter vector of (a);
attention parameters between the ith text semantic feature and the jth text semantic feature of the mth text modal information are as follows:
Figure BDA0002496063800000022
the output feature expression of the i-th block text semantic feature of the m-th character modal information is as follows:
Figure BDA0002496063800000023
wherein it is present>
Figure BDA0002496063800000024
w t h Is h t The parameter vector of (a);
the representation characteristics of the mth text mode information are as follows: s. the t m ={o t m,1 ,o t m,2 ,...,o t m,k };
Using function f v And g v Converting the segmented image semantic features into features representing subspaces:
Figure BDA0002496063800000031
wherein w v f And w v g Is f v And g v The parameter vector of (2);
the attention parameter between the image semantic feature of the ith block and the image semantic feature of the jth block of the nth image modality information is as follows:
Figure BDA0002496063800000032
the output characteristic expression of the ith block of image semantic characteristics of the nth image modality information is as follows:
Figure BDA0002496063800000033
wherein it is present>
Figure BDA0002496063800000034
w v h Is h v The parameter vector of (a);
the representation characteristic of the nth image modality information is: s v n ={o v n,1 ,o v n,2 ,...,o v n,k }。
In some embodiments, the intra-modal semantic loss function is:
Figure BDA0002496063800000035
wherein, y t i And y v j Respectively representing the ith character modal information and the jth image modal information one-hot form topic label vectors in the training sample set under the same topic
Figure BDA0002496063800000036
Figure BDA0002496063800000037
A set of parameters for the text modality generator is θ t The corresponding representation characteristic of the ith character mode information is used for judging whether the character mode is the true or false>
Figure BDA0002496063800000038
The original characteristics of the ith character modal information are obtained; />
Figure BDA0002496063800000039
A set of parameters for the image modality generator is combined to θ v The representation characteristic corresponding to the jth image modality information>
Figure BDA00024960638000000310
Original features of jth image modality information; m is the number of the character mode information in the training sample set, and N is the number of the image mode information in the training sample set; function>
Figure BDA00024960638000000311
For holding>
Figure BDA00024960638000000312
And &>
Figure BDA00024960638000000313
Processed by fully connected neural network to be able to communicate with y t i And/or y v j The dimension of the multiplication.
In some embodiments, the inter-modal similarity loss function is:
Figure BDA00024960638000000314
wherein, y t i And y v j Respectively representing the ith character modal information and the jth image modal information one-hot form topic label vectors in the training sample set under the same topic
Figure BDA00024960638000000315
Figure BDA00024960638000000316
The parameter set for the text modality generator is theta t A representation characteristic corresponding to the ith character mode information>
Figure BDA00024960638000000317
The original characteristics of the ith character modal information are obtained; />
Figure BDA00024960638000000318
A set of parameters for the image modality generator is combined to θ v The j-th image modality information corresponds to the representation characteristics, device for selecting or keeping>
Figure BDA00024960638000000319
Original features of jth image modality information; m is the number of the character mode information in the training sample set, and N is the number of the image mode information in the training sample set;
the generative loss function is: l is generation =αL label +βL similarity And alpha and beta are respectively weight coefficients of the intra-modal semantic loss function and the inter-modal similarity loss function.
In some embodiments, the cross-modal discriminant loss function is:
Figure BDA0002496063800000041
wherein, c e The modal label is in a form of searched target data information one-hot;
Figure BDA0002496063800000042
the parameter set for the text modality generator is theta t The representation characteristic corresponding to the e-th character mode information is used for judging whether the E-th character mode information is matched with the E-th character mode information>
Figure BDA0002496063800000043
The original characteristics of the e-th character modal information are obtained; />
Figure BDA0002496063800000044
A set of parameters for the image modality generator is combined to θ v The representation characteristic corresponding to the e-th image modality information>
Figure BDA0002496063800000045
Original features of the e-th image modality information; in the training process, character modal information and image modal information are input in pairs, and E is dataThe number of pairs; function->
Figure BDA0002496063800000046
In the parameter set theta p And converting the representation characteristics of each character mode information and each image mode information into the same representation subspace under the control.
On the other hand, the invention also provides a social media cross-modal data information searching method, which comprises the following steps:
inputting data information to be searched into a generator to obtain representation characteristics of the data information to be searched;
wherein the generator is based on training samplers and is obtained through antagonistic learning; the training sample set includes: social media data information of multiple modals, and topics to which the data information belongs and corresponding modals are used as tags; wherein, the data information of the plurality of modalities comprises: text modality information and image modality information; the generator includes: the character modality generator and the image modality generator are used for acquiring original features of the data information in corresponding modalities, dividing each original feature to acquire a plurality of corresponding local features, and acquiring the representation features of the data information in the same representation subspace in each modality through a self-attention mechanism based on the local features; supervising the combating of the generator by means of an arbiter, the penalty function employed by the arbiter comprising: a generating loss function obtained by weighting and summing the intra-modal semantic loss function and the inter-modal similarity loss function, and a cross-modal discriminant loss function; wherein the distribution difference between the representation features and corresponding topic labels is minimized by minimizing a calculated value of the intra-modal semantic loss function, the correlation between the representation features of different modal data information under the same topic is maximized by minimizing a calculated value of the inter-modal similarity loss function, and the distinction about modalities between the representation features of different modal data information is maximized by minimizing a calculated value of the cross-modal discriminant loss function; tuning and optimizing the generator by minimizing a difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function; adjusting parameters to optimize the discriminator by maximizing a difference between the calculated value of the generating loss function and the calculated value of the cross-modal discriminant loss function; iterating for multiple times to obtain a final generator;
traversing the existing data information of the target mode, and acquiring the representation characteristics of the existing data information generated by the generator;
and acquiring the existing data information of one or more target modes which are most similar to the representation characteristics of the data information to be searched based on similarity matching.
In some embodiments, obtaining the existing data information of one or more target modalities closest to the representation features of the data information to be searched based on similarity matching includes:
based on the representation features of the data information to be searched and the representation features corresponding to the existing data information of the target modality, calculating an L2 norm of cross-modality matching as a similarity:
Figure BDA0002496063800000051
wherein the content of the first and second substances,
Figure BDA0002496063800000052
for the parameter set of the text modality generator to be>
Figure BDA0002496063800000053
The corresponding representation characteristic of the ith character mode information is used for judging whether the character mode is the true or false>
Figure BDA0002496063800000054
The original characteristics of the ith character modal information are obtained; />
Figure BDA0002496063800000055
Is combined as ^ er for the parameter set of the image modality generator>
Figure BDA0002496063800000056
The representation characteristic corresponding to the jth image modality information>
Figure BDA0002496063800000057
Original features of jth image modality information; fixing>
Figure BDA0002496063800000058
Or->
Figure BDA0002496063800000059
One is the representation characteristics of the data information to be searched in the corresponding mode, and the other is the representation characteristics of each existing data in the target mode;
and sequencing the existing data information based on the similarity, and acquiring the existing data information of one or more target modes with the highest similarity with the data information to be searched.
In another aspect, the present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method.
In another aspect, the present invention also provides a computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the above-mentioned method.
The method has the advantages that the generator is used for respectively mapping the representation characteristics of the text modal information and the image modal information under the self-attention mechanism, and the semantic characteristics of the cross-modal data content in the social media under the same representation subspace are extracted; based on generation antagonism learning, the accuracy of mapping corresponding topics of the representation features generated by the generator between the same modal data information and between different modal data information is improved by using the supervision of the discriminator, and meanwhile, the distribution of the representation features of different modal data information under the same topic is differentiated. Therefore, the method adapts to the characteristic of semantic sparsity of data information in social media and improves the accuracy of searching between cross-modal data information.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to what has been particularly described hereinabove, and that the above and other objects that can be achieved with the present invention will be more clearly understood from the following detailed description.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principle of the invention. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. For purposes of illustrating and describing some portions of the present invention, corresponding parts may be exaggerated in the drawings, i.e., may be larger relative to other components in an exemplary device actually made according to the present invention. In the drawings:
fig. 1 is a schematic flowchart of a training method for a data feature generator in a social media cross-modal search according to an embodiment of the present invention;
fig. 2 is a schematic logical structure diagram of a training method for a data feature generator in a social media cross-modal search according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram illustrating iterative optimization of a training method for a data feature generator in a cross-modal search of social media according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a social media cross-modality data information search method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.
It should be emphasized that the term "comprises/comprising" when used herein, is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
It is also noted that, unless otherwise specified, the term "coupled" is used herein to refer not only to a direct connection, but also to an indirect connection with an intermediate.
It should be noted that, the "modality" mentioned in the present invention refers to the form of data information, and may include: text information, image information, audio information, or video information. "topic" refers to the semantically directed content of data information to represent matters discussed and focused on per media in social interaction, such as a particular news topic, which contains pieces of data information, image information, audio information, or video information associated therewith.
Because data information semantics in the social media are sparse, text modal information data is short and irregular, image modal information resolution is low, and composition is incomplete, the work of searching data information in the social media in a cross-modal mode is difficult to realize. In the prior art, the search of cross-modal data information in social media is difficult to adapt to the characteristic of sparse semantics to realize high-precision search, or the search analysis process is too complex and the realization difficulty is high.
The invention provides a training and searching method for a data feature generator in cross-modal search of social media, which is used for extracting representation features of data information in the social media, realizing the search of the cross-modal data information in a similarity comparison mode, improving the precision of search matching, simplifying the search implementation process and improving the efficiency.
In one aspect, the present invention provides a training method for a data representation feature generator in cross-modal search of social media, wherein the generator for extracting data information representation features in social media is generated based on counterstudy training, as shown in fig. 1, the training method includes steps S101 to S104, it should be noted that the steps of the training method do not limit the sequence, and it should be understood that, in the training process, the steps S101 to S104 may synchronize or change the implementation sequence in some cases:
step S101: obtaining a training sample set, the training sample set comprising: social media data information of multiple modalities, and topics to which the data information belongs and corresponding modalities are used as tags; wherein, the data information of the plurality of modalities comprises: text modality information and image modality information.
Step S102: based on the training sample set, the generator is adopted to obtain the representation characteristics of each data information, and the generator comprises: the character modal generator and the image modal generator are used for acquiring original features of data information under corresponding modalities, dividing each original feature to acquire a plurality of corresponding local features, and acquiring representation features of the data information under each modality in the same representation subspace through a self-attention mechanism based on the local features.
Step S103: supervising the countermeasure generator by means of a discriminator, the penalty function employed by the discriminator comprising: a generating loss function obtained by weighting and summing the intra-modal semantic loss function and the inter-modal similarity loss function, and a cross-modal discriminant loss function; the distribution difference between the representation features and the corresponding topic labels is minimized by minimizing the calculated value of a semantic loss function in the modes, the correlation between the representation features of different modal data information under the same topic is maximized by minimizing the calculated value of a similarity loss function between the modes, and the difference of the representation features of different modal data information about the modes is maximized by minimizing the calculated value of a cross-mode discriminant loss function.
Step S104: adjusting a parameter optimization generator by minimizing the difference between the calculated value of the generated loss function and the calculated value of the cross-mode discriminant loss function; adjusting a parameter and optimizing a discriminator by maximizing the difference between a calculated value of a generated loss function and a calculated value of a cross-mode discrimination loss function; and carrying out multiple iterations to obtain a final generator.
In step S101, data information in social media is used as a data item, a sample training set is established for counterstudy, and topics and modalities corresponding to the data information are labeled as tags. In order to embody the characteristics of the social media cross-modal search, the data information at least comprises two forms of text modal information and image modal information, and in other embodiments, in order to adapt to higher retrieval requirements, the data information also comprises audio modal information and/or video information. The topic tags and the modal tags corresponding to the data information may be marked in a one-hot encoding form, and in other embodiments, the topic tags and the modal tags may also be marked in other forms of tag encoding according to specific situations. In some embodiments, the number of text modality information and image modality information in the default sample training set is consistent.
In step S102, as shown in fig. 2, a counterstudy method is adopted, and the generator is used to collect the representation features of each data information for similarity comparison, so as to implement a cross-modal search. Specifically, because of the great difference in data form between different modality data information, features generated by single extraction of different modality data information are not consistent in form, content, meaning and standard, are not in the same evaluation dimension, cannot be directly compared, and even cannot be used for mutual retrieval. Therefore, in order to search between different modality data information, it is necessary to acquire features of the same evaluation dimension, i.e., features within the same representation subspace, generated by different modality data information.
In the present embodiment, the generator first obtains the original features directly generated from each data message, and the form and the collection method of the original features are determined according to the modality of the corresponding data message. For example, the text modal information may adopt TF-IDF (term frequency-inverse document frequency) characteristics as its corresponding original characteristics; the image modality information may employ a convolution feature (VGGNet convolution neural network feature) as its corresponding raw feature. Based on the principle of a self-attention mechanism, the original features of each piece of text mode information or image mode information are divided into a plurality of local features. For single text mode information or image mode information, attention parameters of each local feature relative to other local features are obtained, and the attention parameters are accumulated after being subjected to product in a uniform expression subspace, so that the attention of the local features can be expressed, and corresponding output features of the local features are obtained. The feature vector formed by combining the output features of the local features is used as a final representation feature, and the semantics and the mode of the feature vector can be reflected in a correlated manner.
Specifically, the training sample set has data information of a plurality of topics, and the modality of the data information includes two types, namely characters and images. Defining the training sample set as C = { t = { (t) } 1 ,t 2 ,…,t m ,v 1 ,v 2 ,…,v n },t m Representing the m-th text modality information, v n Representing nth image modality information; and simultaneously, marking topics and modes corresponding to each character mode information and each graph mode information as labels in a training sample set. The topic labels and the modal labels can be marked in a one-hot (one-hot coding) coding vector form, Q states are coded by Q state registers, each state has an independent register bit, and only one of the register bits is effective to express one state. For example, 5 topic categories, encoded with a 5-bit status register, and when the data information belongs to the first category, the tag is represented as [1,0,0,0,0 ]]. The mode can be coded by using a 2-bit state register, so that 1,0]Mark text Modal, [0,1]The image modality is labeled.
In some embodiments, obtaining raw features of data information of multiple modalities includes:
obtaining TF-IDF characteristics of character modal information as original characteristics of the character modal information, obtaining convolution characteristics of image modal information as original characteristics of the image modal information, and recording the original characteristics X = { X } of each data information t 1 ,x t 2 ,…,x t m ,x v 1 ,x v 2 ,…,x v n },x t m Is the original feature of the m-th text mode information, x v n The original features of the nth image mode information are M is more than or equal to 1 and less than or equal to M, N is more than or equal to 1 and less than or equal to N, and M and N are positive integers.
In some embodiments, as shown in fig. 2, each original feature is segmented to obtain a plurality of corresponding local features, and the representation features of the data information of each modality in the same representation subspace are obtained through an attention-driven mechanism based on the local features, including S201 to S208, where S202 to S204 are generation processes of the text modality information representation features, and S205 to S207 are generation processes of the image modality information representation features:
s201: dividing TF-IDF characteristics of the character modal information and convolution characteristics of the image modal information into k blocks respectively, and recording the k blocks as follows: x is the number of t m ={b t m,1 ,b t m,2 ,…,b t m,k },x v n ={b v n,1 ,b v n,2 ,…,b v n,k },b t m,k Semantic features of the kth block of text as mth text mode information, b v n,k And the semantic features of the kth block of the nth image modality information.
S202: using function f t And g t Converting the segmented text semantic features into features representing subspaces:
Figure BDA0002496063800000091
Figure BDA0002496063800000092
wherein w t f And w t g Is f t And g t The parameter vector of (2).
S203: calculating attention parameters between the ith text semantic feature and the jth text semantic feature of the mth character modal information as follows:
Figure BDA0002496063800000093
s204: calculating an output characteristic expression of the ith block text semantic characteristic of the mth character modal information as follows:
Figure BDA0002496063800000094
wherein the content of the first and second substances,
Figure BDA0002496063800000095
w t h is h t The parameter vector of (2);
outputting the representation characteristics of the mth text mode information as follows: s t m ={o t m,1 ,o t m,2 ,...,o t m,k }。
S205: using function f v And g v Converting the segmented image semantic features into features representing subspaces:
Figure BDA0002496063800000096
Figure BDA0002496063800000097
wherein w v f And w v g Is f v And g v The parameter vector of (2);
s206: calculating attention parameters between the ith block image semantic feature and the jth block image semantic feature of the nth image modality information as follows:
Figure BDA0002496063800000098
s207: calculating the output characteristic expression of the ith block of image semantic characteristics of the nth image modality information as follows:
Figure BDA0002496063800000101
wherein the content of the first and second substances,
Figure BDA0002496063800000102
w v h is h v The parameter vector of (a);
the representation characteristics of the nth image modality information are as follows: s v n ={o v n,1 ,o v n,2 ,...,o v n,k }。
In the embodiment, the representation features for representing the semantics of the text modal information and the image modal information are extracted through a self-attention mechanism, and the evaluation dimensions are unified, so that the representation features of the data information among different modalities are in the same representation subspace, and the search of the cross-modality data information is realized. Wherein the function f t 、g t And h t Is a function f for converting each local feature of the original features of the text modal information into a representation subspace v 、g v And h v The method is used for converting each local feature of the original features of the image mode information into the same expression subspace so as to realize the unification of evaluation dimensions. Function f t 、g t And h t And a function f v 、g v And h v And the corresponding parameter vectors are iteratively updated under the supervision of the discriminator to obtain the optimal values in the process of counterstudy.
In step S103, as shown in fig. 2, optimization of the generator is achieved by providing a system in which the arbiter and the generator form a counterlearning. In particular, the invention aims to adjust the generator to make the representation characteristics generated based on the social media data information have the following effects through supervision of the discriminator: 1. minimizing a difference in distribution between the representative features and the corresponding topic labels even if the representative features of the respective data information are accurately associated to represent the topics thereof; 2. the relevance of the representation features of different modal data information of the same topic about the topic is maximized, even if the representation features of different modal data information of the same topic are converged in a semantic angle; 3. the difference of representation features of different modal data information about the modal is strengthened, and even the representation features of different modal data information under the same topic tend to be differentiated in the modal angle. In order to enable the expression features generated by the generator to achieve the above 3 effects, supervised learning needs to be performed by the discriminator, and specifically, the generation loss function obtained by weighted summation of the intra-modal semantic loss function and the inter-modal similarity loss function is implemented by joint action of the cross-modal discriminant loss function.
In some embodiments, an intra-modal semantic loss function is used to minimize the distribution difference between the representation features and the corresponding topic tags, the intra-modal semantic loss function being:
Figure BDA0002496063800000103
wherein, y t i And y v j Respectively representing the topic label vectors of the ith character modal information and the jth image modal information in the training sample set in the one-hot form under the same topic
Figure BDA0002496063800000104
Figure BDA0002496063800000105
The parameter set for the text modality generator is theta t The corresponding representation characteristic of the ith character mode information is used for judging whether the character mode is the true or false>
Figure BDA0002496063800000106
The original characteristics of the ith character modal information are obtained; />
Figure BDA0002496063800000107
The set of parameters for the image modality generator is collectively θ v The j-th image modality information is corresponding to the representation characteristics, device for combining or screening>
Figure BDA0002496063800000108
Original features of jth image modality information; m is the number of the character mode information in the training sample set, and N is the number of the image mode information in the training sample set; is a function->
Figure BDA00024960638000001112
For predicting topic probability distribution of each text or image representing features, processing the generated representation features into a probability distribution capable of being connected with y through a fully-connected neural network t i And/or y v j The dimension of the multiplication.
In some embodiments, an inter-modal similarity loss function is used to maximize the relevance of the representation features on a topic between different modal data information for the same topic, the inter-modal similarity loss function being:
Figure BDA0002496063800000111
/>
wherein, y t i And y v j Respectively representing the topic label vectors of the ith character modal information and the jth image modal information in the training sample set in the one-hot form under the same topic
Figure BDA0002496063800000112
Figure BDA0002496063800000113
The set of parameters for the text modality generator is theta t The corresponding representation characteristic of the ith character mode information is used for judging whether the character mode is the true or false>
Figure BDA0002496063800000114
The original characteristics of the ith character modal information are obtained; />
Figure BDA0002496063800000115
The set of parameters for the image modality generator is collectively θ v The representation characteristic corresponding to the jth image modality information>
Figure BDA0002496063800000116
Original features of jth image modality information; m is the number of the character mode information in the training sample set, and N is the number of the image mode information in the training sample set.
In some embodiments, the intra-modality semantic loss function and the inter-modality similarity loss function are summed in a weighted manner to obtain a resulting loss function. The resulting loss function is: l is generation =αL label +βL similarity And alpha and beta are respectively corresponding weight coefficients of a semantic loss function in the modes and a similarity loss function between the modes. In the present embodiment, the effect of supervision in the antagonistic learning is adjusted by setting a weight coefficient.
In some embodiments, the cross-modal discriminant loss function is used to enhance the distinction of representation features about the modalities between different modality data information under the same topic, and the cross-modal discriminant loss function is:
Figure BDA0002496063800000117
wherein, c e The modal label is in a form of searched target data information one-hot;
Figure BDA0002496063800000118
the set of parameters for the text modality generator is theta t The representation characteristic corresponding to the e-th character mode information is used for judging whether the E-th character mode information is matched with the E-th character mode information>
Figure BDA0002496063800000119
The original characteristics of the e-th character modal information are obtained; />
Figure BDA00024960638000001110
The set of parameters for the image modality generator is combined to θ v The representation characteristic corresponding to the e-th image modality information>
Figure BDA00024960638000001111
Is as followsOriginal characteristics of e pieces of image modality information; in the training process, character modal information and image modal information are input in pairs, and E is the number of data pairs; function->
Figure BDA00024960638000001113
In the parameter set theta p And converting the representation characteristics of each character mode information and each image mode information into the same representation subspace under the control.
In this embodiment, since the search process is divided into two cases, namely, a text search image (the searched target data information is image modality information) and an image search text (the searched target data information is text modality information), it is necessary to distinguish the modalities, that is, to discriminate the function of the loss function across the modalities.
In step S104, as shown in fig. 3, the generator is optimized by the fixed arbiter, and then the generator is optimized by the fixed arbiter, and multiple iterations are performed to obtain a better generator, so as to implement an effective and complete antagonistic learning process.
Specifically, in this embodiment, based on the difference between the minimum generation loss function and the cross-modal discriminant loss function, the optimized parameter set θ of the text modal generator is obtained t And a parameter set theta of an image modality generator v Namely:
Figure BDA0002496063800000121
wherein, the first and the second end of the pipe are connected with each other,
Figure BDA0002496063800000122
and &>
Figure BDA0002496063800000123
To an optimized theta t And theta v
Based on the difference between the maximum generation loss function and the cross-modal discriminant loss function, a parameter set theta of the optimized discriminant is obtained p Namely:
Figure BDA0002496063800000124
wherein the content of the first and second substances,
Figure BDA0002496063800000125
to the optimized theta p
On the other hand, the invention also provides a social media cross-modal data information searching method, as shown in fig. 4, including steps S301 to S303:
step S301: and inputting the data information to be searched into a generator to obtain the representation characteristics of the data information to be searched.
Wherein the generator is obtained by counterlearning based on the training samplers; the training sample set includes: social media data information of multiple modalities, and topics to which the data information belongs and corresponding modalities are used as tags; wherein, the data information of the plurality of modalities comprises: text modality information and image modality information; the generator comprises: the system comprises a character modal generator and an image modal generator, wherein the character modal generator and the image modal generator are used for acquiring original features of data information under corresponding modalities, dividing each original feature to acquire a plurality of corresponding local features, and acquiring representation features of the data information under each modality in the same representation subspace through a self-attention mechanism based on the local features; supervising the countermeasure generator by means of a discriminator, the penalty function employed by the discriminator comprising: a generating loss function obtained by weighting and summing the intra-modal semantic loss function and the inter-modal similarity loss function, and a cross-modal discriminant loss function; the distribution difference between the representation features and the corresponding topic labels is minimized by minimizing the calculation value of a semantic loss function in the modes, the correlation between the representation features of different mode data information under the same topic is maximized by minimizing the calculation value of a similarity loss function between the modes, and the difference of the representation features of the different mode data information about the modes is maximized by minimizing the calculation value of a cross-mode discriminant loss function; adjusting a parameter optimization generator by minimizing the difference between the calculated value of the generated loss function and the calculated value of the cross-mode discriminant loss function; adjusting a parameter and optimizing a discriminator by maximizing the difference between a calculated value of a generated loss function and a calculated value of a cross-mode discrimination loss function; and carrying out multiple iterations to obtain a final generator.
Step S302: and traversing the existing data information of the target modality, and acquiring the representation characteristics generated by the same generator of each existing data information.
Step S303: and acquiring the existing data information of one or more target modes which are most similar to the representation characteristics of the data information to be searched based on similarity matching.
Based on the same inventive concept as steps S101 to S104, in step S301 of this embodiment, a generator generated by the training method of the data representation feature generator in the social media cross-modality search is used to collect the representation features of the data information to be searched. In step S302, the existing data is traversed to obtain the representation characteristics of each existing data generated by the generator in step S301. In step S303, one or more pieces of recent existing data information of the target modality are obtained by the proximity matching search.
In some embodiments, in step S303, that is, obtaining the existing data information of one or more target modalities closest to the representation features of the data information to be searched based on similarity matching includes S3031 to S3032: :
s3031: based on the representation features of the data information to be searched and the representation features corresponding to the existing data information of the target modality, calculating an L2 norm of cross-modality matching as a similarity:
Figure BDA0002496063800000131
wherein the content of the first and second substances,
Figure BDA0002496063800000132
parameter set for text mode generator>
Figure BDA0002496063800000133
The corresponding representation characteristic of the ith character mode information is used for judging whether the character mode is the true or false>
Figure BDA0002496063800000134
The original characteristics of the ith character modal information are obtained; />
Figure BDA0002496063800000135
Parameter sets for an image modality generator are combined to ≥>
Figure BDA0002496063800000136
The representation characteristic corresponding to the jth image modality information>
Figure BDA0002496063800000137
Original features of jth image modality information; fixed->
Figure BDA0002496063800000138
Or>
Figure BDA0002496063800000139
One is the representation characteristics of the data information to be searched in the corresponding modality, and the other is the representation characteristics of each existing data in the target modality.
S3032: and sequencing the existing data information based on the similarity, and acquiring the existing data information of one or more target modes with the highest similarity with the data information to be searched.
In the present embodiment, when searching for image modality information based on text modality information, it is fixed
Figure BDA00024960638000001310
Traversing the image modality information in the existing data information for the representation characteristics of the data information to be searched, calculating the similarity based on a calculation formula (14), and arranging the image modality information in the existing data information based on the similarity to obtain one or more pieces of image modality information with the highest similarity. Similarly, when searching for text mode information based on image mode information, fix &>
Figure BDA00024960638000001311
As the number to be searchedTraversing the character modal information in the existing data information according to the representation characteristics of the information, calculating the similarity based on a calculation formula (14), and arranging the character modal information in the existing data information based on the similarity to obtain one or more pieces of character modal information with the highest similarity. Wherein, the smaller the sim, the higher the similarity.
In another aspect, the present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method.
In another aspect, the present invention also provides a computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the above-mentioned method.
In summary, the training and searching method for the data feature generator in the social media cross-modal search according to the present invention realizes the search between the social media cross-modal data information by the countercheck learning method, and emphasizes the cross-modal content search between the word modal information and the image modal information. The generator for counterlearning reconstructs original features of different modal data information in the social media based on a self-attention mechanism, and the original features are mapped into a representation subspace which can be directly compared, so that the search of cross-modal data information is realized. Further, a joint loss function is established through the discriminator, and the generated representation features are guided to be representation features following the corresponding modal semantic distribution by utilizing the intra-modal semantic loss function and the inter-modal similarity loss function. A loss function is discriminated across modes to achieve discrimination of modes. The method can adapt to the characteristic of sparse data information semantics in social media, complete accurate, efficient and stable search of cross-modal data information, and greatly improve the efficiency compared with the prior art.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein may be implemented as hardware, software, or combinations of both. Whether this is done in hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include an electronic circuit, a semiconductor memory device, a ROM, a flash memory, an Erasable ROM (EROM), a floppy disk, a CD-ROM, an optical disk, a hard disk, an optical fiber medium, a Radio Frequency (RF) link, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
It should also be noted that the exemplary embodiments noted in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed at the same time.
Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments in the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A training method for a data representation feature generator in a social media cross-modal search is characterized by comprising the following steps:
obtaining a training sample set, the training sample set comprising: social media data information of multiple modalities, and topics to which the data information belongs and corresponding modalities are used as tags; wherein, the data information of the plurality of modes comprises: text modality information and image modality information;
obtaining, with a generator, representative features of each data information based on the set of training samples, the generator including: the character modality generator and the image modality generator are used for acquiring original features of the data information in corresponding modalities, dividing each original feature to acquire a plurality of corresponding local features, and acquiring the representation features of the data information in the same representation subspace in each modality through a self-attention mechanism based on the local features;
supervising combating the generator by means of an arbiter, the arbiter employing a loss function comprising: a generating loss function obtained by weighting and summing the intra-modal semantic loss function and the inter-modal similarity loss function, and a cross-modal discriminant loss function; wherein the distribution difference between the representation features and corresponding topic labels is minimized by minimizing a calculated value of the intra-modal semantic loss function, the correlation between the representation features of different modal data information under the same topic is maximized by minimizing a calculated value of the inter-modal similarity loss function, and the distinction about modalities between the representation features of different modal data information is maximized by minimizing a calculated value of the cross-modal discriminant loss function;
adjusting parameters to optimize the generator by minimizing a difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function; adjusting parameters to optimize the discriminator by maximizing a difference between the calculated value of the generating loss function and the calculated value of the cross-modal discriminant loss function; iterating for multiple times to obtain a final generator;
the method for acquiring the original features of the data information in the corresponding modality comprises the following steps: obtainTaking TF-IDF characteristics of the character modal information as original characteristics of the character modal information, obtaining convolution characteristics of the image modal information as original characteristics of the image modal information, and recording the original characteristics X = { X = { of each data information t 1 ,x t 2 ,…,x t m ,x v 1 ,x v 2 ,…,x v n },x t m Is the original feature of the m-th text mode information, x v n M is more than or equal to 1 and less than or equal to M, N is more than or equal to 1 and less than or equal to N, and M and N are positive integers;
the method comprises the following steps of segmenting each original feature to obtain a plurality of corresponding local features, and acquiring the representation features of data information of each mode in the same representation subspace through a self-attention mechanism on the basis of the local features, wherein the method comprises the following steps:
dividing the TF-IDF characteristics of the character modal information and the convolution characteristics of the image modal information into k blocks respectively, and recording the k blocks as follows: x is the number of t m ={b t m,1 ,b t m,2 ,…,b t m,k },x v n ={b v n,1 ,b v n ,2 ,…,b v n,k },b t m,k Semantic features of the kth block of text as mth text mode information, b v n,k The k block of image semantic features of the n image modality information;
using function f t And g t Converting the segmented text semantic features into features representing subspaces:
Figure FDA0003990901150000021
wherein +>
Figure FDA0003990901150000022
And &>
Figure FDA0003990901150000023
Is f t And g t The parameter vector of (a);
attention parameters between the ith text semantic feature and the jth text semantic feature of the mth text modal information are as follows:
Figure FDA0003990901150000024
the output characteristic expression of the i block text semantic characteristic of the m character modal information is as follows:
Figure FDA0003990901150000025
wherein +>
Figure FDA0003990901150000026
w t h Is h t The parameter vector of (2);
the representation characteristics of the mth text modal information are as follows: s t m ={o t m,1 ,o t m,2 ,...,o t m,k };
Using function f v And g v Converting the segmented image semantic features into features representing subspaces:
Figure FDA0003990901150000027
wherein +>
Figure FDA0003990901150000028
And &>
Figure FDA0003990901150000029
Is f v And g v The parameter vector of (a);
the attention parameter between the image semantic feature of the ith block and the image semantic feature of the jth block of the nth image modality information is as follows:
Figure FDA00039909011500000210
the output characteristic expression of the ith block of image semantic characteristics of the nth image modality information is as follows:
Figure FDA00039909011500000211
wherein it is present>
Figure FDA00039909011500000212
w v h Is h v The parameter vector of (a);
the representation characteristic of the nth image modality information is: s. the v n ={o v n,1 ,o v n,2 ,...,o v n,k }。
2. The method for training a data representation feature generator in social media cross-modal search according to claim 1, wherein the intra-modal semantic loss function is:
Figure FDA00039909011500000213
wherein, the first and the second end of the pipe are connected with each other,
Figure FDA00039909011500000214
and &>
Figure FDA00039909011500000215
Respectively representing the ith text modal information and the jth image modal information one-hot type topic label vectors in the training sample set, and based on the same topic>
Figure FDA00039909011500000216
The parameter set for the text modality generator is theta t The corresponding representation characteristic of the ith character mode information is used for judging whether the character mode is the true or false>
Figure FDA0003990901150000031
The original characteristics of the ith character modal information are obtained; />
Figure FDA0003990901150000032
A set of parameters for the image modality generator is combined to θ v The representation characteristic corresponding to the jth image modality information>
Figure FDA0003990901150000033
Original features of jth image modality information; m is the number of the character mode information in the training sample set, and N is the number of the image mode information in the training sample set; function->
Figure FDA0003990901150000034
By>
Figure FDA0003990901150000035
Figure FDA0003990901150000036
And &>
Figure FDA0003990901150000037
Processed by fully connected neural network to be able to communicate with y t i And/or->
Figure FDA0003990901150000038
The dimension of the multiplication.
3. The method for training a data representation feature generator in social media cross-modal search according to claim 2, wherein the inter-modal similarity loss function is:
Figure FDA0003990901150000039
wherein, y t i And
Figure FDA00039909011500000310
respectively representing the ith character modal information and the jth image modal information one-hot type topic label vectors in the training sample set, and under the same topic ^ H>
Figure FDA00039909011500000311
The parameter set for the text modality generator is theta t The corresponding representation characteristic of the ith character mode information is used for judging whether the character mode is the true or false>
Figure FDA00039909011500000312
The original characteristics of the ith character modal information are obtained; />
Figure FDA00039909011500000313
A set of parameters for the image modality generator is combined to θ v The representation characteristic corresponding to the jth image modality information>
Figure FDA00039909011500000314
Original features of jth image modality information; m is the number of the character mode information in the training sample set, and N is the number of the image mode information in the training sample set;
the generation loss function is: l is generation =αL label +βL similarity And alpha and beta are respectively the weight coefficients of the intra-modal semantic loss function and the inter-modal similarity loss function.
4. The method for training a data representation feature generator in a cross-modal search of social media according to claim 3, wherein the cross-modal discriminant loss function is:
Figure FDA00039909011500000315
/>
wherein, c e The modal label is in a form of searched target data information one-hot;
Figure FDA00039909011500000316
the parameter set for the text modality generator is theta t The representation characteristic corresponding to the e-th character mode information is used for judging whether the E-th character mode information is matched with the E-th character mode information>
Figure FDA00039909011500000317
The original characteristics of the e-th character modal information are obtained; />
Figure FDA00039909011500000318
A set of parameters for the image modality generator is combined to θ v Characteristic corresponding to the mth image modality information>
Figure FDA00039909011500000319
Original features of the e-th image modality information; in the training process, character modal information and image modal information are input in pairs, and E is the number of data pairs; function>
Figure FDA00039909011500000320
In the parameter set theta p And converting the representation characteristics of each character mode information and each image mode information into the same representation subspace under the control.
5. A social media cross-modal data information search method is characterized by comprising the following steps:
inputting data information to be searched into a generator to obtain representation characteristics of the data information to be searched;
wherein the generator is trained by a training method of a data representation feature generator in a social media cross-modal search according to any one of claims 1 to 4 based on a training sampler; the training sample set includes: social media data information of multiple modals, and topics to which the data information belongs and corresponding modals are used as tags; wherein, the data information of the plurality of modes comprises: text modality information and image modality information; the generator includes: the character modality generator and the image modality generator are used for acquiring original features of the data information in corresponding modalities, dividing each original feature to acquire a plurality of corresponding local features, and acquiring the representation features of the data information in the same representation subspace in each modality through a self-attention mechanism based on the local features; supervising combating the generator by means of an arbiter, the arbiter employing a loss function comprising: a generating loss function obtained by weighting and summing the intra-modal semantic loss function and the inter-modal similarity loss function, and a cross-modal discriminant loss function; wherein the distribution difference between the representation features and corresponding topic tags is minimized by minimizing a calculated value of the intra-modal semantic loss function, the correlation between the representation features of different modal data information under the same topic is maximized by minimizing a calculated value of the inter-modal similarity loss function, and the distinction in terms of modality between the representation features of different modal data information is maximized by minimizing a calculated value of the cross-modal discriminant loss function; tuning and optimizing the generator by minimizing a difference between the calculated value of the generation loss function and the calculated value of the cross-modal discriminant loss function; adjusting parameters to optimize the discriminator by maximizing a difference between the calculated value of the generating loss function and the calculated value of the cross-modal discriminant loss function; iterating for multiple times to obtain a final generator;
traversing the existing data information of the target mode, and acquiring the representation characteristics of the existing data information generated by the generator;
and acquiring the existing data information of one or more target modes which are most similar to the representation characteristics of the data information to be searched based on similarity matching.
6. The method for searching the social media cross-modal data information according to claim 5, wherein the obtaining of the existing data information of one or more target modalities which are closest to the representation features of the data information to be searched based on similarity matching comprises:
based on the representation features of the data information to be searched and the representation features corresponding to the existing data information of the target modality, calculating an L2 norm of cross-modality matching as a similarity:
Figure FDA0003990901150000041
wherein the content of the first and second substances,
Figure FDA0003990901150000042
for the parameter set of the text modality generator to be>
Figure FDA0003990901150000043
The corresponding representation characteristic of the ith character mode information is used for judging whether the character mode is the true or false>
Figure FDA0003990901150000044
The original characteristics of the ith character modal information are obtained; />
Figure FDA0003990901150000045
Is combined as ^ er for the parameter set of the image modality generator>
Figure FDA0003990901150000051
The representation characteristic corresponding to the jth image modality information>
Figure FDA0003990901150000052
Original features of jth image modality information; fixed->
Figure FDA0003990901150000053
Or->
Figure FDA0003990901150000054
One is the representation characteristics of the data information to be searched in the corresponding mode, and the other is the representation characteristics of each existing data in the target mode;
and sequencing the existing data information based on the similarity, and acquiring the existing data information of one or more target modes with the highest similarity with the data information to be searched.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 6 are implemented when the processor executes the program.
8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN202010418678.7A 2020-05-18 2020-05-18 Training and searching method for data feature generator in social media cross-modal search Active CN111598712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010418678.7A CN111598712B (en) 2020-05-18 2020-05-18 Training and searching method for data feature generator in social media cross-modal search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010418678.7A CN111598712B (en) 2020-05-18 2020-05-18 Training and searching method for data feature generator in social media cross-modal search

Publications (2)

Publication Number Publication Date
CN111598712A CN111598712A (en) 2020-08-28
CN111598712B true CN111598712B (en) 2023-04-18

Family

ID=72192242

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010418678.7A Active CN111598712B (en) 2020-05-18 2020-05-18 Training and searching method for data feature generator in social media cross-modal search

Country Status (1)

Country Link
CN (1) CN111598712B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215837B (en) * 2020-10-26 2023-01-06 北京邮电大学 Multi-attribute image semantic analysis method and device
CN113420166A (en) * 2021-03-26 2021-09-21 阿里巴巴新加坡控股有限公司 Commodity mounting, retrieving, recommending and training processing method and device and electronic equipment
CN114091662B (en) * 2021-11-26 2024-05-14 广东伊莱特生活电器有限公司 Text image generation method and device and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299341A (en) * 2018-10-29 2019-02-01 山东师范大学 One kind confrontation cross-module state search method dictionary-based learning and system
CN110059157A (en) * 2019-03-18 2019-07-26 华南师范大学 A kind of picture and text cross-module state search method, system, device and storage medium
CN110222140A (en) * 2019-04-22 2019-09-10 中国科学院信息工程研究所 A kind of cross-module state search method based on confrontation study and asymmetric Hash

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11222415B2 (en) * 2018-04-26 2022-01-11 The Regents Of The University Of California Systems and methods for deep learning microscopy

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299341A (en) * 2018-10-29 2019-02-01 山东师范大学 One kind confrontation cross-module state search method dictionary-based learning and system
CN110059157A (en) * 2019-03-18 2019-07-26 华南师范大学 A kind of picture and text cross-module state search method, system, device and storage medium
CN110222140A (en) * 2019-04-22 2019-09-10 中国科学院信息工程研究所 A kind of cross-module state search method based on confrontation study and asymmetric Hash

Also Published As

Publication number Publication date
CN111598712A (en) 2020-08-28

Similar Documents

Publication Publication Date Title
CN111598712B (en) Training and searching method for data feature generator in social media cross-modal search
CN109543084B (en) Method for establishing detection model of hidden sensitive text facing network social media
CN104899253B (en) Towards the society image across modality images-label degree of correlation learning method
CN112800776B (en) Bidirectional GRU relation extraction data processing method, system, terminal and medium
JP4514082B2 (en) Method and apparatus for building a text classifier and text classifier
CN111914156B (en) Cross-modal retrieval method and system for self-adaptive label perception graph convolution network
US7827133B2 (en) Method and arrangement for SIM algorithm automatic charset detection
CN109831460B (en) Web attack detection method based on collaborative training
CN113255294B (en) Named entity recognition model training method, recognition method and device
CN111460824B (en) Unmarked named entity identification method based on anti-migration learning
CN113239214A (en) Cross-modal retrieval method, system and equipment based on supervised contrast
CN113688631B (en) Nested named entity identification method, system, computer and storage medium
CN111475603A (en) Enterprise identifier identification method and device, computer equipment and storage medium
CN113657425A (en) Multi-label image classification method based on multi-scale and cross-modal attention mechanism
CN114528827A (en) Text-oriented confrontation sample generation method, system, equipment and terminal
CN116527357A (en) Web attack detection method based on gate control converter
CN113761845A (en) Text generation method and device, storage medium and electronic equipment
CN112445862A (en) Internet of things equipment data set construction method and device, electronic equipment and storage medium
CN112487263A (en) Information processing method, system, equipment and computer readable storage medium
US11907307B1 (en) Method and system for event prediction via causal map generation and visualization
CN116305257A (en) Privacy information monitoring device and privacy information monitoring method
Steyn et al. A nearest neighbor open-set classifier based on excesses of distance ratios
Domazetoski et al. Using natural language processing to extract plant functional traits from unstructured text
Mady et al. Enhancing performance of biomedical named entity recognition
CN116561591B (en) Training method for semantic feature extraction model of scientific and technological literature, feature extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant