CN113537304A - Cross-modal semantic clustering method based on bidirectional CNN - Google Patents

Cross-modal semantic clustering method based on bidirectional CNN

Info

Publication number
CN113537304A
Authority
CN
China
Prior art keywords
loss
network
cross
training
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110718799.8A
Other languages
Chinese (zh)
Inventor
颜成钢
王超怡
孙垚棋
张继勇
李宗鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110718799.8A priority Critical patent/CN113537304A/en
Publication of CN113537304A publication Critical patent/CN113537304A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-modal semantic clustering method based on bidirectional CNN. The method first preprocesses the data and pre-trains the text samples of the training set; it then constructs a cross-modal retrieval network, trains the network on the training set, and computes the network's loss function; back propagation is performed, and the connection weights are optimized with the selected optimizer and its corresponding parameters; after multiple rounds of training, the final network model is obtained; finally, the trained model is tested on the test set and each evaluation metric is computed. By clustering semantic information, the method improves the accuracy and efficiency of cross-modal retrieval. The invention designs a loss between samples and cluster centers in the target space, a distribution-difference loss between the class distributions of different modalities, and a discriminant loss to aid semantic clustering, which strengthens both the discrimination between different categories and the correlation between different modalities.

Description

Cross-modal semantic clustering method based on bidirectional CNN
Technical Field
The invention relates to the field of computer vision, in particular to a cross-modal retrieval method based on deep learning.
Background
In the era of exploding new-media information, every new-media user can publish multimedia content of different modalities, such as pictures, music, videos, or text, anytime and anywhere. As the quantity and variety of multimedia information grow rapidly, it becomes difficult for a user to retrieve exactly the information he or she wants, and every retrieval is accompanied by other information of varying degrees of relevance. The data is not only enormous in quantity but also largely unlabeled, and a "heterogeneous gap" separates the data of different modalities. The main technical challenge of cross-modal retrieval is therefore to bridge this gap between the data of different modalities while preserving the precision and accuracy of the retrieved results.
The core of cross-modal retrieval technology is measuring the similarity between different data. Because of the heterogeneous gap, the key question is how to match information of different modalities. To date, most cross-modal retrieval methods map samples of different modalities into the same subspace. Methods can also be divided into unsupervised and supervised methods according to the information they use; supervised methods exploit the label information carried by the samples.
Although cross-modal retrieval methods share the idea of mapping into a common subspace, their efficiency and accuracy differ depending on the choice and arrangement of the loss functions. The invention designs a loss between samples and cluster centers in the target space, a distribution-difference loss between the class distributions of different modalities, and a discriminant loss to aid semantic clustering, which strengthens both the discrimination between different categories and the correlation between different modalities.
Disclosure of Invention
The invention provides a cross-modal semantic clustering method based on bidirectional CNN. The method can effectively improve the efficiency and accuracy of cross-modal retrieval.
The method uses two CNN structures in parallel: a deep CNN extracts the feature vectors of the image samples, and a shallow CNN extracts the feature vectors of the text samples using multi-kernel convolutions of different sizes.
Traditional label-based cross-modal retrieval considers only the content similarity between modalities; the invention instead proposes a cross-modal retrieval scheme with a novel semantic clustering mechanism. Since samples of the same category should follow a common distribution, the cluster centers of the target space are computed so that samples can be matched to their corresponding category distribution in the target space. The loss function is defined as the loss between samples and cluster centers in the target space, the distribution-difference loss of each category across modalities, and the discriminant loss.
The method specifically comprises the following steps:
step 1: and preprocessing the data, namely performing pre-training on the text samples of the training set.
The existing data set is divided into a training set and a testing set according to a set proportion, and text samples of the training set are pre-trained.
Step 2: and constructing a cross-modal retrieval network.
The cross-modal retrieval network is performed simultaneously using dual CNNs. And extracting the feature vector of the picture sample through a ResNet-50 network. For a text sample, Word2Vec is used for pre-training Word vectors, and then feature vectors of the text are extracted through TextCNN.
And step 3: the cross-modal search network is trained through a training set.
And 4, step 4: a loss function of the network is calculated. And performing back propagation, and optimizing the connection weight through the selected optimizer and the corresponding parameter. And training for multiple rounds to obtain a final network model.
An effective transition matrix $W \in \mathbb{R}^{D_S \times D_\tau}$ is found that projects the samples from the source space to the target space. After the samples are transferred, they are clustered in the target space around the cluster center of their category. The loss function is defined as the loss between samples and cluster centers in the target space, the distribution-difference loss of each category across modalities, and the discriminant loss. The sample/cluster-center loss in the target space learns a dimension-invariant matrix that minimizes the variance of the class distributions. The difference between the class distributions of different modalities is narrowed by minimizing the MMD (maximum mean discrepancy) of the class distributions. The discriminant loss is the label-prediction loss: a classifier is applied to predict the class labels of the samples in the common space.
And 5: and (3) testing the network model:
and testing the trained model through the test set, and calculating each evaluation index.
The specific method of step 2 is as follows:
The cross-modal retrieval network adopts a dual-CNN structure comprising a ResNet-50 network and a text CNN network (TextCNN). The two CNNs run in parallel: feature vectors of the image samples are extracted with the ResNet-50 network, while for the text samples, word vectors are pre-trained with Word2Vec and the feature vectors of the text are then extracted with TextCNN.
ResNet-50 is used to extract the information feature vector of each image sample, and common-representation learning is then performed to obtain the common representation of each image.
Word embedding converts the words of a text into numeric vectors. TextCNN comprises an embedding layer, a convolutional layer, a pooling layer, and a fully connected softmax layer. For each sentence, a two-dimensional sentence matrix is built from the word vectors; filters of different sizes are then convolved over it to obtain multiple features, which are max-pooled, concatenated, and finally classified through the softmax fully connected layer. Likewise, multiple fully connected layers are employed to learn the common representation of the text.
Let $U = [u_1, u_2, \ldots, u_n]$, $V = [v_1, v_2, \ldots, v_n]$ and $Y = [y_1, y_2, \ldots, y_n]$ denote the image representation matrix, the text representation matrix, and the label matrix of all instances, respectively, where $n$ is the number of categories. $W \in \mathbb{R}^{D_S \times D_\tau}$ denotes the transition matrix, and $H \in \mathbb{R}^{D_\tau \times D_\tau}$ denotes the dimension-invariant matrix used to optimize the loss function, where $D_S$ is the dimension of the source space and $D_\tau$ is the dimension of the target space.
The specific method in step 3 is as follows:
the bi-directional CNN network is trained through a training set, using the SGD optimizer, with a momentum of 0.9.
The specific method of step 4 is as follows:
The loss function is set as the combination of the loss between samples and cluster centers in the target space, the distribution-difference loss of the categories across modalities, and the discriminant loss. To reduce the overlap of the distributions of different categories in the target space, a dimension-invariant matrix is learned that reduces the variance of the class distributions, which effectively reduces both the loss of semantic information and the difficulty of dimension selection.
First, the target centers of the semantic clusters are computed; the $c$ cluster centers (one per class) are obtained by averaging the samples of each class:

$$c_j = \frac{1}{N_j} \sum_{i:\, y_i = j} x_i, \quad j = 1, \ldots, n, \qquad X_\tau = [c_1, c_2, \ldots, c_n]^{\top} \in \mathbb{R}^{n \times D_\tau}$$

where $X_\tau$ is the set of cluster centers, $N_j$ is the number of samples of class $j$, $N_0$ is the total number of samples, $D_\tau$ is the dimension of the target space, and $n$ is the number of classes.
From this, the loss between the samples and the cluster centers in the target space follows as:

$$\mathcal{L}_1 = \frac{1}{N_0} \sum_{i=1}^{N_0} \left\| x_i - c_{y_i} \right\|_2^2$$

where $\mathcal{L}_1$ denotes the loss between samples and cluster centers in the target space and $X = [x_1, \ldots, x_{N_0}]$ denotes the samples in the target space.
The difference between the class distributions of different modalities is narrowed by minimizing the MMD of the class distributions, i.e., by minimizing the squared maximum mean discrepancy between $X_S W$ and $X_\tau H$:

$$\mathcal{L}_2 = \left\| \tfrac{1}{N_0} \mathbf{1}^{\top} X_S W - \tfrac{1}{n} \mathbf{1}_n^{\top} X_\tau H \right\|_2^2$$

where $\mathbf{1}$ is an $N_0 \times 1$ all-ones vector ($\mathbf{1}_n$ its $n \times 1$ analogue) and $X_S$ denotes the samples of the source domain.
Finally, the prediction loss, i.e., the difference between the obtained result and the true value, is computed with the cross entropy:

$$\mathcal{L}_3 = -\frac{1}{N_0} \sum_{i=1}^{N_0} y_i \log p_{*,i}$$

where $p_{*,i}$ is the probability distribution generated for each image or text and $y_i$ is its true label value.
The final common loss function is therefore expressed as:

$$\min_{\theta} \; \mathcal{L}(\theta) = \mathcal{L}_1 + \lambda \mathcal{L}_2 + \mathcal{L}_3$$

where $\theta$ denotes the variables of the model to be optimized and $\lambda$ is a weight coefficient.
And 5: testing the network model;
and inputting the image texts of the test set into the trained model to obtain high-level semantic representation of the predicted image texts, and evaluating the model through the calculated average precision average (mAP). And finally, storing the trained model, testing through a test set pair, and calculating each evaluation index.
The invention has the following beneficial effects:
the method of the invention improves the accuracy and efficiency of cross-modal retrieval by clustering the semantic information. The invention designs the loss of a sample and a clustering center in a target space, the distribution difference loss of categories in different modes and the discrimination loss to help semantic clustering, thereby not only enhancing the identification capability among different categories, but also enhancing the correlation among different modes.
Drawings
FIG. 1 is a schematic structural diagram of a cross-modal search network;
Detailed Description
The objects and effects of the present invention will become more apparent from the following detailed description with reference to the accompanying drawings.
Step 1: and preprocessing the data, namely performing pre-training on the text samples of the training set.
The existing data set is divided into a training set and a testing set according to a set proportion, and text samples of the training set are pre-trained.
Step 2: and constructing a cross-modal retrieval network.
As shown in FIG. 1, the cross-modal search network adopts a two-layer CNN structure, including a ResNet-50 network and a text CNN network, i.e., TextCNN. The network structure is performed simultaneously by using double CNNs. And extracting the feature vector of the picture sample through a ResNet-50 network. For a text sample, Word2Vec is used for pre-training Word vectors, and then feature vectors of the text are extracted through TextCNN.
The main idea of ResNet-50 is to add shortcut connections to the network that allow the original input information to be passed directly to later layers, which alleviates to some extent the information loss and gradient explosion that make overly deep networks impossible to train. ResNet-50 is therefore adopted to extract the information feature vectors of the image samples, and common-representation learning is then performed to obtain the common representation of each image.
Word embedding converts the words of a text into numeric vectors. TextCNN comprises an embedding layer, a convolutional layer, a pooling layer, and a fully connected softmax layer. For each sentence, a two-dimensional sentence matrix is built from the word vectors; filters of different sizes are then convolved over it to obtain multiple features, which are max-pooled, concatenated, and finally classified through the softmax fully connected layer. Likewise, multiple fully connected layers are employed to learn the common representation of the text.
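For illustration, the two branches can be sketched in PyTorch as follows. The common-space dimension, filter sizes, and layer widths are assumptions of this sketch; the patent fixes only the overall structure (ResNet-50 backbone, Word2Vec-initialized embeddings, multi-size convolutions with max pooling, and fully connected common-representation layers).

import torch
import torch.nn as nn
import torchvision.models as models

class ImageBranch(nn.Module):
    """ResNet-50 backbone followed by fully connected common-representation layers."""
    def __init__(self, common_dim=256):
        super().__init__()
        backbone = models.resnet50(weights=None)  # pretrained weights optional
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop final fc
        self.common = nn.Sequential(nn.Linear(2048, 1024), nn.ReLU(),
                                    nn.Linear(1024, common_dim))
    def forward(self, x):                      # x: (B, 3, H, W)
        f = self.features(x).flatten(1)        # (B, 2048) image feature vector
        return self.common(f)                  # (B, common_dim) common representation

class TextBranch(nn.Module):
    """TextCNN: embedding -> multi-size convolutions -> max pooling -> fc layers."""
    def __init__(self, vocab_size, embed_dim=300, kernel_sizes=(3, 4, 5),
                 n_filters=100, common_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # initialize from Word2Vec in practice
        self.convs = nn.ModuleList(
            nn.Conv2d(1, n_filters, (k, embed_dim)) for k in kernel_sizes)
        self.common = nn.Sequential(nn.Linear(n_filters * len(kernel_sizes), 256),
                                    nn.ReLU(), nn.Linear(256, common_dim))
    def forward(self, tokens):                 # tokens: (B, L) word indices
        x = self.embed(tokens).unsqueeze(1)    # (B, 1, L, embed_dim) sentence matrix
        pooled = [torch.relu(conv(x)).squeeze(3).max(dim=2).values  # max-over-time pooling
                  for conv in self.convs]
        return self.common(torch.cat(pooled, dim=1))  # concatenate and project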
Let $U = [u_1, u_2, \ldots, u_n]$, $V = [v_1, v_2, \ldots, v_n]$ and $Y = [y_1, y_2, \ldots, y_n]$ denote the image representation matrix, the text representation matrix, and the label matrix of all instances, respectively, where $n$ is the number of categories. $W \in \mathbb{R}^{D_S \times D_\tau}$ denotes the transition matrix, and $H \in \mathbb{R}^{D_\tau \times D_\tau}$ denotes the dimension-invariant matrix used to optimize the loss function, where $D_S$ is the dimension of the source space and $D_\tau$ is the dimension of the target space.
And step 3: the bi-directional CNN network is trained through a training set, using the SGD optimizer, with a momentum of 0.9.
And 4, step 4: and constructing a loss function, calculating the error of each forward propagation, and updating the weight of the network through a back propagation algorithm.
The loss function is set as the combination of the loss between samples and cluster centers in the target space, the distribution-difference loss of the categories across modalities, and the discriminant loss. To reduce the overlap of the distributions of different categories in the target space, a dimension-invariant matrix is learned that reduces the variance of the class distributions, which effectively reduces both the loss of semantic information and the difficulty of dimension selection.
First, the target centers of the semantic clusters are computed. Samples sharing the same concept should follow a common distribution, so the $c$ cluster centers (one per class) are obtained by averaging the samples of each class:

$$c_j = \frac{1}{N_j} \sum_{i:\, y_i = j} x_i, \quad j = 1, \ldots, n, \qquad X_\tau = [c_1, c_2, \ldots, c_n]^{\top} \in \mathbb{R}^{n \times D_\tau}$$

where $X_\tau$ is the set of cluster centers, $N_j$ is the number of samples of class $j$, $N_0$ is the total number of samples, $D_\tau$ is the dimension of the target space, and $n$ is the number of classes.
From this, the loss between the samples and the cluster centers in the target space follows as:

$$\mathcal{L}_1 = \frac{1}{N_0} \sum_{i=1}^{N_0} \left\| x_i - c_{y_i} \right\|_2^2$$

where $\mathcal{L}_1$ denotes the loss between samples and cluster centers in the target space and $X = [x_1, \ldots, x_{N_0}]$ denotes the samples in the target space.
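A sketch of this center computation and sample-to-center loss, assuming the target-space samples are already stacked in a tensor z with integer class labels (all names hypothetical):

import torch

def center_loss(z, labels, n_classes):
    # z: (N0, D_tau) samples in the target space; labels: (N0,) class indices.
    # Class centers are the per-class means (assumes every class occurs in z).
    centers = torch.stack([z[labels == j].mean(dim=0) for j in range(n_classes)])
    # Mean squared distance between each sample and the center of its own class.
    return ((z - centers[labels]) ** 2).sum(dim=1).mean(), centers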
The distributions of samples of the same category but different modalities are not exactly the same, and the MMD can be used to construct a statistical test that determines whether two samples come from different distributions. The difference between the class distributions of different modalities is therefore narrowed by minimizing the MMD of the class distributions, i.e., by minimizing the squared maximum mean discrepancy between $X_S W$ and $X_\tau H$:

$$\mathcal{L}_2 = \left\| \tfrac{1}{N_0} \mathbf{1}^{\top} X_S W - \tfrac{1}{n} \mathbf{1}_n^{\top} X_\tau H \right\|_2^2$$

where $\mathbf{1}$ is an $N_0 \times 1$ all-ones vector ($\mathbf{1}_n$ its $n \times 1$ analogue) and $X_S$ denotes the samples of the source domain.
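Under a linear kernel, the squared MMD reduces to the squared distance between the two empirical means; a sketch under that assumption (names hypothetical):

import torch

def mmd_loss(xs_w, xtau_h):
    # xs_w: (N0, D_tau) projected source samples X_S W
    # xtau_h: (n, D_tau) projected cluster centers X_tau H
    # Squared distance between the two empirical means (linear-kernel MMD^2).
    return (xs_w.mean(dim=0) - xtau_h.mean(dim=0)).pow(2).sum()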
Finally, the prediction loss, i.e., the difference between the obtained result and the true value, is computed with the cross entropy:

$$\mathcal{L}_3 = -\frac{1}{N_0} \sum_{i=1}^{N_0} y_i \log p_{*,i}$$

where $p_{*,i}$ is the probability distribution generated for each image or text and $y_i$ is its true label value.
The final common loss function is therefore expressed as:

$$\min_{\theta} \; \mathcal{L}(\theta) = \mathcal{L}_1 + \lambda \mathcal{L}_2 + \mathcal{L}_3$$

where $\theta$ denotes the variables of the model to be optimized and $\lambda$ is a weight coefficient.
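One optimization step combining the three terms might then be sketched as follows; the weight lam and the forward-pass outputs (z, z_src, centers_h, logits, labels) are assumptions tied to the sketches above, not values fixed by the patent:

import torch.nn.functional as F

lam = 0.1                                                  # weight coefficient lambda; illustrative value
loss_c, centers = center_loss(z, labels, n_classes)        # sample/cluster-center loss
loss_mmd = mmd_loss(z_src, centers_h)                      # cross-modal MMD loss
loss_d = F.cross_entropy(logits, labels)                   # discriminant (label-prediction) loss
loss = loss_c + lam * loss_mmd + loss_d                    # combined objective
optimizer.zero_grad()
loss.backward()                                            # back propagation
optimizer.step()                                           # update connection weights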
And 5: testing the network model;
and inputting the image texts of the test set into the trained model to obtain high-level semantic representation of the predicted image texts, and evaluating the model through the calculated average precision average (mAP). And finally, storing the trained model, testing through a test set pair, and calculating each evaluation index.
The data set used in this embodiment is the Pascal Sentence data set. It consists of 1000 images divided into 20 categories, with 5 corresponding sentences per image. From each category, 40 image-text sample pairs were selected for training, 5 for testing, and 5 for validation.
The evaluation metric adopted in this embodiment is the mean average precision (mAP), which averages the retrieval precision over all queries.
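A minimal mAP sketch for cross-modal retrieval under cosine similarity; the function and variable names are assumptions of this sketch, not part of the patent:

import numpy as np

def mean_average_precision(query_feat, gallery_feat, query_lbl, gallery_lbl):
    # L2-normalize so the dot product equals cosine similarity.
    q = query_feat / np.linalg.norm(query_feat, axis=1, keepdims=True)
    g = gallery_feat / np.linalg.norm(gallery_feat, axis=1, keepdims=True)
    aps = []
    for i in range(len(q)):
        order = np.argsort(-(g @ q[i]))              # gallery ranked by similarity
        rel = gallery_lbl[order] == query_lbl[i]     # relevance at each rank
        if rel.any():
            prec = np.cumsum(rel) / (np.arange(len(rel)) + 1.0)
            aps.append((prec * rel).sum() / rel.sum())  # average precision of this query
    return float(np.mean(aps))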

Claims (5)

1. A cross-modal semantic clustering method based on bidirectional CNN is characterized by comprising the following steps:
step 1: preprocessing the data and pre-training the text samples of the training set;
dividing the existing data set into a training set and a test set according to a set proportion, and pre-training the text samples of the training set;
step 2: constructing a cross-modal retrieval network;
running the cross-modal retrieval network with two CNNs in parallel; extracting the feature vectors of the image samples through a ResNet-50 network; for the text samples, pre-training word vectors with Word2Vec and then extracting the feature vectors of the text through TextCNN;
step 3: training the cross-modal retrieval network through the training set;
step 4: calculating the loss function of the network; performing back propagation, and optimizing the connection weights through the selected optimizer and its corresponding parameters; training for multiple rounds to obtain the final network model;
finding an effective transition matrix $W \in \mathbb{R}^{D_S \times D_\tau}$ that projects the samples from the source space to the target space; after the samples are transferred, clustering them in the target space around the cluster center of their category; defining the loss function as the loss between samples and cluster centers in the target space, the distribution-difference loss of each category across modalities, and the discriminant loss; the loss between samples and cluster centers in the target space learns a dimension-invariant matrix so that the variance of the class distributions is minimized; the difference between the class distributions of different modalities is narrowed by minimizing the MMD of the class distributions; the discriminant loss is the label-prediction loss, wherein a classifier is applied to predict the class labels of the samples in the common space;
step 5: testing the network model:
testing the trained model through the test set, and calculating each evaluation metric.
2. The bi-directional CNN-based cross-modal semantic clustering method according to claim 1, wherein the specific method in step 2 is as follows:
the cross-modal retrieval network adopts a dual-CNN structure comprising a ResNet-50 network and a text CNN network (TextCNN); the two CNNs run in parallel; the feature vectors of the image samples are extracted through the ResNet-50 network; for the text samples, word vectors are pre-trained with Word2Vec, and the feature vectors of the text are then extracted through TextCNN;
ResNet-50 is adopted to extract the information feature vectors of the image samples, and common-representation learning is then performed to obtain the common representation of each image;
word embedding converts the words of a text into numeric vectors; TextCNN comprises an embedding layer, a convolutional layer, a pooling layer and a fully connected softmax layer; for each sentence, a two-dimensional sentence matrix is built from the word vectors, filters of different sizes are convolved over it to obtain multiple features, the features are max-pooled and concatenated, and classification is finally performed through the softmax fully connected layer; likewise, multiple fully connected layers are employed to learn the common representation of the text;
let $U = [u_1, u_2, \ldots, u_n]$, $V = [v_1, v_2, \ldots, v_n]$ and $Y = [y_1, y_2, \ldots, y_n]$ denote the image representation matrix, the text representation matrix and the label matrix of all instances, respectively, where $n$ is the number of categories; $W \in \mathbb{R}^{D_S \times D_\tau}$ denotes the transition matrix, and $H \in \mathbb{R}^{D_\tau \times D_\tau}$ denotes the dimension-invariant matrix used to optimize the loss function, where $D_S$ is the dimension of the source space and $D_\tau$ is the dimension of the target space.
3. The bi-directional CNN-based cross-modal semantic clustering method according to claim 2, wherein the specific method in step 3 is as follows:
the bi-directional CNN network is trained through a training set, using the SGD optimizer, with a momentum of 0.9.
4. The bi-directional CNN-based cross-modal semantic clustering method according to claim 3, wherein the specific method in step 4 is as follows:
setting the loss function as the combination of the loss between samples and cluster centers in the target space, the distribution-difference loss of the categories across modalities, and the discriminant loss; to reduce the overlap of the distributions of different categories in the target space, a dimension-invariant matrix is learned that reduces the variance of the class distributions, which effectively reduces both the loss of semantic information and the difficulty of dimension selection;
first, calculating the target centers of the semantic clusters; the $c$ cluster centers (one per class) are obtained by averaging the samples of each class:

$$c_j = \frac{1}{N_j} \sum_{i:\, y_i = j} x_i, \quad j = 1, \ldots, n, \qquad X_\tau = [c_1, c_2, \ldots, c_n]^{\top} \in \mathbb{R}^{n \times D_\tau}$$

where $X_\tau$ is the set of cluster centers, $N_j$ is the number of samples of class $j$, $N_0$ is the total number of samples, $D_\tau$ is the dimension of the target space, and $n$ is the number of classes;
from this, the loss between the samples and the cluster centers in the target space follows as:

$$\mathcal{L}_1 = \frac{1}{N_0} \sum_{i=1}^{N_0} \left\| x_i - c_{y_i} \right\|_2^2$$

where $\mathcal{L}_1$ denotes the loss between samples and cluster centers in the target space and $X = [x_1, \ldots, x_{N_0}]$ denotes the samples in the target space;
the difference between the class distributions of different modalities is narrowed by minimizing the MMD of the class distributions, i.e., by minimizing the squared maximum mean discrepancy between $X_S W$ and $X_\tau H$:

$$\mathcal{L}_2 = \left\| \tfrac{1}{N_0} \mathbf{1}^{\top} X_S W - \tfrac{1}{n} \mathbf{1}_n^{\top} X_\tau H \right\|_2^2$$

where $\mathbf{1}$ is an $N_0 \times 1$ all-ones vector ($\mathbf{1}_n$ its $n \times 1$ analogue) and $X_S$ denotes the samples of the source domain;
finally, the prediction loss, i.e., the difference between the obtained result and the true value, is computed with the cross entropy:

$$\mathcal{L}_3 = -\frac{1}{N_0} \sum_{i=1}^{N_0} y_i \log p_{*,i}$$

where $p_{*,i}$ is the probability distribution generated for each image or text and $y_i$ is its true label value;
the final common loss function is therefore expressed as:

$$\min_{\theta} \; \mathcal{L}(\theta) = \mathcal{L}_1 + \lambda \mathcal{L}_2 + \mathcal{L}_3$$

where $\theta$ denotes the variables of the model to be optimized and $\lambda$ is a weight coefficient.
5. The bi-directional CNN-based cross-modal semantic clustering method according to claim 4, wherein the specific method of step 5 is as follows:
inputting the image-text pairs of the test set into the trained model to obtain high-level semantic representations of the predicted image-text pairs, and evaluating the model by the computed mean average precision (mAP); finally, saving the trained model, testing it through the test set, and calculating each evaluation metric.
CN202110718799.8A 2021-06-28 2021-06-28 Cross-modal semantic clustering method based on bidirectional CNN Pending CN113537304A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110718799.8A CN113537304A (en) 2021-06-28 2021-06-28 Cross-modal semantic clustering method based on bidirectional CNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110718799.8A CN113537304A (en) 2021-06-28 2021-06-28 Cross-modal semantic clustering method based on bidirectional CNN

Publications (1)

Publication Number Publication Date
CN113537304A true CN113537304A (en) 2021-10-22

Family

ID=78125968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110718799.8A Pending CN113537304A (en) 2021-06-28 2021-06-28 Cross-modal semantic clustering method based on bidirectional CNN

Country Status (1)

Country Link
CN (1) CN113537304A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114925238A (en) * 2022-07-20 2022-08-19 山东大学 Video clip retrieval method and system based on federal learning
WO2023093574A1 (en) * 2021-11-25 2023-06-01 北京邮电大学 News event search method and system based on multi-level image-text semantic alignment model
CN116503675A (en) * 2023-06-27 2023-07-28 南京理工大学 Multi-category target identification method and system based on strong clustering loss function
CN116955699A (en) * 2023-07-18 2023-10-27 北京邮电大学 Video cross-mode search model training method, searching method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111562108A (en) * 2020-05-09 2020-08-21 浙江工业大学 Rolling bearing intelligent fault diagnosis method based on CNN and FCMC
CN112487822A (en) * 2020-11-04 2021-03-12 杭州电子科技大学 Cross-modal retrieval method based on deep learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111562108A (en) * 2020-05-09 2020-08-21 浙江工业大学 Rolling bearing intelligent fault diagnosis method based on CNN and FCMC
CN112487822A (en) * 2020-11-04 2021-03-12 杭州电子科技大学 Cross-modal retrieval method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIANG LI et al.: "Large scale image understanding with non-convex multi-task learning", IEEE Xplore, 27 November 2014 (2014-11-27), pages 1-6, XP032737024, DOI: 10.1109/GAMENETS.2014.7043721 *
梅子行: "智能风控 原理、算法与工程实践" (Intelligent Risk Control: Principles, Algorithms and Engineering Practice), Beijing: China Machine Press, 31 January 2020, pages 76-80 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023093574A1 (en) * 2021-11-25 2023-06-01 北京邮电大学 News event search method and system based on multi-level image-text semantic alignment model
CN114925238A (en) * 2022-07-20 2022-08-19 山东大学 Video clip retrieval method and system based on federal learning
CN114925238B (en) * 2022-07-20 2022-10-28 山东大学 Federal learning-based video clip retrieval method and system
CN116503675A (en) * 2023-06-27 2023-07-28 南京理工大学 Multi-category target identification method and system based on strong clustering loss function
CN116503675B (en) * 2023-06-27 2023-08-29 南京理工大学 Multi-category target identification method and system based on strong clustering loss function
CN116955699A (en) * 2023-07-18 2023-10-27 北京邮电大学 Video cross-mode search model training method, searching method and device
CN116955699B (en) * 2023-07-18 2024-04-26 北京邮电大学 Video cross-mode search model training method, searching method and device

Similar Documents

Publication Publication Date Title
JP7360497B2 (en) Cross-modal feature extraction method, extraction device, and program
CN107346328B (en) Cross-modal association learning method based on multi-granularity hierarchical network
CN110059217B (en) Image text cross-media retrieval method for two-stage network
CN112905822B (en) Deep supervision cross-modal counterwork learning method based on attention mechanism
CN104899253B (en) Towards the society image across modality images-label degree of correlation learning method
CN112800292B (en) Cross-modal retrieval method based on modal specific and shared feature learning
CN110647904B (en) Cross-modal retrieval method and system based on unmarked data migration
CN106202256B (en) Web image retrieval method based on semantic propagation and mixed multi-instance learning
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN113537304A (en) Cross-modal semantic clustering method based on bidirectional CNN
CN112487822A (en) Cross-modal retrieval method based on deep learning
CN109858015B (en) Semantic similarity calculation method and device based on CTW (computational cost) and KM (K-value) algorithm
CN111125411B (en) Large-scale image retrieval method for deep strong correlation hash learning
CN109784405B (en) Cross-modal retrieval method and system based on pseudo-tag learning and semantic consistency
CN114298122B (en) Data classification method, apparatus, device, storage medium and computer program product
CN112100346A (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN110309268A (en) A kind of cross-language information retrieval method based on concept map
CN113821670B (en) Image retrieval method, device, equipment and computer readable storage medium
CN117273134A (en) Zero-sample knowledge graph completion method based on pre-training language model
CN113535949B (en) Multi-modal combined event detection method based on pictures and sentences
CN108470025A (en) Partial-Topic probability generates regularization own coding text and is embedded in representation method
CN115309930A (en) Cross-modal retrieval method and system based on semantic identification
CN117891939A (en) Text classification method combining particle swarm algorithm with CNN convolutional neural network
CN114239730B (en) Cross-modal retrieval method based on neighbor ordering relation
CN112182275A (en) Trademark approximate retrieval system and method based on multi-dimensional feature fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination