CN113537304A - Cross-modal semantic clustering method based on bidirectional CNN - Google Patents
- Publication number
- CN113537304A
- Authority
- CN
- China
- Prior art keywords
- loss
- network
- cross
- training
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Molecular Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a cross-modal semantic clustering method based on bidirectional CNN. The method first preprocesses the data and pre-trains on the text samples of the training set; it then constructs a cross-modal retrieval network, trains the network on the training set, and calculates the loss function of the network; back propagation is performed and the connection weights are optimized with the selected optimizer and its parameters; training for multiple rounds yields the final network model; finally, the trained model is tested on the test set and each evaluation index is calculated. By clustering semantic information, the method improves the accuracy and efficiency of cross-modal retrieval. The invention designs three losses to support semantic clustering: the loss between samples and cluster centers in the target space, the distribution difference loss of categories across modalities, and the discriminant loss. Together these enhance both the discrimination between categories and the correlation between modalities.
Description
Technical Field
The invention relates to the field of computer vision, in particular to a cross-modal retrieval method based on deep learning.
Background
In the era of the new-media information explosion, every user can publish multimedia information in different modalities, such as pictures, music, videos or text, anytime and anywhere. As the quantity and variety of multimedia information grow, it becomes difficult for users to find exactly the information they want, and retrieved results are always accompanied by other information of varying relevance. The data is not only huge in quantity but also mostly unlabeled, and a "heterogeneous gap" exists between data of different modalities. The main technical problem of cross-modal retrieval is therefore to bridge this gap between modalities so as to improve the precision and accuracy of the retrieved data.
The core of cross-modal retrieval technology is measuring the similarity between different data. Because of the "heterogeneous gap", the key to cross-modal retrieval is how to match information from different modalities. So far, most cross-modal retrieval methods map samples of different modalities into a common subspace. Methods can also be divided into unsupervised and supervised methods according to the information they use; supervised methods use the label information carried by the samples.
Although these cross-modal retrieval methods all map to a common subspace, their efficiency and accuracy differ depending on how the loss function is chosen and arranged. In the invention, the loss between samples and cluster centers in the target space, the distribution difference loss of categories across modalities, and the discriminant loss are designed to support semantic clustering, which enhances both the discrimination between categories and the correlation between modalities.
Disclosure of Invention
The invention provides a cross-modal semantic clustering method based on bidirectional CNN. The method can effectively improve the efficiency and accuracy of cross-modal retrieval.
The method uses two CNN structures: a deep CNN extracts feature vectors from picture samples, and a shallow CNN extracts feature vectors from text samples using convolution kernels of several different sizes.
Traditional cross-modal retrieval using label information considers only the content similarity between modalities; the invention provides a cross-modal retrieval approach together with a novel semantic clustering approach. Since samples of the same category should share a uniform distribution, the cluster centers of the target space are calculated so that samples correspond to their category distribution in the target space. The loss function is defined as the combination of the loss between samples and cluster centers in the target space, the distribution difference loss of categories across modalities, and the discriminant loss.
The method specifically comprises the following steps:
Step 1: Preprocess the data and pre-train on the text samples of the training set.
The existing data set is divided into a training set and a test set according to a set proportion, and the text samples of the training set are used for pre-training.
Step 2: Construct the cross-modal retrieval network.
The cross-modal retrieval network runs two CNNs in parallel. Feature vectors of picture samples are extracted by a ResNet-50 network. For text samples, Word2Vec is used to pre-train word vectors, and TextCNN then extracts the feature vectors of the text.
Step 3: Train the cross-modal retrieval network on the training set.
Step 4: Calculate the loss function of the network. Perform back propagation and optimize the connection weights with the selected optimizer and its parameters. Train for multiple rounds to obtain the final network model.
An effective transition matrix is found that projects samples from the source space to the target space. After projection, each sample is clustered in the target space around the cluster center of its category. The loss function is defined as the combination of the loss between samples and cluster centers in the target space, the distribution difference loss of categories across modalities, and the discriminant loss. The loss between samples and cluster centers in the target space learns a dimension-invariant matrix that minimizes the variance of the category distribution. The category distribution difference between modalities is narrowed by minimizing the maximum mean discrepancy (MMD) of the category distributions. The discriminant loss is the label prediction loss: a classifier is applied to predict the class labels of samples in the common space.
Step 5: Test the network model.
The trained model is tested on the test set, and each evaluation index is calculated.
The specific method of step 2 is as follows:
The cross-modal retrieval network adopts a dual-CNN structure comprising a ResNet-50 network and a text CNN network, namely TextCNN. The two CNNs run in parallel: feature vectors of picture samples are extracted by the ResNet-50 network; for text samples, Word2Vec is used to pre-train word vectors, and TextCNN then extracts the feature vectors of the text.
ResNet-50 extracts the feature vectors of the picture samples, and common representation learning is then carried out to obtain a common representation of each picture.
Word embedding is a method of converting the words in a text into numeric vectors. TextCNN comprises an embedding layer, a convolution layer, a pooling layer and a fully connected softmax layer. For each sentence, a two-dimensional sentence matrix is built from the word vectors; filters of different sizes are convolved over it to obtain multiple features; max pooling is applied and the pooled features are concatenated; finally, classification is performed by the fully connected softmax layer. In addition, multiple fully connected layers are used to learn a common representation of the text.
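The convolution, max-pooling and concatenation stages described above can be sketched as follows. This is a minimal NumPy illustration and not the patented network: the function name `text_cnn_features`, the toy sizes, the random filters and the ReLU activation are all assumptions made for the example, and the final softmax layer is omitted.

```python
import numpy as np

def text_cnn_features(sentence, filters):
    """Convolve a sentence matrix with filters of several heights, then max-pool.

    sentence: (num_words, embed_dim) matrix of word vectors.
    filters:  list of arrays, each of shape (kernel_height, embed_dim).
    Returns one pooled feature per filter, concatenated into a vector.
    """
    num_words, _ = sentence.shape
    pooled = []
    for kernel in filters:
        height = kernel.shape[0]
        # Slide the kernel over every window of `height` consecutive words.
        responses = [
            np.sum(sentence[start:start + height] * kernel)
            for start in range(num_words - height + 1)
        ]
        # ReLU (assumed activation) followed by max-over-time pooling.
        pooled.append(max(0.0, max(responses)))
    return np.array(pooled)

rng = np.random.default_rng(0)
sentence = rng.normal(size=(7, 4))                      # 7 words, 4-dim embeddings (toy sizes)
filters = [rng.normal(size=(h, 4)) for h in (2, 3, 4)]  # three kernel heights
features = text_cnn_features(sentence, filters)
print(features.shape)  # (3,)
```

Each filter height looks at a different number of neighbouring words, which is why filters of several sizes are combined before classification.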
Let U = [u_1, u_2, …, u_n], V = [v_1, v_2, …, v_n] and Y = [y_1, y_2, …, y_n] denote the image representation matrix, the text representation matrix and the label matrix of all instances, respectively, where n is the number of categories. W denotes the transition matrix and H denotes a dimension-invariant matrix used to optimize the loss function; their sizes are expressed in terms of D_S, the dimension of the source space, and D_τ, the dimension of the target space.
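The roles of the two matrices can be illustrated with toy shapes. The sketch below assumes that the transition matrix W has shape D_S x D_τ and that the dimension-invariant matrix H is square (D_τ x D_τ); these shapes are not preserved in this text and are assumptions, as are the variable names and the random values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_classes, D_S, D_tau = 5, 2, 6, 3          # toy sizes, assumed for illustration

X_S = rng.normal(size=(N, D_S))                # source-space samples, one per row
X_tau = rng.normal(size=(n_classes, D_tau))    # target-space cluster centers
W = rng.normal(size=(D_S, D_tau))              # transition matrix (assumed shape)
H = rng.normal(size=(D_tau, D_tau))            # dimension-invariant matrix (assumed square)

projected = X_S @ W     # samples mapped from the source space into the target space
adjusted = X_tau @ H    # centers transformed without changing their dimension

print(projected.shape, adjusted.shape)  # (5, 3) (2, 3)
```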
The specific method of step 3 is as follows:
The bidirectional CNN network is trained on the training set using the SGD optimizer with a momentum of 0.9.
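For reference, a classical SGD-with-momentum update can be sketched as below. Only the momentum value 0.9 comes from the description; the learning rate, the function name `sgd_momentum_step` and the toy quadratic objective are assumptions chosen for illustration.

```python
import numpy as np

def sgd_momentum_step(weights, grad, velocity, lr=0.01, momentum=0.9):
    """One classical momentum update: accumulate a velocity, then move the weights."""
    velocity = momentum * velocity - lr * grad
    return weights + velocity, velocity

w_start = np.array([1.0, -2.0])
w = w_start.copy()
v = np.zeros_like(w)
for _ in range(3):
    grad = 2 * w                      # gradient of the toy objective f(w) = ||w||^2
    w, v = sgd_momentum_step(w, grad, v)

print(np.linalg.norm(w) < np.linalg.norm(w_start))  # True
```

Deep learning frameworks may scale the velocity slightly differently, but with a constant learning rate the resulting trajectory is equivalent.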
The specific method of step 4 is as follows:
The loss function is a combination of the loss between samples and cluster centers in the target space, the distribution difference loss of categories across modalities, and the discriminant loss. To reduce the overlap between the distributions of different categories in the target space, a dimension-invariant matrix is learned that reduces the variance of the category distributions; this effectively reduces both the loss of semantic information and the difficulty of choosing a dimension.
First, the target centers of the semantic clusters are calculated: c cluster centers are obtained by averaging the samples of each class:
[equation not reproduced]
where X_τ is the set of cluster centers, N_0 is the number of samples, D_τ is the dimension of the target space, and n is the number of classes.
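Computing cluster centers by averaging the samples of each class can be sketched as follows. The function name `cluster_centers` and the toy data are hypothetical, and the sketch assumes every class has at least one sample.

```python
import numpy as np

def cluster_centers(X, labels, num_classes):
    """Return one center per class: the mean of that class's samples."""
    centers = np.zeros((num_classes, X.shape[1]))
    for c in range(num_classes):
        centers[c] = X[labels == c].mean(axis=0)
    return centers

X = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0], [6.0, 2.0]])  # toy target-space samples
labels = np.array([0, 0, 1, 1])
centers = cluster_centers(X, labels, num_classes=2)
print(centers)  # class 0 center is [1, 1], class 1 center is [5, 1]
```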
The loss between samples and cluster centers in the target space is then as follows:
[equation not reproduced]
where the quantity defined is the loss between samples and cluster centers in the target space, and X denotes the samples in the target space.
The category distribution difference between modalities is narrowed by minimizing the maximum mean discrepancy (MMD) of the category distributions, that is, by minimizing the square of the maximum mean discrepancy between X_S W and X_τ H:
[equation not reproduced]
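A squared MMD between two sample sets can be sketched as below. The kernel is not specified in this text, so the sketch assumes a linear kernel, under which the squared MMD reduces to the squared distance between the two sample means; the function name `mmd2_linear` and the toy data are hypothetical.

```python
import numpy as np

def mmd2_linear(A, B):
    """Squared MMD with a linear kernel: squared distance between sample means."""
    diff = A.mean(axis=0) - B.mean(axis=0)
    return float(diff @ diff)

A = np.array([[0.0, 0.0], [2.0, 0.0]])   # toy samples from one distribution
B = np.array([[1.0, 3.0], [1.0, 5.0]])   # toy samples from another distribution
print(mmd2_linear(A, B))  # 16.0
print(mmd2_linear(A, A))  # 0.0
```

A value of zero means the two sets have the same mean; richer kernels (for example Gaussian) also compare higher-order statistics.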
Finally, the prediction loss, that is, the difference between the predicted result and the true value, is calculated using cross entropy:
[equation not reproduced]
where p_{*,i} is the probability distribution generated for each image or text, and y_i is its true label value.
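A cross-entropy prediction loss over predicted class probabilities can be sketched as follows. Averaging over samples, the small constant `eps` and the function name `cross_entropy` are assumptions made for this example.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-probability assigned to the true class of each sample."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + eps)))

probs = np.array([[0.5, 0.5], [0.25, 0.75]])   # toy predicted distributions
labels = np.array([0, 1])                      # true class indices
loss = cross_entropy(probs, labels)
print(round(loss, 4))  # 0.4904
```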
The final overall loss function is therefore expressed as:
[equation not reproduced]
where θ denotes the model variables to be optimized and λ is a weight coefficient.
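As a hedged illustration only, one way the three losses could be combined is written out below. The symbols L_1 (loss between samples and cluster centers), L_2 (distribution difference loss) and L_3 (discriminant loss), and the placement of λ on the second term, are assumptions made for this sketch; the original formula is not preserved in this text.

```latex
\min_{\theta}\; \mathcal{L}(\theta) = \mathcal{L}_1 + \lambda\,\mathcal{L}_2 + \mathcal{L}_3
```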
Step 5: Test the network model.
The image-text pairs of the test set are input into the trained model to obtain the predicted high-level semantic representations of the images and texts, and the model is evaluated by the calculated mean average precision (mAP). Finally, the trained model is saved and tested on the test set, and each evaluation index is calculated.
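Mean average precision for ranked retrieval results can be sketched as below. The function names and the toy relevance lists are hypothetical; each list marks, in ranked order, whether a retrieved item shares the query's category (1) or not (0).

```python
import numpy as np

def average_precision(relevant):
    """AP for one query; `relevant` is a 0/1 sequence over the ranked results."""
    relevant = np.asarray(relevant, dtype=float)
    if relevant.sum() == 0:
        return 0.0
    ranks = np.arange(1, len(relevant) + 1)
    precision_at_k = np.cumsum(relevant) / ranks
    # Average the precision values at the positions of the relevant items.
    return float((precision_at_k * relevant).sum() / relevant.sum())

def mean_average_precision(ranked_relevance):
    """Mean of the per-query average precision values."""
    return float(np.mean([average_precision(r) for r in ranked_relevance]))

queries = [[1, 0, 1, 0], [0, 1, 0, 0]]   # toy rankings for two queries
score = mean_average_precision(queries)
print(round(score, 4))  # 0.6667
```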
The invention has the following beneficial effects:
By clustering semantic information, the method improves the accuracy and efficiency of cross-modal retrieval. The invention designs three losses to support semantic clustering: the loss between samples and cluster centers in the target space, the distribution difference loss of categories across modalities, and the discriminant loss. Together these enhance both the discrimination between categories and the correlation between modalities.
Drawings
FIG. 1 is a schematic structural diagram of the cross-modal retrieval network.
Detailed Description
The objects and effects of the present invention will become more apparent from the following detailed description of the present invention with reference to the accompanying drawings.
Step 1: Preprocess the data and pre-train on the text samples of the training set.
The existing data set is divided into a training set and a test set according to a set proportion, and the text samples of the training set are used for pre-training.
Step 2: Construct the cross-modal retrieval network.
As shown in FIG. 1, the cross-modal retrieval network adopts a dual-CNN structure comprising a ResNet-50 network and a text CNN network, namely TextCNN. The two CNNs run in parallel: feature vectors of picture samples are extracted by the ResNet-50 network; for text samples, Word2Vec is used to pre-train word vectors, and TextCNN then extracts the feature vectors of the text.
The main idea of ResNet-50 is to add direct (shortcut) connections to the network so that the original input information can be passed directly to later layers. To a certain extent this alleviates the information loss, gradient explosion and resulting inability to compute that occur when a network is too deep. ResNet-50 is therefore used to extract the feature vectors of the picture samples, and common representation learning is then carried out to obtain a common representation of each picture.
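The shortcut idea can be illustrated with a deliberately simplified residual block. This is not the ResNet-50 architecture, which uses convolutional bottleneck blocks with batch normalization; the dense layers, the function name `residual_block` and the toy weights are assumptions for the sketch.

```python
import numpy as np

def residual_block(x, W1, W2):
    """Two dense layers whose output is added to the unchanged input."""
    hidden = np.maximum(0.0, x @ W1)           # first layer with ReLU
    return np.maximum(0.0, hidden @ W2 + x)    # shortcut adds the input back

x = np.array([1.0, 2.0, 3.0])
W_zero = np.zeros((3, 3))
out = residual_block(x, W_zero, W_zero)
print(out)  # [1. 2. 3.]
```

With all weights at zero the block still passes its input through unchanged, which shows how the shortcut keeps the original information available to later layers.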
Word embedding is a method of converting the words in a text into numeric vectors. TextCNN comprises an embedding layer, a convolution layer, a pooling layer and a fully connected softmax layer. For each sentence, a two-dimensional sentence matrix is built from the word vectors; filters of different sizes are convolved over it to obtain multiple features; max pooling is applied and the pooled features are concatenated; finally, classification is performed by the fully connected softmax layer. In addition, multiple fully connected layers are used to learn a common representation of the text.
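Building the two-dimensional sentence matrix from word vectors can be sketched as follows. The tiny embedding table stands in for vectors that would come from Word2Vec pre-training; the function name `sentence_matrix`, the zero vector for unknown words and the zero padding to a fixed length are assumptions made for the example.

```python
import numpy as np

def sentence_matrix(tokens, embeddings, max_len, dim):
    """Stack word vectors row by row, padding with zero rows up to max_len."""
    rows = [embeddings.get(t, np.zeros(dim)) for t in tokens[:max_len]]
    rows += [np.zeros(dim)] * (max_len - len(rows))
    return np.stack(rows)

# Toy 2-dimensional vectors; real vectors would be learned by Word2Vec.
embeddings = {"a": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
M = sentence_matrix(["a", "dog", "runs"], embeddings, max_len=5, dim=2)
print(M.shape)  # (5, 2)
```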
Let U = [u_1, u_2, …, u_n], V = [v_1, v_2, …, v_n] and Y = [y_1, y_2, …, y_n] denote the image representation matrix, the text representation matrix and the label matrix of all instances, respectively, where n is the number of categories. W denotes the transition matrix and H denotes a dimension-invariant matrix used to optimize the loss function; their sizes are expressed in terms of D_S, the dimension of the source space, and D_τ, the dimension of the target space.
Step 3: The bidirectional CNN network is trained on the training set using the SGD optimizer with a momentum of 0.9.
Step 4: Construct the loss function, calculate the error of each forward propagation, and update the weights of the network through the back propagation algorithm.
The loss function is a combination of the loss between samples and cluster centers in the target space, the distribution difference loss of categories across modalities, and the discriminant loss. To reduce the overlap between the distributions of different categories in the target space, a dimension-invariant matrix is learned that reduces the variance of the category distributions; this effectively reduces both the loss of semantic information and the difficulty of choosing a dimension.
First, the target centers of the semantic clusters are calculated. Samples of the same concept should have a uniform distribution, so c cluster centers are obtained by averaging the samples of each class:
[equation not reproduced]
where X_τ is the set of cluster centers, N_0 is the number of samples, D_τ is the dimension of the target space, and n is the number of classes.
The loss between samples and cluster centers in the target space is then as follows:
[equation not reproduced]
where the quantity defined is the loss between samples and cluster centers in the target space, and X denotes the samples in the target space.
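A simplified form of a sample-to-center loss can be sketched as below. It assumes the loss is the mean squared Euclidean distance between each target-space sample and the center of its class, and it leaves out the dimension-invariant matrix mentioned in the description; the function name `center_loss` and the toy data are hypothetical.

```python
import numpy as np

def center_loss(X, labels, centers):
    """Mean squared distance between each sample and its class center."""
    diffs = X - centers[labels]          # pick the matching center for every sample
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

X = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0], [6.0, 2.0]])  # toy target-space samples
labels = np.array([0, 0, 1, 1])
centers = np.array([[1.0, 1.0], [5.0, 1.0]])                    # class means of X
print(center_loss(X, labels, centers))  # 2.0
```

Driving this value down pulls samples of the same class toward a shared center, which is the clustering effect the description attributes to this loss.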
The distributions of samples of the same category but different modalities are not exactly the same, and the maximum mean discrepancy (MMD) can be used to construct a statistical test of whether two samples come from different distributions. The category distribution difference between modalities is therefore reduced by minimizing the MMD of the category distributions, that is, by minimizing the square of the maximum mean discrepancy between X_S W and X_τ H:
[equation not reproduced]
Finally, the prediction loss, that is, the difference between the predicted result and the true value, is calculated using cross entropy:
[equation not reproduced]
where p_{*,i} is the probability distribution generated for each image or text, and y_i is its true label value.
The final overall loss function is therefore expressed as:
[equation not reproduced]
where θ denotes the model variables to be optimized and λ is a weight coefficient.
Step 5: Test the network model.
The image-text pairs of the test set are input into the trained model to obtain the predicted high-level semantic representations of the images and texts, and the model is evaluated by the calculated mean average precision (mAP). Finally, the trained model is saved and tested on the test set, and each evaluation index is calculated.
The data set used in this embodiment is the Pascal Sentence data set. It consists of 1000 images divided into 20 categories, and each image has 5 corresponding sentences. From each category, 40 image-text pairs were selected for training, 5 for testing and 5 for validation.
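A split of this kind can be sketched as follows, assuming the counts 40, 5 and 5 apply within each of the 20 categories (50 pairs per category). The function name `split_per_category`, the random seed and the synthetic label array are hypothetical.

```python
import numpy as np

def split_per_category(labels, n_train=40, n_test=5, n_val=5, seed=0):
    """Shuffle the indices of each category and cut them into train/test/validation."""
    rng = np.random.default_rng(seed)
    train, test, val = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train.extend(idx[:n_train])
        test.extend(idx[n_train:n_train + n_test])
        val.extend(idx[n_train + n_test:n_train + n_test + n_val])
    return np.array(train), np.array(test), np.array(val)

labels = np.repeat(np.arange(20), 50)    # 20 categories x 50 pairs = 1000 pairs
train, test, val = split_per_category(labels)
print(len(train), len(test), len(val))  # 800 100 100
```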
The evaluation index adopted in this embodiment is the mean average precision (mAP), a performance metric for algorithms that predict target positions and categories.
Claims (5)
1. A cross-modal semantic clustering method based on bidirectional CNN, characterized by comprising the following steps:
step 1: preprocessing the data and pre-training on the text samples of the training set;
dividing the existing data set into a training set and a test set according to a set proportion, and using the text samples of the training set for pre-training;
step 2: constructing a cross-modal retrieval network;
running two CNNs in parallel as the cross-modal retrieval network; extracting feature vectors of picture samples through a ResNet-50 network; for text samples, pre-training word vectors with Word2Vec and then extracting feature vectors of the text through TextCNN;
step 3: training the cross-modal retrieval network on the training set;
step 4: calculating a loss function of the network; performing back propagation and optimizing the connection weights with the selected optimizer and its parameters; training for multiple rounds to obtain a final network model;
finding an effective transition matrix that projects samples from a source space to a target space; after projection, clustering each sample in the target space around the cluster center of its category; defining the loss function as the combination of the loss between samples and cluster centers in the target space, the distribution difference loss of categories across modalities, and the discriminant loss; the loss between samples and cluster centers in the target space learns a dimension-invariant matrix that minimizes the variance of the category distribution; the category distribution difference between modalities is narrowed by minimizing the maximum mean discrepancy (MMD) of the category distributions; the discriminant loss is the label prediction loss, for which a classifier is applied to predict the class labels of samples in the common space;
step 5: testing the network model;
testing the trained model on the test set and calculating each evaluation index.
2. The bidirectional CNN-based cross-modal semantic clustering method according to claim 1, wherein the specific method of step 2 is as follows:
the cross-modal retrieval network adopts a dual-CNN structure comprising a ResNet-50 network and a text CNN network, namely TextCNN; the two CNNs run in parallel; feature vectors of picture samples are extracted through the ResNet-50 network; for text samples, Word2Vec is used to pre-train word vectors, and feature vectors of the text are then extracted through TextCNN;
ResNet-50 extracts the feature vectors of the picture samples, and common representation learning is then carried out to obtain a common representation of each picture;
word embedding is a method of converting the words in a text into numeric vectors; TextCNN comprises an embedding layer, a convolution layer, a pooling layer and a fully connected softmax layer; for each sentence, a two-dimensional sentence matrix is built from the word vectors, filters of different sizes are convolved over it to obtain multiple features, max pooling is applied, the pooled features are concatenated, and classification is finally performed by the fully connected softmax layer; in addition, multiple fully connected layers are used to learn a common representation of the text;
U = [u_1, u_2, …, u_n], V = [v_1, v_2, …, v_n] and Y = [y_1, y_2, …, y_n] denote the image representation matrix, the text representation matrix and the label matrix of all instances, respectively, where n is the number of categories; W denotes the transition matrix and H denotes a dimension-invariant matrix used to optimize the loss function, their sizes being expressed in terms of D_S, the dimension of the source space, and D_τ, the dimension of the target space.
3. The bidirectional CNN-based cross-modal semantic clustering method according to claim 2, wherein the specific method of step 3 is as follows:
the bidirectional CNN network is trained on the training set using the SGD optimizer with a momentum of 0.9.
4. The bidirectional CNN-based cross-modal semantic clustering method according to claim 3, wherein the specific method of step 4 is as follows:
the loss function is set as a combination of the loss between samples and cluster centers in the target space, the distribution difference loss of categories across modalities, and the discriminant loss; to reduce the overlap between different categories in the target space, a dimension-invariant matrix is learned that reduces the variance of the category distributions, which effectively reduces both the loss of semantic information and the difficulty of choosing a dimension;
first, the target centers of the semantic clusters are calculated, c cluster centers being obtained by averaging the samples of each class:
[equation not reproduced]
where X_τ is the set of cluster centers, N_0 is the number of samples, D_τ is the dimension of the target space, and n is the number of classes;
the loss between samples and cluster centers in the target space is then as follows:
[equation not reproduced]
the category distribution difference between modalities is narrowed by minimizing the maximum mean discrepancy (MMD) of the category distributions, that is, by minimizing the square of the maximum mean discrepancy between X_S W and X_τ H:
[equation not reproduced]
finally, the prediction loss, that is, the difference between the predicted result and the true value, is calculated using cross entropy:
[equation not reproduced]
where p_{*,i} is the probability distribution generated for each image or text, and y_i is its true label value;
the final overall loss function is therefore expressed as:
[equation not reproduced]
where θ denotes the model variables to be optimized and λ is a weight coefficient.
5. The bidirectional CNN-based cross-modal semantic clustering method according to claim 4, wherein the specific method of step 5 is as follows:
the image-text pairs of the test set are input into the trained model to obtain the predicted high-level semantic representations of the images and texts, and the model is evaluated by the calculated mean average precision (mAP); finally, the trained model is saved and tested on the test set, and each evaluation index is calculated.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110718799.8A CN113537304A (en) | 2021-06-28 | 2021-06-28 | Cross-modal semantic clustering method based on bidirectional CNN |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113537304A true CN113537304A (en) | 2021-10-22 |
Family
ID=78125968
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110718799.8A Pending CN113537304A (en) | 2021-06-28 | 2021-06-28 | Cross-modal semantic clustering method based on bidirectional CNN |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113537304A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114925238A (en) * | 2022-07-20 | 2022-08-19 | 山东大学 | Video clip retrieval method and system based on federal learning |
WO2023093574A1 (en) * | 2021-11-25 | 2023-06-01 | 北京邮电大学 | News event search method and system based on multi-level image-text semantic alignment model |
CN116503675A (en) * | 2023-06-27 | 2023-07-28 | 南京理工大学 | Multi-category target identification method and system based on strong clustering loss function |
CN116955699A (en) * | 2023-07-18 | 2023-10-27 | 北京邮电大学 | Video cross-mode search model training method, searching method and device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111562108A (en) * | 2020-05-09 | 2020-08-21 | 浙江工业大学 | Rolling bearing intelligent fault diagnosis method based on CNN and FCMC |
CN112487822A (en) * | 2020-11-04 | 2021-03-12 | 杭州电子科技大学 | Cross-modal retrieval method based on deep learning |
-
2021
- 2021-06-28 CN CN202110718799.8A patent/CN113537304A/en active Pending
Non-Patent Citations (2)
Title |
---|
LIANG LI等: ""Large scale image understanding with non-convex multi-task learning"", IEEEXPLORE, 27 November 2014 (2014-11-27), pages 1 - 6, XP032737024, DOI: 10.1109/GAMENETS.2014.7043721 * |
梅子行: ""智能风控 原理、算法与工程实践"", 31 January 2020, 北京:机械工业出版社 , pages: 76 - 80 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023093574A1 (en) * | 2021-11-25 | 2023-06-01 | 北京邮电大学 | News event search method and system based on multi-level image-text semantic alignment model |
CN114925238A (en) * | 2022-07-20 | 2022-08-19 | 山东大学 | Video clip retrieval method and system based on federal learning |
CN114925238B (en) * | 2022-07-20 | 2022-10-28 | 山东大学 | Federal learning-based video clip retrieval method and system |
CN116503675A (en) * | 2023-06-27 | 2023-07-28 | 南京理工大学 | Multi-category target identification method and system based on strong clustering loss function |
CN116503675B (en) * | 2023-06-27 | 2023-08-29 | 南京理工大学 | Multi-category target identification method and system based on strong clustering loss function |
CN116955699A (en) * | 2023-07-18 | 2023-10-27 | 北京邮电大学 | Video cross-mode search model training method, searching method and device |
CN116955699B (en) * | 2023-07-18 | 2024-04-26 | 北京邮电大学 | Video cross-mode search model training method, searching method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7360497B2 (en) | Cross-modal feature extraction method, extraction device, and program | |
CN107346328B (en) | Cross-modal association learning method based on multi-granularity hierarchical network | |
CN110059217B (en) | Image text cross-media retrieval method for two-stage network | |
CN112905822B (en) | Deep supervision cross-modal counterwork learning method based on attention mechanism | |
CN104899253B (en) | Towards the society image across modality images-label degree of correlation learning method | |
CN112800292B (en) | Cross-modal retrieval method based on modal specific and shared feature learning | |
CN110647904B (en) | Cross-modal retrieval method and system based on unmarked data migration | |
CN106202256B (en) | Web image retrieval method based on semantic propagation and mixed multi-instance learning | |
CN112819023B (en) | Sample set acquisition method, device, computer equipment and storage medium | |
CN113537304A (en) | Cross-modal semantic clustering method based on bidirectional CNN | |
CN112487822A (en) | Cross-modal retrieval method based on deep learning | |
CN109858015B (en) | Semantic similarity calculation method and device based on CTW (computational cost) and KM (K-value) algorithm | |
CN111125411B (en) | Large-scale image retrieval method for deep strong correlation hash learning | |
CN109784405B (en) | Cross-modal retrieval method and system based on pseudo-tag learning and semantic consistency | |
CN114298122B (en) | Data classification method, apparatus, device, storage medium and computer program product | |
CN112100346A (en) | Visual question-answering method based on fusion of fine-grained image features and external knowledge | |
CN110309268A (en) | A kind of cross-language information retrieval method based on concept map | |
CN113821670B (en) | Image retrieval method, device, equipment and computer readable storage medium | |
CN117273134A (en) | Zero-sample knowledge graph completion method based on pre-training language model | |
CN113535949B (en) | Multi-modal combined event detection method based on pictures and sentences | |
CN108470025A (en) | Partial-Topic probability generates regularization own coding text and is embedded in representation method | |
CN115309930A (en) | Cross-modal retrieval method and system based on semantic identification | |
CN117891939A (en) | Text classification method combining particle swarm algorithm with CNN convolutional neural network | |
CN114239730B (en) | Cross-modal retrieval method based on neighbor ordering relation | |
CN112182275A (en) | Trademark approximate retrieval system and method based on multi-dimensional feature fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||