CN114461821A - Cross-modal image-text inter-searching method based on self-attention reasoning - Google Patents

Cross-modal image-text inter-searching method based on self-attention reasoning Download PDF

Info

Publication number
CN114461821A
CN114461821A CN202210184249.7A
Authority
CN
China
Prior art keywords
attention
text
image
cross
reasoning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210184249.7A
Other languages
Chinese (zh)
Inventor
李召
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202210184249.7A priority Critical patent/CN114461821A/en
Publication of CN114461821A publication Critical patent/CN114461821A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/432Query formulation
    • G06F16/434Query formulation using image data, e.g. images, photos, pictures taken by a user
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a cross-modal image-text inter-searching method based on self-attention reasoning, belonging to the field of cross-modal retrieval. The proposed self-attention reasoning model mainly comprises three modules. The first part extracts image saliency features bottom-up with a pre-trained backbone network and extracts text-branch features with a word-embedding plus sequence-model structure. The second part designs a self-attention reasoning module that considers each bounding box's contribution to the overall semantics and the cohesion between semantics, further eliminating the negative influence of irrelevant semantics. The third part designs an interactive attention module between the two branches, so that corresponding image-text pairs carry greater weight in the subsequent similarity evaluation. Experiments show that, compared with traditional methods, the method achieves higher matching accuracy and faster retrieval.

Description

Cross-modal image-text inter-searching method based on self-attention reasoning
Technical Field
The invention belongs to the intersection of vision and language, is applied to cross-modal retrieval tasks between images and text, and particularly relates to a cross-modal image-text inter-searching method based on self-attention reasoning.
Background
With the rapid development of network technologies, especially emerging social platforms and mobile applications, the internet is flooded with a huge amount of multimodal information (text, images, video, audio, etc.). Accordingly, users' demands on human-computer interaction, namely the search function, have also changed: many platforms are no longer limited to matching within a single modality but provide cross-modal matching. When users search for information by submitting a query of any modality, they can obtain search results in various forms, and the different modalities of the data provide more comprehensive, complementary information. Image-text matching is an important branch of multimodal matching and is applied in many fields, such as image-text retrieval, image captioning, visual question answering, and image knowledge reasoning.
The research significance and value of image-text retrieval algorithms are reflected in the following aspects. First, they have practical application value: the most common applications are search engines and recommendation systems, and a great deal of research and deployment has been carried out on e-commerce platforms and social networking sites. Image-text retrieval lets users obtain more of the information they are interested in; for example, when shopping, a few simple text descriptions can retrieve recommendations of matching clothing images, and a photo of a dish taken on the spot can retrieve the corresponding recipe, making daily life more convenient. Second, image-text retrieval can also be applied to surveillance and security. For example, text-based person re-identification searches for pictures from a textual description; on the surface it only applies a unidirectional text-to-image search, but it is essentially image-text retrieval, and in pursuit scenarios it can help law enforcement quickly lock onto an initial target person and accelerate case solving. In addition, image-text retrieval can support new media, for example matching literary works or media creations with appropriate illustration pictures.
With the geometric growth of information on the internet and the development of big data and computing power, deep learning has abundant data to build on, and image-text retrieval algorithms based on deep learning have begun to develop rapidly. The challenges also grow, however: how to retrieve images and text quickly from massive data containing interference and noise, and how to improve the extraction and measurement of image and text features so that the model matches well, are the key problems faced by image-text retrieval.
Disclosure of Invention
The invention aims to design a cross-modal image-text inter-searching method based on self-attention reasoning, in order to bridge the semantic gap between image vision and natural language and achieve better semantic understanding. Based on deep learning, the method improves the accuracy and speed of cross-modal retrieval between images and texts and can be applied in practical scenarios.
In order to achieve the above object, the present invention provides a cross-modal image-text inter-searching method based on self-attention reasoning, comprising:
acquiring a dataset to obtain paired original image data and text annotation data, and dividing them into a training set, a validation set and a test set;
for each data pair in the training set, respectively extracting the initial image features output by a pre-trained model and the feature embeddings after text encoding;
self-attention reasoning: mapping the two modal features into the same latent common space, reasoning about the internal coupling relations within the image branch, and using a self-attention mechanism to calculate each local bounding box's contribution to the whole image and fuse them into a new image representation;
designing a cross attention layer to obtain the representation of each modality's semantics in the other modality's semantic space, calculating the similarity, training with a triplet loss function, and finally achieving cross-modal semantic alignment;
model validation: validating the trained model on the validation set to select the optimal model;
model evaluation: evaluating the optimal model on the test set to obtain its retrieval accuracy;
performing bidirectional image and text retrieval with the final optimal model.
Further, a Faster R-CNN whose backbone network is ResNet-101 pre-trained on the Visual Genome dataset is used to perform bottom-up detection and extract image features.
Further, an attention reasoning module inside the image branch obtains the importance of each bounding box, i.e. local image feature, to the global features as well as the internal relations among the bounding boxes, filters out irrelevant semantics, and infers and fuses the corresponding weight coefficients to obtain the final embedded representation of the image.
Further, word embedding is used to account for the relations between words: each word is embedded into a vector of the same dimension through an embedding matrix, and the text features are finally encoded by a bidirectional gated recurrent unit and mapped into the same common subspace as the image.
Further, a cross attention layer computes the interaction attention matrix between each local image region and each word, obtains the feature representation of the visual semantics in the text semantic space and that of the text in the visual semantic space, and computes the correlation between the two modalities with a cosine function.
Further, the cross-modal matching model is trained with a max-hinge loss, calculated as follows:
Loss = [λ - S(I,T) + S(I,T′)]+ + [λ - S(I,T) + S(I′,T)]+
where λ denotes the margin and [x]+ = max(x, 0). S(I,T) denotes the similarity between a matched image-text pair, while S(I,T′) and S(I′,T) denote the similarities with the hardest negative samples T′ = argmax_{p≠T} S(I,p) and I′ = argmax_{q≠I} S(q,T).
The features of the image or text to be queried are extracted and input into the optimal model to obtain a matching-score matrix between the image and all texts, or between the text and all images; sorting the scores of the matrix from largest to smallest gives the retrieval result.
In the specific training stage, the model is trained under the PyTorch framework on an Nvidia GTX 2080Ti GPU and an Intel Core i7-9700k CPU, using the Adam training strategy.
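As a point of reference, a minimal PyTorch setup consistent with the hardware and Adam strategy described above is sketched below. The stand-in model, the learning rate and the placeholder objective are illustrative assumptions, not values or components taken from this patent.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the matching model; the real model combines the
# image and text branches described in the detailed steps below.
model = nn.Linear(2048, 1024)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Adam training strategy as stated above; the learning rate is an assumed value.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

# One illustrative optimization step on dummy region features.
features = torch.randn(32, 36, 2048, device=device)
loss = model(features).pow(2).mean()   # placeholder objective, not the patent's loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```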
Compared with the prior art, the technical scheme eliminates the negative effects of irrelevant information through self-attention reasoning and obtains a more accurate feature expression for the visual branch. On this basis, to make the correspondence between the modalities more pronounced, the cross attention layer between the two modalities further matches the feature expression of the visual semantics in the text semantic space with that of the text in the visual semantic space. The resulting multimodal features are both semantically general and accurate, which effectively improves the accuracy and stability of the cross-modal retrieval method.
Drawings
FIG. 1 is a schematic flow chart of the cross-modal image-text inter-searching method based on self-attention reasoning provided by the present invention;
FIG. 2 is a schematic diagram of an overall network structure according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the self-attention reasoning network structure according to an embodiment of the present invention;
FIG. 4 is a cross-attention structure diagram according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the accompanying drawings and specific embodiments. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and these equivalents also fall within the scope of the present application.
Referring to FIGS. 1 and 2, the invention designs a cross-modal image-text inter-searching method based on self-attention reasoning, whose concrete implementation process is as follows:
Step 1: acquire the dataset to obtain paired original image data and text annotation data, and divide them into a training set, a validation set and a test set.
The multimodal dataset used for training, validation and testing contains paired images and texts, with each picture corresponding to five related text items. This embodiment divides the dataset into a training set, a validation set and a test set.
Step 2: for each data pair in the training set, extract the initial image features output by the pre-trained model and the feature embeddings after text encoding, respectively.
Image feature extraction here focuses on regions at the object and instance level. A Faster R-CNN with ResNet-101 pre-trained on the Visual Genome dataset as its backbone network is used to perform bottom-up detection. The network predicts attribute classes and instance classes, extracts the salient regions of the original image features, and keeps the top 36 bounding boxes by confidence as the image embedding. For each bounding box i, f_i is defined as the average-pooled feature of the region, with dimension 2048. Through a fully connected layer, f_i is mapped into a latent common space to obtain the final K-dimensional vector.
v_i = W_v f_i + b_v
In the formula, W_v and b_v are the parameters of the fully connected layer. V = {v_1, v_2, …, v_n}, with v_i ∈ R^K, is the transformed representation of the image features in the common space, where n is the number of bounding boxes per image and each bounding box feature is K-dimensional.
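For illustration, a minimal sketch of this projection step is given below; the batch size, the embedding dimension K = 1024 and the module name are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class RegionProjection(nn.Module):
    """Maps 2048-d Faster R-CNN region features f_i into the K-d common space,
    v_i = W_v f_i + b_v, via a single fully connected layer."""
    def __init__(self, feat_dim=2048, embed_dim=1024):  # K = 1024 is an assumed value
        super().__init__()
        self.fc = nn.Linear(feat_dim, embed_dim)

    def forward(self, region_feats):
        # region_feats: (batch, 36, 2048) average-pooled bounding-box features
        return self.fc(region_feats)                     # (batch, 36, K)

# Example: project a batch of 8 images with 36 detected regions each.
regions = torch.randn(8, 36, 2048)
v = RegionProjection()(regions)                          # v: (8, 36, 1024)
```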
To connect the visual and linguistic domains, the text should be mapped into the same K-dimensional semantic vector space as the image regions. Given a sentence, the simplest approach would be to map each word separately, but each word would then be a one-hot vector encoding its index in the dictionary, independent of the others and carrying no relatedness. Word embedding is therefore used to take the associations between words into account: through an embedding matrix W_e, each word is embedded into a 300-dimensional vector, x_i = W_e w_i, i ∈ [1, m]. Then, instead of a simple linear mapping, which cannot effectively describe the contextual connections between words, a bidirectional GRU is used to encode the text features and map them into the same joint subspace as the image.
t_i = (GRU_f(x_i) + GRU_b(x_i)) / 2, i ∈ [1, m]
In the formula, GRU_f denotes the forward GRU and GRU_b the backward GRU. The final text features are obtained by averaging the sentence information from the two directions.
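A minimal sketch of this text branch, under assumed vocabulary size and embedding dimension, might look as follows; the class name and hyperparameter values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Word embedding followed by a bidirectional GRU; the forward and backward
    hidden states are averaged to give K-dimensional word features t_i."""
    def __init__(self, vocab_size=10000, word_dim=300, embed_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)   # x_i = W_e w_i
        self.gru = nn.GRU(word_dim, embed_dim, batch_first=True, bidirectional=True)

    def forward(self, word_ids):
        x = self.embed(word_ids)                          # (batch, m, 300)
        h, _ = self.gru(x)                                # (batch, m, 2K)
        half = h.size(-1) // 2
        fwd, bwd = h[..., :half], h[..., half:]
        return (fwd + bwd) / 2                            # t_i: (batch, m, K)

tokens = torch.randint(0, 10000, (8, 20))                 # a batch of 20-word sentences
t = TextEncoder()(tokens)                                 # (8, 20, 1024)
```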
Step 3: referring to the self-attention reasoning shown in FIG. 3, the two modal features are mapped into the same latent common space, the internal coupling relations within the image branch are inferred, and a self-attention mechanism is used to calculate each local bounding box's contribution to the whole image and fuse them into a new image representation.
The network can train its parameters to assign weights according to each bounding box's contribution to the "semantics" of the whole image. Moreover, "semantic cohesion" may exist between the bounding boxes, with different regions ultimately pointing to the same meaning, so their features are re-fused and irrelevant parts are removed. First, the flattened image representation output by the preceding feature-extraction network is fed into two attention modules S1 and S2 connected in series. Second, the response at a given position in the sequence is computed as the weighted average, in the embedding space, of all other positions. Then, a shortcut connection in the style of a residual network is applied to avoid the adverse effects of increasing the number of layers. Finally, a fully connected layer maps the image features into the joint subspace again.
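A minimal sketch of one such self-attention block is given below, under assumed dimensions; the scaled-dot-product form of the weighting and the module name are assumptions, not the patent's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionBlock(nn.Module):
    """Re-expresses each region as a weighted average of all regions in the
    embedding space, adds a residual shortcut, then maps back via a FC layer."""
    def __init__(self, dim=1024):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, v):                                  # v: (batch, n_regions, dim)
        q, k, val = self.query(v), self.key(v), self.value(v)
        attn = F.softmax(q @ k.transpose(1, 2) / v.size(-1) ** 0.5, dim=-1)
        refined = attn @ val                               # weighted average over regions
        return self.out(refined + v)                       # residual shortcut, then FC

v = torch.randn(8, 36, 1024)
v_hat = SelfAttentionBlock()(SelfAttentionBlock()(v))      # two stacked blocks S1, S2
```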
Step 4: a cross attention layer is designed to obtain the representation of each modality's semantics in the other modality's semantic space; the similarity is calculated and the model is trained with a triplet loss function, finally achieving cross-modal semantic alignment.
A cross-modal cross-attention mechanism is adopted to learn the degree of focus between text and image and to align local features. For each original region v_i and its self-attention-refined representation v̂_i, an attention score a_ij with each word t_j is computed, which determines how strongly the current region attends to each word. Referring to FIG. 4, taking v_1 and v̂_1 as an example, a weight is calculated for every word in the whole sentence, yielding a region-attended representation of the sentence semantics that is used to measure similarity. By analogy, the similarity matrices between all image regions and the sentence are obtained. The normalized scores α_ij and α̂_ij denote the attention between the i-th region and the j-th word, computed from v_i and v̂_i respectively. A gating mechanism then fuses these two attention weights: the fused score β_ij is obtained by passing concat(α_ij, α̂_ij) through fully connected layers with parameters W_1 and W_2, where concat denotes vector concatenation. The sentence semantics corresponding to the current region is defined as the weighted combination of the words with the fused attention weights, s_i = Σ_j β_ij t_j. On this basis, the final image embedding V̂ and text embedding T are obtained. The semantic relevance between image I and text T is measured as the average relevance between every region feature in the image embedding and its corresponding sentence semantic vector:
S(I,T) = (1/n) Σ_{i=1}^{n} r(v̂_i, s_i)
where r(·,·) denotes the cosine similarity function.
The emphasis in training with the loss function is on handling "hard" negative samples. To improve computational efficiency, only the hard negatives within a mini-batch of stochastic gradient descent are typically considered, rather than summing over all negative samples. The max-hinge loss is used here to achieve cross-modal image-text alignment. Within a mini-batch, queries can be made in both the image-to-text and text-to-image directions.
Loss = [λ - S(I,T) + S(I,T′)]+ + [λ - S(I,T) + S(I′,T)]+
where λ denotes the margin and [x]+ = max(x, 0). S(I,T) denotes the similarity between a matched image-text pair, while S(I,T′) and S(I′,T) denote the similarities with the hardest negative samples T′ = argmax_{p≠T} S(I,p) and I′ = argmax_{q≠I} S(q,T).
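A hedged sketch of this max-hinge loss with hardest in-batch negatives follows; it assumes the batch similarity matrix has matched pairs on its diagonal, and the margin value 0.2 is an assumption.

```python
import torch

def hinge_triplet_loss(scores, margin=0.2):
    """scores: (n, n) similarity matrix S between every image and caption in a
    mini-batch, with matched pairs on the diagonal. Returns the max-hinge loss
    over the hardest negative caption T' and the hardest negative image I'."""
    n = scores.size(0)
    pos = scores.diag().view(n, 1)
    cost_txt = (margin + scores - pos).clamp(min=0)       # image -> hardest caption T'
    cost_img = (margin + scores - pos.t()).clamp(min=0)   # caption -> hardest image I'
    mask = torch.eye(n, dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    return cost_txt.max(dim=1).values.mean() + cost_img.max(dim=0).values.mean()

loss = hinge_triplet_loss(torch.randn(128, 128))
```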
Step 5: model validation. The trained model is validated on the validation set to select the optimal model. Steps 2 to 4 are iterated, an evaluation on the validation set is performed every 1000 iterations, and the best-performing model is kept.
Step 6: model evaluation. The optimal model is evaluated on the test set to obtain its retrieval accuracy.
Specifically, the preprocessed test set is input into the trained optimal model, and the text-to-image and image-to-text retrieval accuracies are evaluated at the same time. The retrieved results are compared with the ground-truth annotations, and the recall rate is calculated as the evaluation index of the model.
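For reference, a sketch of the Recall@K metric used in this kind of evaluation follows; the assumption that captions 5i to 5i+4 belong to image i reflects the five-captions-per-image datasets described below, and the function name is hypothetical.

```python
import torch

def recall_at_k(scores, k=1):
    """scores: (n_images, n_captions) similarity matrix, where captions
    5i .. 5i+4 are the ground-truth matches of image i. Returns the fraction
    of images whose top-K retrieved captions contain a ground-truth match."""
    n_images = scores.size(0)
    topk = scores.topk(k, dim=1).indices                  # top-K caption indices per image
    hits = 0
    for i in range(n_images):
        gt = torch.arange(i * 5, i * 5 + 5)
        if (topk[i].unsqueeze(1) == gt).any():
            hits += 1
    return hits / n_images                                # image-to-text Recall@K

r1 = recall_at_k(torch.rand(1000, 5000), k=1)
```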
Step 7: bidirectional image and text retrieval is performed with the final optimal model.
The features of the image or text to be queried are extracted and input into the optimal model to obtain a matching-score matrix between the image and all texts, or between the text and all images; sorting the scores of the matrix from largest to smallest gives the retrieval result.
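A sketch of this retrieval step in the image-to-text direction is given below; it reuses the image_text_similarity sketch from step 4, and the text-to-image direction simply swaps the roles of query and gallery. The function name and top_k value are illustrative assumptions.

```python
import torch

def retrieve_texts_for_image(img_regions, candidate_texts, top_k=10):
    """img_regions: (36, K) region features of the query image;
    candidate_texts: list of (m_i, K) word-feature tensors, one per caption.
    Scores every caption, then sorts the scores from largest to smallest."""
    scores = torch.stack([image_text_similarity(img_regions, t) for t in candidate_texts])
    order = torch.argsort(scores, descending=True)        # ranked retrieval result
    return order[:top_k], scores[order[:top_k]]
```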
The proposed method is validated on the popular Flickr30K and MSCOCO datasets, which are widely used in image-text matching, image retrieval and image captioning tasks. Flickr30K contains 31000 pictures, each with five corresponding sentence entries; 1000 pictures are used as the validation set, 1000 as the test set, and the rest are used for training. MSCOCO contains 123287 images, each likewise corresponding to five textual descriptions, of which 113287 images are used as the training set, 5000 as the validation set and 5000 as the test set. Experiments show that the method has certain advantages over traditional methods.
In summary, the negative effects of irrelevant information can be eliminated through self-attention reasoning to obtain a more accurate feature expression for the visual branch. On this basis, to make the correspondence between the modalities more pronounced, the cross attention layer between the two modalities further matches the feature expression of the visual semantics in the text semantic space with that of the text in the visual semantic space. The resulting multimodal features are both semantically general and accurate, which effectively improves the accuracy and stability of the cross-modal retrieval method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention should not be limited to the embodiments shown herein, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (7)

1. A cross-modal image-text inter-searching method based on self-attention reasoning, characterized by comprising the following steps:
acquiring a dataset to obtain paired original image data and text annotation data, and dividing them into a training set, a validation set and a test set;
for each data pair in the training set, respectively extracting the initial image features output by a pre-trained model and the feature embeddings after text encoding;
self-attention reasoning: mapping the two modal features into the same latent common space, reasoning about the internal coupling relations within the image branch, and using a self-attention mechanism to calculate each local bounding box's contribution to the whole image and fuse them into a new image representation;
designing a cross attention layer to obtain the representation of each modality's semantics in the other modality's semantic space, calculating the similarity, training with a triplet loss function, and finally achieving cross-modal semantic alignment;
model validation: validating the trained model on the validation set to select the optimal model;
model evaluation: evaluating the optimal model on the test set to obtain its retrieval accuracy;
performing bidirectional image and text retrieval with the final optimal model.
2. The cross-modal image-text inter-searching method based on self-attention reasoning as claimed in claim 1, wherein bottom-up detection is performed to extract image features using a Faster R-CNN whose backbone network is ResNet-101 pre-trained on the Visual Genome dataset.
3. The cross-modal image-text inter-searching method based on self-attention reasoning according to claim 2, wherein the importance of each bounding box, i.e. local image feature, to the global features and the internal relations among the bounding boxes are obtained through an attention reasoning module inside the image branch, irrelevant semantics are filtered out, and the corresponding weight coefficients are inferred and fused to obtain the final embedded representation of the image.
4. The cross-modal image-text inter-searching method based on self-attention reasoning as claimed in claim 1, wherein word embedding is used to account for the relations between words: each word is embedded into a vector of the same dimension through an embedding matrix, and the text features are finally encoded by a bidirectional gated recurrent unit and mapped into the same common subspace as the image.
5. The cross-modal image-text inter-searching method based on self-attention reasoning as claimed in any one of claims 1-4, wherein an interaction attention matrix between each local image region and each word is calculated through a cross attention layer, the feature representation of the visual semantics in the text semantic space and that of the text in the visual semantic space are obtained, and a cosine function is used to calculate the correlation between the two modalities.
6. The cross-modal image-text inter-searching method based on self-attention reasoning according to claim 5, characterized in that the cross-modal matching model is trained with a max-hinge loss, calculated as follows:
Loss = [λ - S(I,T) + S(I,T′)]+ + [λ - S(I,T) + S(I′,T)]+
wherein S(I,T) denotes the similarity between a matched image-text pair, S(I,T′) and S(I′,T) denote the similarities with the hardest negative samples T′ = argmax_{p≠T} S(I,p) and I′ = argmax_{q≠I} S(q,T), λ denotes the margin, and [x]+ = max(x, 0).
7. The cross-modal image-text inter-searching method based on self-attention reasoning according to claim 1, characterized in that the image-text retrieval step specifically comprises: extracting the features of the image or text to be queried, inputting them into the optimal model to obtain a matching-score matrix between the image and all texts or between the text and all images, and sorting the scores of the matrix from largest to smallest to obtain the retrieval result.
CN202210184249.7A 2022-02-24 2022-02-24 Cross-modal image-text inter-searching method based on self-attention reasoning Pending CN114461821A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210184249.7A CN114461821A (en) 2022-02-24 2022-02-24 Cross-modal image-text inter-searching method based on self-attention reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210184249.7A CN114461821A (en) 2022-02-24 2022-02-24 Cross-modal image-text inter-searching method based on self-attention reasoning

Publications (1)

Publication Number Publication Date
CN114461821A true CN114461821A (en) 2022-05-10

Family

ID=81414959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210184249.7A Pending CN114461821A (en) 2022-02-24 2022-02-24 Cross-modal image-text inter-searching method based on self-attention reasoning

Country Status (1)

Country Link
CN (1) CN114461821A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024001100A1 (en) * 2022-06-30 2024-01-04 苏州元脑智能科技有限公司 Method and apparatus for processing text, and device and non-volatile readable storage medium
CN115017358A (en) * 2022-08-09 2022-09-06 南京理工大学 Cross-modal retrieval method and system for multi-modal interaction
CN115017358B (en) * 2022-08-09 2022-11-04 南京理工大学 Cross-modal retrieval method and system for multi-modal interaction
CN116150418A (en) * 2023-04-20 2023-05-23 南京邮电大学 Image-text matching method and system based on mixed focusing attention mechanism

Similar Documents

Publication Publication Date Title
CN112905827B (en) Cross-modal image-text matching method, device and computer readable storage medium
CN110929038B (en) Knowledge graph-based entity linking method, device, equipment and storage medium
CN114461821A (en) Cross-modal image-text inter-searching method based on self-attention reasoning
CN109190117A (en) A kind of short text semantic similarity calculation method based on term vector
CN110543557A (en) construction method of medical intelligent question-answering system based on attention mechanism
Li et al. Measuring and predicting tag importance for image retrieval
CN111858940B (en) Multi-head attention-based legal case similarity calculation method and system
CN113761890B (en) Multi-level semantic information retrieval method based on BERT context awareness
CN113656660B (en) Cross-modal data matching method, device, equipment and medium
KR102059743B1 (en) Method and system for providing biomedical passage retrieval using deep-learning based knowledge structure construction
CN114595306A (en) Text similarity calculation system and method based on distance perception self-attention mechanism and multi-angle modeling
CN116450883A (en) Video moment retrieval method based on video content fine granularity information
Yu et al. Question classification based on MAC-LSTM
Renjit et al. CUSAT NLP@ AILA-FIRE2019: Similarity in Legal Texts using Document Level Embeddings.
CN112528136A (en) Viewpoint label generation method and device, electronic equipment and storage medium
Khurram et al. Dense-captionnet: a sentence generation architecture for fine-grained description of image semantics
CN116244448A (en) Knowledge graph construction method, device and system based on multi-source data information
Verma et al. Automatic image caption generation using deep learning
CN114004236B (en) Cross-language news event retrieval method integrating knowledge of event entity
Liu et al. Improved Chinese sentence semantic similarity calculation method based on multi-feature fusion
CN112269892B Interactive phrase localization and identification method based on multi-level unified multi-modality
CN112749566B (en) Semantic matching method and device for English writing assistance
CN112883182A (en) Question-answer matching method and device based on machine reading
CN113657116B (en) Social media popularity prediction method and device based on visual semantic relationship
CN115292533A (en) Cross-modal pedestrian retrieval method driven by visual positioning

Legal Events

Date Code Title Description
PB01 Publication