CN113095431B - Image description method, system and device based on attention mechanism - Google Patents

Image description method, system and device based on attention mechanism

Info

Publication number
CN113095431B
CN113095431B
Authority
CN
China
Prior art keywords
vector
image
attention
information
image description
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110457256.5A
Other languages
Chinese (zh)
Other versions
CN113095431A (en)
Inventor
胡海峰
夏志武
吴永波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202110457256.5A
Publication of CN113095431A
Application granted
Publication of CN113095431B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The application discloses an image description method, system and device based on an attention mechanism. The method comprises the following steps: processing image features with an encoder module to obtain encoded information; acquiring sequence vector information with a decoder module and decoding the encoded information to obtain a word probability distribution; and repeating the encoding and decoding steps until a preset number of iterations is reached, then outputting the image description. The system comprises an encoder module, a decoder module and a loop module. The device comprises a memory and a processor for performing the above image description method based on an attention mechanism. With the method and device, the latent semantic relations and latent spatial position relations between objects in an image can be fully mined, and a comprehensive and accurate image description can be generated. The image description method, system and device based on an attention mechanism can be widely applied in image description generation.

Description

Image description method, system and device based on attention mechanism
Technical Field
The present application relates to the field of image description generation, and in particular, to an image description method, system and device based on an attention mechanism.
Background
Image description generation is a challenging task in the field of artificial intelligence and is receiving increasing attention. It opens new development and application prospects for computers to quickly acquire information from images, and is closely related to technologies such as image semantic analysis, image annotation and high-level image semantic extraction. In image description generation, a computer automatically generates a complete and fluent descriptive sentence for an image. Against the background of big data, the technology has found wide application in commerce: a user enters keywords in shopping software and quickly finds matching goods; a user searches for pictures in a search engine. Other applications include multi-object recognition in video, automatic semantic annotation of medical images, target recognition in autonomous driving, image retrieval, intelligent guidance for the blind, human-computer interaction, and so on. However, conventional image description generation methods neither fully mine the latent semantic information of an image nor fully exploit its features, so the generated descriptions are not accurate and comprehensive.
Disclosure of Invention
In order to solve the above technical problems, the application aims to provide an image description method, system and device based on an attention mechanism that deeply mine the semantic relations between objects in an image and generate more flexible and more accurate textual descriptions.
The first technical solution adopted by the application is as follows: an image description method based on an attention mechanism, comprising the following steps:
S1, acquiring the image features X of an input image and performing a linear transformation on X to obtain the vector sets Q, K_1 and V_1;
S2, inserting the semantic association vectors S_k and S_v into the vector sets K_1 and V_1 respectively, obtaining the vector sets K_2 and V_2;
S3, inputting the vector sets Q, K_2 and V_2 into the self-attention module S to obtain the feature representation S(X);
S4, regularizing the feature representation S(X) through forward propagation and residual connection to obtain the encoded information X̃;
S5, acquiring the sequence vector information Y of the previous time step and processing it with the masked self-attention module to obtain the query vector Y_q;
S6, applying a linear transformation to the encoded information X̃ to obtain new vector sets K_2 and V_2;
S7, inputting the query vector Y_q and the vector sets K_2 and V_2 into the cross-attention module to obtain the decoding result C, then applying residual connection and regularization to update C;
S8, passing the decoding result C through a Sigmoid operator and forward propagation to obtain the word probability distribution Ỹ;
S9, taking the encoded information X̃ as the new image features and the word probability distribution Ỹ as the new sequence vector representation, and returning to step S1 until the number of loops reaches four, whereupon the image description is output.
Further, the step of linearly transforming the image features X to obtain the vector sets Q, K_1 and V_1 comprises:
computing dot products of the preset-size weight matrices W_q, W_k and W_v with the image features X to obtain the vector sets Q, K_1 and V_1 representing the features.
Further, the self-attention module S consists of a basic scaled dot-product operation. In the step of inputting the vector sets Q, K_2 and V_2 into the self-attention module S to obtain the feature representation S(X), the formulas are:
S(X) = Attention(W_qX, [W_kX, S_k], [W_vX, S_v])
Attention(Q, K, V) = softmax(QK^T / √d_k)V
where Attention(·) denotes the self-attention operator, softmax(·) compresses the elements of a matrix into the range (0, 1), and d_k denotes the dimension of the feature vectors.
Further, in the step of regularizing the feature representation S(X) through forward propagation and residual connection to obtain the encoded information X̃, the formulas are:
Z = AddNorm(S(X))
F(Z)_i = Mσ(WZ_i + b) + c
where AddNorm(·) consists of a residual connection followed by layer normalization, Z_i denotes the i-th input vector, F(Z)_i denotes the i-th forward-propagation output vector, M and W denote learnable weight parameters, b and c denote learnable bias terms, and σ(·) denotes an activation function.
Further, in the step of inputting the query vector Y_q and the vector sets K_2 and V_2 into the cross-attention module to obtain the decoding result C, the formula is:
C = Attention_c(W_qY_q, K_2, V_2), where K_2 = W_kX̃ and V_2 = W_vX̃
and where Attention_c(·) denotes the cross-attention operator and W_k, W_v and W_q denote learnable weight parameters.
Further, in the step of passing C through a Sigmoid operator and forward propagation to obtain the word probability distribution Ỹ, [·,·] denotes matrix concatenation, Z denotes the result of applying the Sigmoid activation function, and F(Z) denotes the result of the subsequent forward-propagation processing.
The second technical solution adopted by the application is as follows: an image description system based on an attention mechanism, comprising:
an encoder module for performing the encoding step and processing the image features to obtain the encoded information;
a decoder module for performing the decoding step, acquiring the sequence vector information and decoding the encoded information to obtain the word probability distribution;
and a loop module for repeating the encoding and decoding steps until the preset number of iterations is reached and outputting the final image description.
The third technical solution adopted by the application is as follows: an image description device based on an attention mechanism, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement an attention mechanism based image description method as described above.
The beneficial effects of the method, system and device are as follows: the application adopts a novel attention mechanism, and in the encoding stage a multi-level encoder processes the input image features. This enables a precise measurement of the degree of association between objects in the image and a deep mining of the latent semantic associations hidden between them, improving description quality while reducing model complexity.
Drawings
FIG. 1 is a flow chart of steps of an embodiment of the present application;
FIG. 2 is a schematic diagram of the insertion of an association vector into a vector set in accordance with an embodiment of the present application.
Detailed Description
The application will now be described in further detail with reference to the drawings and to specific examples. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
Referring to FIG. 1, the present application provides an image description method based on an attention mechanism, comprising the following steps:
S1, acquiring the image features X of an input image and performing a linear transformation on X to obtain the vector sets Q, K_1 and V_1;
S2, inserting the semantic association vectors S_k and S_v into the vector sets K_1 and V_1 respectively, obtaining the vector sets K_2 and V_2. As shown in FIG. 2, the embedded word representation R is matrix-multiplied with the image features X to obtain the multi-modal semantic association information RX, and then two randomly initialized weight matrices W_1 and W_2 linearly transform RX into the vectors S_k and S_v respectively, thereby capturing the latent semantic information in the image features X. It should be noted that this latent semantic information is not present in the Q, K and V obtained by transforming the image features alone. A sketch of this step is given below.
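As an illustration of this step, the following NumPy sketch shows one way the semantic association vectors could be built; all shapes, the random initialisation, and the variable names are illustrative assumptions rather than values fixed by the patent.

```python
import numpy as np

n, d, m = 36, 512, 8              # regions, feature dim, association vectors (assumed sizes)
X = np.random.randn(n, d)         # image features X
R = np.random.randn(m, n)         # embedded word representation R (shape assumed)
RX = R @ X                        # multi-modal semantic association information RX, shape (m, d)
W_1 = np.random.randn(d, d)       # randomly initialised weight matrix W_1
W_2 = np.random.randn(d, d)       # randomly initialised weight matrix W_2
S_k, S_v = RX @ W_1, RX @ W_2     # semantic association vectors S_k and S_v, each (m, d)
```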
S3, inputting the vector sets Q, K_2 and V_2 into the self-attention module S to obtain the feature representation S(X);
S4, regularizing the feature representation S(X) through forward propagation and residual connection to obtain the encoded information X̃;
S5, acquiring the sequence vector information Y of the previous time step and processing it with the masked self-attention module to obtain the query vector Y_q;
S6, applying a linear transformation to the encoded information X̃ to obtain new vector sets K_2 and V_2;
S7, inputting the query vector Y_q and the vector sets K_2 and V_2 into the cross-attention module to obtain the decoding result C, then applying residual connection and regularization to update C;
specifically, C is obtained by cross attention between the encoded information and the prediction result of the previous time step; it is the preliminary decoding result of the current time step and is further processed into the final decoding result of the current time step;
S8, passing the decoding result C through a Sigmoid operator and forward propagation to obtain the word probability distribution Ỹ;
S9, taking the encoded information X̃ as the new image features and the word probability distribution Ỹ as the new sequence vector representation, and returning to step S1 until the number of loops reaches four, whereupon the image description is output.
Further, as a preferred embodiment of the method, the step of linearly transforming the image features X to obtain the vector sets Q, K_1 and V_1 comprises:
computing dot products of the preset-size weight matrices W_q, W_k and W_v with the image features X to obtain the vector sets Q, K_1 and V_1 representing the features.
Specifically, Q = W_qX, K_1 = W_kX, V_1 = W_vX.
Further, as a preferred embodiment of the method, in the step of inserting the semantic association vectors S_k and S_v into the vector sets K_1 and V_1 respectively to obtain the vector sets K_2 and V_2, the formulas are:
K_2 = [W_kX, S_k]
V_2 = [W_vX, S_v]
Further, as a preferred embodiment of the method, the self-attention module S consists of a basic scaled dot-product operation. In the step of inputting the vector sets Q, K_2 and V_2 into the self-attention module S to obtain the feature representation S(X), the formulas are:
S(X) = Attention(W_qX, [W_kX, S_k], [W_vX, S_v])
Attention(Q, K, V) = softmax(QK^T / √d_k)V
where Attention(·) denotes the self-attention operator, softmax(·) compresses the elements of a matrix into the range (0, 1), and d_k denotes the dimension of the feature vectors.
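A minimal NumPy sketch of this encoding step follows; it simply appends the semantic association vectors to the keys and values before a standard scaled dot-product attention. All sizes and the name S_X are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

n, d, m = 36, 512, 8                          # illustrative sizes
X = np.random.randn(n, d)                      # image features X
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
S_k, S_v = np.random.randn(m, d), np.random.randn(m, d)

K_2 = np.concatenate([X @ W_k, S_k], axis=0)   # [W_k X, S_k]
V_2 = np.concatenate([X @ W_v, S_v], axis=0)   # [W_v X, S_v]
S_X = attention(X @ W_q, K_2, V_2)             # feature representation S(X), shape (n, d)
```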
Further, as a preferred embodiment of the method, in the step of regularizing the feature representation S(X) through forward propagation and residual connection to obtain the encoded information X̃, the formulas are:
Z = AddNorm(S(X))
F(Z)_i = Mσ(WZ_i + b) + c
where AddNorm(·) consists of a residual connection followed by layer normalization, Z_i denotes the i-th input vector, F(Z)_i denotes the i-th forward-propagation output vector, M and W denote learnable weight parameters, b and c denote learnable bias terms, and σ(·) denotes an activation function.
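The AddNorm and forward-propagation step might look like the sketch below; σ is taken to be ReLU and the final AddNorm around F(Z) is an assumption, since the patent's figure for this step did not survive extraction.

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

def add_norm(x, sub):
    # AddNorm: residual connection followed by layer normalisation
    return layer_norm(x + sub)

def feed_forward(Z, M, W, b, c):
    # F(Z)_i = M sigma(W Z_i + b) + c, applied row-wise; ReLU assumed for sigma
    return np.maximum(Z @ W.T + b, 0.0) @ M.T + c

n, d, d_ff = 36, 512, 2048                        # illustrative sizes
X_in = np.random.randn(n, d)                       # input to the self-attention module
S_X = np.random.randn(n, d)                        # stands in for S(X)
Z = add_norm(X_in, S_X)
M, W = np.random.randn(d, d_ff), np.random.randn(d_ff, d)
b, c = np.zeros(d_ff), np.zeros(d)
X_enc = add_norm(Z, feed_forward(Z, M, W, b, c))   # encoded information (final AddNorm assumed)
```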
Further, as a preferred embodiment of the method, in the step of acquiring the sequence vector information Y of the previous time step and processing it with the masked self-attention module to obtain the query vector Y_q, the formulas are:
Y_q = AddNorm(Attention_m(Y))
Attention_m(Y) = softmax((W_qY)(W_kY)^T / √d_k + M)(W_vY)
where Attention_m(·) denotes the masked self-attention operator and M denotes the mask matrix.
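A sketch of the masked self-attention, assuming a standard causal mask M with −∞ above the diagonal; the projection matrices and sizes are illustrative, and the outer AddNorm is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Y, W_q, W_k, W_v):
    # Causal mask M: -inf above the diagonal so position t only attends to positions <= t
    t = Y.shape[0]
    M = np.triu(np.full((t, t), -np.inf), k=1)
    d_k = W_k.shape[1]
    scores = (Y @ W_q) @ (Y @ W_k).T / np.sqrt(d_k) + M
    return softmax(scores) @ (Y @ W_v)

t, d = 10, 512                                 # illustrative sizes
Y = np.random.randn(t, d)                      # sequence vectors of the previous time step
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
Y_q = masked_self_attention(Y, W_q, W_k, W_v)
```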
Further, as a preferred embodiment of the method, in the step of inputting the query vector Y_q and the vector sets K_2 and V_2 into the cross-attention module to obtain the decoding result C, the formula is:
C = Attention_c(W_qY_q, K_2, V_2), where K_2 = W_kX̃ and V_2 = W_vX̃
and where Attention_c(·) denotes the cross-attention operator and W_k, W_v and W_q denote learnable weight parameters.
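The cross-attention step could be sketched as follows, with K_2 and V_2 derived from the encoded information and Y_q from the masked self-attention above; sizes are again illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Y_q, K_2, V_2):
    # Queries come from the sentence side, keys/values from the encoded image side
    return softmax(Y_q @ K_2.T / np.sqrt(K_2.shape[-1])) @ V_2

t, n, d = 10, 36, 512                          # illustrative sizes
Y_q = np.random.randn(t, d)                    # query vectors
X_enc = np.random.randn(n, d)                  # encoded information
W_k, W_v = np.random.randn(d, d), np.random.randn(d, d)
K_2, V_2 = X_enc @ W_k, X_enc @ W_v            # linear transforms of the encoded information
C = cross_attention(Y_q, K_2, V_2)             # decoding result C, later updated by AddNorm
```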
Further, as a preferred embodiment of the method, in the step of passing the decoding result C through a Sigmoid operator and forward propagation to obtain the word probability distribution Ỹ, [·,·] denotes matrix concatenation, Z denotes the result of applying the Sigmoid activation function, and F(Z) denotes the result of the subsequent forward-propagation processing.
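Since the exact output formula did not survive extraction, the sketch below is only one plausible reading: gate the updated decoding result with a Sigmoid, apply the forward-propagation layer, and project to the vocabulary. Every size, the ReLU, and the projection matrix W_out are assumptions of this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

t, d, vocab = 10, 512, 10000                   # illustrative sizes
C = np.random.randn(t, d)                      # updated decoding result C
Z = sigmoid(C)                                 # "Z": C after the Sigmoid activation
W_ff = np.random.randn(d, d)
F_Z = np.maximum(Z @ W_ff, 0.0)                # forward propagation F(Z), ReLU assumed
W_out = np.random.randn(d, vocab)              # hypothetical vocabulary projection
Y_prob = softmax(F_Z @ W_out)                  # word probability distribution per position
```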
As shown in the upper part of FIG. 1, the encoder and decoder are connected as follows: the output of each encoder layer serves as the input of the next encoder layer, the output of each decoder layer serves as the input of the next decoder layer, and in addition each encoder layer is connected to the decoder layer with the corresponding serial number. In this way every level of the encoder's multi-level features is fully decoded in the decoder, which avoids losing the initial image features and degrading the quality of the final image description; a structural sketch follows below. The sequential concatenation of four encoder layers enables an accurate representation of the degree of correlation between objects in the image and a deep mining of the semantic associations hidden between them. Of course, the scheme is not limited to four encoder-decoder connections.
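Structurally, the four-level loop described above might be wired as in this sketch, where each encoder level feeds both the next encoder level and the decoder level with the same serial number; the callables stand in for the modules defined earlier and are assumptions of this illustration.

```python
from typing import Callable, List
import numpy as np

Array = np.ndarray

def describe(X: Array, Y: Array,
             encoders: List[Callable[[Array], Array]],
             decoders: List[Callable[[Array, Array], Array]]) -> Array:
    # Four paired levels: encoder i's output becomes encoder i+1's input (the new
    # image features) and is also decoded by decoder i; decoder i's output (the word
    # probability distribution) becomes decoder i+1's sequence input.
    for enc, dec in zip(encoders, decoders):   # len == 4 in the patent's embodiment
        X = enc(X)                              # encoded information of this level
        Y = dec(Y, X)                           # decode against this level's features
    return Y                                    # final word probability distribution

# Usage with trivial stand-in modules:
encs = [lambda x: x for _ in range(4)]
decs = [lambda y, x: y for _ in range(4)]
out = describe(np.random.randn(36, 512), np.random.randn(10, 512), encs, decs)
```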
An attention mechanism based image description system comprising:
an encoder module for performing the encoding step and processing the image features to obtain the encoded information;
a decoder module for performing the decoding step, acquiring the sequence vector information and decoding the encoded information to obtain the word probability distribution;
and a loop module for repeating the encoding and decoding steps until the preset number of iterations is reached and outputting the final image description.
The content in the method embodiment is applicable to the system embodiment, the functions specifically realized by the system embodiment are the same as those of the method embodiment, and the achieved beneficial effects are the same as those of the method embodiment.
An image description device based on an attention mechanism, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement an attention mechanism based image description method as described above.
The content in the method embodiment is applicable to the embodiment of the device, and the functions specifically realized by the embodiment of the device are the same as those of the method embodiment, and the obtained beneficial effects are the same as those of the method embodiment.
While the preferred embodiment of the present application has been described in detail, the application is not limited to the embodiment, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims (7)

1. An image description method based on an attention mechanism, which is characterized by comprising the following steps:
S1, acquiring the image features X of an input image and computing dot products of the preset-size weight matrices W_q, W_k and W_v with the image features X to obtain the vector sets Q, K_1 and V_1 representing the features;
S2, inserting the semantic association vectors S_k and S_v into the vector sets K_1 and V_1 respectively, obtaining the vector sets K_2 and V_2;
S3, inputting the vector sets Q, K_2 and V_2 into the self-attention module S to obtain the feature representation S(X);
S4, regularizing the feature representation S(X) through forward propagation and residual connection to obtain the encoded information X̃;
S5, acquiring the sequence vector information Y of the previous time step and processing it with the masked self-attention module to obtain the query vector Y_q;
S6, applying a linear transformation to the encoded information X̃ to obtain new vector sets K_2 and V_2;
S7, inputting the query vector Y_q and the vector sets K_2 and V_2 into the cross-attention module to obtain the decoding result C, then applying residual connection and regularization to update C;
S8, passing the decoding result C through a Sigmoid operator and forward propagation to obtain the word probability distribution Ỹ;
S9, taking the encoded information X̃ as the new image features and the word probability distribution Ỹ as the new sequence vector representation, and returning to step S1 until the number of loops reaches four, whereupon the image description is output.
2. The image description method based on an attention mechanism according to claim 1, characterized in that the self-attention module S consists of a basic scaled dot-product operation, and in the step of inputting the vector sets Q, K_2 and V_2 into the self-attention module S to obtain the feature representation S(X), the formulas are:
S(X) = Attention(W_qX, [W_kX, S_k], [W_vX, S_v])
Attention(Q, K, V) = softmax(QK^T / √d_k)V
where Attention(·) denotes the self-attention operator, softmax(·) compresses the elements of a matrix into the range (0, 1), and d_k denotes the dimension of the feature vectors.
3. The image description method based on an attention mechanism according to claim 2, characterized in that in the step of regularizing the feature representation S(X) through forward propagation and residual connection to obtain the encoded information X̃, the formulas are:
Z = AddNorm(S(X))
F(Z)_i = Mσ(WZ_i + b) + c
where AddNorm(·) consists of a residual connection followed by layer normalization, Z_i denotes the i-th input vector, F(Z)_i denotes the i-th forward-propagation output vector, M and W denote learnable weight parameters, b and c denote learnable bias terms, and σ(·) denotes an activation function.
4. The image description method based on an attention mechanism according to claim 3, characterized in that in the step of inputting the query vector Y_q and the vector sets K_2 and V_2 into the cross-attention module to obtain the decoding result C, the formula is:
C = Attention_c(W_qY_q, K_2, V_2), where K_2 = W_kX̃ and V_2 = W_vX̃
and where Attention_c(·) denotes the cross-attention operator and W_k, W_v and W_q denote learnable weight parameters.
5. The image description method based on an attention mechanism according to claim 4, characterized in that in the step of passing the decoding result C through a Sigmoid operator and forward propagation to obtain the word probability distribution Ỹ, [·,·] denotes matrix concatenation, Z denotes the result of applying the Sigmoid activation function, and F(Z) denotes the result of the subsequent forward-propagation processing.
6. An image description system based on an attention mechanism, characterized by comprising:
an encoder module for performing the encoding step: acquiring the image features X of an input image and computing dot products of the preset-size weight matrices W_q, W_k and W_v with the image features X to obtain the vector sets Q, K_1 and V_1 representing the features; inserting the semantic association vectors S_k and S_v into the vector sets K_1 and V_1 respectively, obtaining the vector sets K_2 and V_2; inputting the vector sets Q, K_2 and V_2 into the self-attention module S to obtain the feature representation S(X); and regularizing the feature representation S(X) through forward propagation and residual connection to obtain the encoded information X̃;
a decoder module for performing the decoding step: acquiring the sequence vector information Y of the previous time step and processing it with the masked self-attention module to obtain the query vector Y_q; applying a linear transformation to the encoded information X̃ to obtain new vector sets K_2 and V_2; inputting the query vector Y_q and the vector sets K_2 and V_2 into the cross-attention module to obtain the decoding result C, then applying residual connection and regularization to update C; and passing the decoding result C through a Sigmoid operator and forward propagation to obtain the word probability distribution Ỹ;
a loop module for taking the encoded information X̃ as the new image features and the word probability distribution Ỹ as the new sequence vector representation, and returning to the first step until the number of loops reaches four, whereupon the image description is output.
7. An attention mechanism based image description device, comprising:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor is caused to implement the image description method based on an attention mechanism as claimed in any one of claims 1 to 5.
CN202110457256.5A 2021-04-27 2021-04-27 Image description method, system and device based on attention mechanism Active CN113095431B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110457256.5A CN113095431B (en) 2021-04-27 2021-04-27 Image description method, system and device based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110457256.5A CN113095431B (en) 2021-04-27 2021-04-27 Image description method, system and device based on attention mechanism

Publications (2)

Publication Number Publication Date
CN113095431A CN113095431A (en) 2021-07-09
CN113095431B (en) 2023-08-18

Family

ID=76680498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110457256.5A Active CN113095431B (en) 2021-04-27 2021-04-27 Image description method, system and device based on attention mechanism

Country Status (1)

Country Link
CN (1) CN113095431B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114186568B (en) * 2021-12-16 2022-08-02 北京邮电大学 Image paragraph description method based on relational coding and hierarchical attention mechanism
CN114399646B (en) * 2021-12-21 2022-09-20 北京中科明彦科技有限公司 Image description method and device based on transform structure
CN114581543A (en) * 2022-03-28 2022-06-03 济南博观智能科技有限公司 Image description method, device, equipment and storage medium


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10552709B2 (en) * 2016-10-05 2020-02-04 Ecole Polytechnique Federale De Lausanne (Epfl) Method, system, and device for learned invariant feature transform for computer images
RU2021116658A (en) * 2017-05-23 2021-07-05 Google LLC Attention-based sequence transduction neural networks
US11176330B2 (en) * 2019-07-22 2021-11-16 Advanced New Technologies Co., Ltd. Generating recommendation information
US11615240B2 (en) * 2019-08-15 2023-03-28 Salesforce.Com, Inc Systems and methods for a transformer network with tree-based attention for natural language processing

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543820A (en) * 2018-11-23 2019-03-29 Sun Yat-sen University Image description generation method based on framework short sentence constrained vector and dual visual attention mechanism
WO2020190112A1 (en) * 2019-03-21 2020-09-24 Samsung Electronics Co., Ltd. Method, apparatus, device and medium for generating captioning information of multimedia data
CN111723937A (en) * 2019-03-21 2020-09-29 北京三星通信技术研究有限公司 Method, device, equipment and medium for generating description information of multimedia data
CN110427605A (en) * 2019-05-09 2019-11-08 苏州大学 The Ellipsis recovering method understood towards short text
CN110458282A (en) * 2019-08-06 2019-11-15 齐鲁工业大学 Multi-angle multi-mode fused image description generation method and system
CN112329794A (en) * 2020-11-06 2021-02-05 北京工业大学 Image description method based on double self-attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Image Description Generation Based on a Bidirectional Attention Mechanism; Zhang Jiashuo; Journal of Chinese Information Processing; 2020-09-30; Vol. 34, No. 9; pp. 53-61 *

Also Published As

Publication number Publication date
CN113095431A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
CN113095431B (en) Image description method, system and device based on attention mechanism
Zhang et al. Improved deep hashing with soft pairwise similarity for multi-label image retrieval
US11657230B2 (en) Referring image segmentation
CN115203380A (en) Text processing system and method based on multi-mode data fusion
CN109711463A (en) Important object detection method based on attention
CN112801280B (en) One-dimensional convolution position coding method of visual depth self-adaptive neural network
CN112507995B (en) Cross-model face feature vector conversion system and method
CN110516530A (en) A kind of Image Description Methods based on the enhancing of non-alignment multiple view feature
CN113076957A (en) RGB-D image saliency target detection method based on cross-modal feature fusion
CN113837233B (en) Image description method of self-attention mechanism based on sample self-adaptive semantic guidance
CN110619124A (en) Named entity identification method and system combining attention mechanism and bidirectional LSTM
CN114065771A (en) Pre-training language processing method and device
CN114708436B (en) Training method of semantic segmentation model, semantic segmentation method, semantic segmentation device and semantic segmentation medium
CN115831105A (en) Speech recognition method and device based on improved Transformer model
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
CN115455226A (en) Text description driven pedestrian searching method
CN117496388A (en) Cross-modal video description model based on dynamic memory network
CN113762459A (en) Model training method, text generation method, device, medium and equipment
CN112199531A (en) Cross-modal retrieval method and device based on Hash algorithm and neighborhood map
CN116524407A (en) Short video event detection method and device based on multi-modal representation learning
CN114399646B (en) Image description method and device based on transform structure
WO2023168818A1 (en) Method and apparatus for determining similarity between video and text, electronic device, and storage medium
CN114550159A (en) Image subtitle generating method, device and equipment and readable storage medium
Ouenniche et al. Vision-text cross-modal fusion for accurate video captioning
Xie et al. Global-Shared Text Representation Based Multi-Stage Fusion Transformer Network for Multi-Modal Dense Video Captioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant