CN113095431A - Image description method, system and device based on attention mechanism - Google Patents

Image description method, system and device based on attention mechanism

Info

Publication number
CN113095431A
Authority
CN
China
Prior art keywords
attention
image
vector
information
image description
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110457256.5A
Other languages
Chinese (zh)
Other versions
CN113095431B (en)
Inventor
胡海峰
夏志武
吴永波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202110457256.5A
Publication of CN113095431A
Application granted
Publication of CN113095431B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an image description method, system and device based on an attention mechanism. The method comprises the following steps: processing image features with an encoder module to obtain encoded information; obtaining sequence vector information with a decoder module and decoding the encoded information to obtain a word probability distribution; and repeating the encoding and decoding steps until a preset number of iterations is reached, then outputting the image description. The system comprises an encoder module, a decoder module and a loop module. The device comprises a memory and a processor for performing the above attention-based image description method. With the method and device, the internal semantic relations and spatial position relations hidden between objects in an image can be fully mined, and comprehensive, accurate image descriptions generated. The image description method, system and device based on the attention mechanism can be widely applied to image description generation.

Description

Image description method, system and device based on attention mechanism
Technical Field
The invention relates to the field of image description generation, in particular to an image description method, system and device based on an attention mechanism.
Background
Image description generation is a challenging task in the field of artificial intelligence and is receiving more and more attention. It opens new development and application prospects for computers to rapidly acquire information from images, and is closely related to image semantic analysis, image annotation, high-level semantic extraction and the like. In image description generation, a computer automatically generates a complete and fluent descriptive sentence for an image. In the context of big data, the technology has wide application in the commercial field: retrieving suitable commodities when a user inputs keywords in shopping software; image search in search engines; recognition of multiple event targets in video; professional automatic semantic annotation of medical images; recognition of target objects in autonomous driving; image retrieval; intelligent guidance for blind people; human-computer interaction; and so on. However, conventional image description generation methods suffer from insufficient mining of the semantic information implicit in an image and insufficient use of image features, so the generated descriptions are neither accurate nor comprehensive.
Disclosure of Invention
To solve the above technical problems, the invention aims to provide an image description method, system and device based on an attention mechanism that deeply mine the semantic relations among objects in an image and generate more flexible and accurate text descriptions.
The first technical scheme adopted by the invention is as follows: an attention mechanism-based image description method, comprising the steps of:
acquiring image features X of an input image and performing a linear transformation on X to obtain the vector sets Q, K1 and V1;
inserting the semantic association vectors Sk and Sv into the vector sets K1 and V1 respectively to obtain the vector sets K2 and V2;
inputting the vector sets Q, K2 and V2 into the self-attention module S to obtain the feature representation S(X);
applying forward propagation and residual-connection regularization to the feature representation S(X) to obtain the encoded information X̂;
acquiring the sequence vector information Y of the previous time step and processing Y with the masked self-attention module to obtain the query vector Yq;
obtaining the vector sets K2 and V2 from the encoded information X̂ by linear transformation;
inputting the query vector Yq and the vector sets K2 and V2 into the cross-attention module to obtain the decoding result C, then updating C by residual connection and regularization;
passing C through a Sigmoid operator and forward propagation to obtain the word probability distribution Ŷ;
taking the encoded information X̂ as the new image features and the word probability distribution Ŷ as the new sequence vector representation, returning to step S1 until the number of loops reaches four, and outputting the image description.
Further, performing a linear transformation on the image features X to obtain the vector sets Q, K1 and V1 specifically comprises:
taking the dot product of preset weight matrices Wq, Wk and Wv with the image features X to obtain the vector sets Q, K1 and V1 of the corresponding representation features.
Further, the self-attention module S consists of basic scaled matrix dot-product operations, and inputting the vector sets Q, K2 and V2 into the self-attention module S to obtain the feature representation S(X) is formulated as:
Attention(Q, K, V) = Softmax(QK^T / sqrt(d)) V
S(X) = Attention(WqX, [WkX, Sk], [WvX, Sv])
where Attention(·) denotes the self-attention operator, Softmax(·) compresses the elements of a matrix into the range (0, 1), and d denotes the dimension of the feature vectors.
Further, applying forward propagation and residual-connection regularization to the feature representation S(X) to obtain the encoded information X̂ is formulated as:
Z = AddNorm(S(X))
F(Z)_i = Mσ(WZ_i + b) + c
X̂ = AddNorm(F(Z))
where AddNorm(·) consists of residual connection and layer normalization, Z_i denotes the i-th input vector, F(Z)_i denotes the i-th forward-propagation output vector, M and W denote learnable weight parameters, b and c denote learnable bias terms, and σ(·) denotes the activation function.
Further, inputting the query vector Yq and the vector sets K2 and V2 into the cross-attention module to obtain C is formulated as:
Attention_c(Yq, K2, V2) = Softmax(Yq K2^T / sqrt(d)) V2
C = AddNorm(Attention_c(Yq, K2, V2))
where Attention_c(·) denotes the cross-attention operator and Wk, Wv and Wq denote the learnable weight parameters that produce the query and the vector sets K2 and V2.
Further, passing C through the Sigmoid operator and forward propagation to obtain the word probability distribution Ŷ is expressed by four equations [rendered as images in the original publication] in which [·,·] denotes matrix concatenation, a_i denotes the gating coefficients produced by the Sigmoid operator, Z denotes the gated intermediate representation, and F(Z) denotes the result of the forward propagation.
The second technical scheme adopted by the invention is as follows: an attention-based image description system, comprising:
an encoder module for executing the encoding step and processing the image features to obtain the encoded information;
a decoder module for executing the decoding step, acquiring the sequence vector information and decoding the encoded information to obtain the word probability distribution;
and a loop module for repeating the encoding and decoding steps until the preset number of times is reached and outputting the final image description.
The third technical scheme adopted by the invention is as follows: an attention-based image description apparatus comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the attention-based image description method described above.
The method, system and device have the following beneficial effects: the invention adopts a novel attention mechanism and uses a multi-stage encoder to process the input image features in the encoding stage, which enables a precise measurement of the degree of association between objects in the image and deep mining of the semantic associations hidden between them, improving description quality while reducing model complexity.
Drawings
FIG. 1 is a flow chart of steps of an embodiment of the present invention;
FIG. 2 is a diagram illustrating the insertion of an association vector into a vector set according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration; the order between the steps is not limited, and the execution order of the steps in the embodiments can be adapted according to the understanding of those skilled in the art.
Referring to fig. 1, the present invention provides an attention-based image description method, comprising the steps of:
acquiring image features X of an input image and performing a linear transformation on X to obtain the vector sets Q, K1 and V1;
inserting the semantic association vectors Sk and Sv into the vector sets K1 and V1 respectively to obtain the vector sets K2 and V2: as shown in FIG. 2, the word embedding representation R and the image features X are matrix-multiplied to obtain the multi-modal semantic association information RX, and then weight matrices W1 and W2 with randomly set initial values linearly transform the semantic association information RX into the vector sets Sk and Sv, thereby obtaining the semantic information implicit in the image features X. It should be noted that the implicit semantic information obtained here is not contained in the Q, K and V converted from the image features.
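As an illustration only, the following PyTorch sketch shows one possible reading of this step; the tensor shapes, the softmax normalization of the word-region association scores, and all variable names are assumptions of the sketch rather than details taken from the patent:

```python
import torch

torch.manual_seed(0)
N, V, d = 49, 1000, 512        # assumed: N image regions, V vocabulary words, d feature dim

X = torch.randn(N, d)          # image features X
R = torch.randn(V, d)          # word embedding representation R

# Multi-modal semantic association information RX: matrix-multiply the word
# embeddings with the image features to score word-region associations, then
# carry the word embeddings back onto the regions.
assoc = torch.softmax(X @ R.t() / d ** 0.5, dim=-1)   # (N, V) association scores
RX = assoc @ R                                        # (N, d)

# Randomly initialised weight matrices W1, W2 linearly transform RX into the
# semantic association vector sets Sk and Sv.
W1 = torch.nn.Linear(d, d, bias=False)
W2 = torch.nn.Linear(d, d, bias=False)
Sk = W1(RX)                    # (N, d)
Sv = W2(RX)                    # (N, d)
```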
inputting the vector sets Q, K2 and V2 into the self-attention module S to obtain the feature representation S(X);
applying forward propagation and residual-connection regularization to the feature representation S(X) to obtain the encoded information X̂;
acquiring the sequence vector information Y of the previous time step and processing Y with the masked self-attention module to obtain the query vector Yq;
obtaining the vector sets K2 and V2 from the encoded information X̂ by linear transformation;
inputting the query vector Yq and the vector sets K2 and V2 into the cross-attention module to obtain the decoding result C, then updating C by residual connection and regularization;
specifically, C is obtained by feeding the encoded information and the prediction result of the previous time step into cross attention; it is the preliminary decoding result of the current time step and is further processed to obtain the final decoding result of the current time step.
passing C through a Sigmoid operator and forward propagation to obtain the word probability distribution Ŷ;
taking the encoded information X̂ as the new image features and the word probability distribution Ŷ as the new sequence vector representation, returning to step S1 until the number of loops reaches four, and outputting the image description.
Further as a preferred embodiment of the method, performing a linear transformation on the image features X to obtain the vector sets Q, K1 and V1 specifically comprises:
taking the dot product of preset weight matrices Wq, Wk and Wv with the image features X to obtain the vector sets Q, K1 and V1 of the corresponding representation features.
Specifically, Q = WqX, K1 = WkX, V1 = WvX.
Further as a preferred embodiment of the method, inserting the semantic association vectors Sk and Sv into the vector sets K1 and V1 respectively to obtain the vector sets K2 and V2 is expressed as:
K2 = [WkX, Sk]
V2 = [WvX, Sv]
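A minimal sketch of how K2 and V2 might be assembled, assuming Sk and Sv have already been computed as above and that "insertion" is concatenation along the token axis (the shapes are illustrative):

```python
import torch

N, d = 49, 512
X = torch.randn(N, d)                    # image features X
Wq = torch.nn.Linear(d, d, bias=False)   # preset weight matrices Wq, Wk, Wv
Wk = torch.nn.Linear(d, d, bias=False)
Wv = torch.nn.Linear(d, d, bias=False)

Q, K1, V1 = Wq(X), Wk(X), Wv(X)          # Q = WqX, K1 = WkX, V1 = WvX

Sk = torch.randn(N, d)                   # semantic association vectors (stand-ins here)
Sv = torch.randn(N, d)

K2 = torch.cat([K1, Sk], dim=0)          # K2 = [WkX, Sk], shape (2N, d)
V2 = torch.cat([V1, Sv], dim=0)          # V2 = [WvX, Sv], shape (2N, d)
```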
further as a preferred embodiment of the method, the self-attention module S consists of basic scaling matrix dot product operations, the set of vectors Q, K2And V2The step of inputting the feature information S (X) from the attention module S is to obtain the feature information S (X), and the formula is as follows:
Figure BDA0003040910770000051
S(X)=Attention(WqX,[WkX,Sk],[WvX,Sv])
wherein Attenttion (-) denotes a self-Attention operation operator, Softmax (-) denotes a range of compressing elements of a matrix to (0, 1),
Figure BDA0003040910770000052
representing the dimensions of the feature vector.
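Under those formulas, the self-attention computation itself reduces to a few lines; this sketch assumes single-head attention and the shapes from the previous sketches:

```python
import torch

def attention(Q, K, V):
    # Attention(Q, K, V) = Softmax(Q K^T / sqrt(d)) V
    d = Q.size(-1)
    return torch.softmax(Q @ K.transpose(-2, -1) / d ** 0.5, dim=-1) @ V

Q = torch.randn(49, 512)        # WqX
K2 = torch.randn(98, 512)       # [WkX, Sk]
V2 = torch.randn(98, 512)       # [WvX, Sv]
S_X = attention(Q, K2, V2)      # feature representation S(X), shape (49, 512)
```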
Further as a preferred embodiment of the method, applying forward propagation and residual-connection regularization to the feature representation S(X) to obtain the encoded information X̂ is formulated as:
Z = AddNorm(S(X))
F(Z)_i = Mσ(WZ_i + b) + c
X̂ = AddNorm(F(Z))
where AddNorm(·) consists of residual connection and layer normalization, Z_i denotes the i-th input vector, F(Z)_i denotes the i-th forward-propagation output vector, M and W denote learnable weight parameters, b and c denote learnable bias terms, and σ(·) denotes the activation function.
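A sketch of the AddNorm and forward-propagation step; the hidden width d_ff and the choice of ReLU as the activation σ are assumptions of this sketch:

```python
import torch

N, d, d_ff = 49, 512, 2048
norm1, norm2 = torch.nn.LayerNorm(d), torch.nn.LayerNorm(d)

W = torch.nn.Linear(d, d_ff)     # computes W Z_i + b
M = torch.nn.Linear(d_ff, d)     # computes M sigma(.) + c

X_in = torch.randn(N, d)         # sub-layer input (residual branch)
S_X = torch.randn(N, d)          # stand-in for the self-attention output S(X)

Z = norm1(X_in + S_X)            # Z = AddNorm(S(X))
F_Z = M(torch.relu(W(Z)))        # F(Z)_i = M sigma(W Z_i + b) + c, position-wise
X_hat = norm2(Z + F_Z)           # encoded information X-hat = AddNorm(F(Z))
```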
As a preferred embodiment of the method, acquiring the sequence vector information Y of the previous time step and processing Y with the masked self-attention module to obtain the query vector Yq is expressed as:
Attention_m(Y) = Softmax((WqY)(WkY)^T / sqrt(d) + M)(WvY)
Yq = AddNorm(Attention_m(Y))
where M denotes the mask matrix.
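A sketch of the masked self-attention over the previously generated sequence; the additive -inf mask and the single-head form are assumptions of the sketch:

```python
import torch

T, d = 10, 512                   # assumed: T tokens decoded so far, feature dim d
Y = torch.randn(T, d)            # sequence vector information Y
Wq = torch.nn.Linear(d, d, bias=False)
Wk = torch.nn.Linear(d, d, bias=False)
Wv = torch.nn.Linear(d, d, bias=False)
norm = torch.nn.LayerNorm(d)

# Mask matrix M: -inf above the diagonal forbids attending to future positions.
mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)

scores = Wq(Y) @ Wk(Y).transpose(-2, -1) / d ** 0.5
attn_m = torch.softmax(scores + mask, dim=-1) @ Wv(Y)   # Attention_m(Y)
Yq = norm(Y + attn_m)            # Yq = AddNorm(Attention_m(Y))
```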
Further as a preferred embodiment of the method, inputting the query vector Yq and the vector sets K2 and V2 into the cross-attention module to obtain C is formulated as:
Attention_c(Yq, K2, V2) = Softmax(Yq K2^T / sqrt(d)) V2
C = AddNorm(Attention_c(Yq, K2, V2))
where Attention_c(·) denotes the cross-attention operator and Wk, Wv and Wq denote the learnable weight parameters that produce the query and the vector sets K2 and V2.
Further as a preferred embodiment of the method, passing C through the Sigmoid operator and forward propagation to obtain the word probability distribution Ŷ is expressed by four equations [rendered as images in the original publication] in which [·,·] denotes matrix concatenation, a_i denotes the gating coefficients produced by the Sigmoid operator, Z denotes the gated intermediate representation, and F(Z) denotes the result of the forward propagation.
The upper part of FIG. 1 shows the connection diagram of the encoder and the decoder: the output of each encoder layer serves as the input of the next encoder layer, the output of each decoder layer serves as the input of the next decoder layer, and the encoder and decoder layers with the same sequence number are additionally connected, so that every stage of the encoder's multi-stage features is fully decoded in the decoder and the loss of initial image features, which would affect the quality of the final image description, is avoided. The sequential cascade of four encoder layers enables an accurate representation of the degree of correlation between objects in an image and deep mining of the semantic correlations hidden between them. Of course, the connection is not limited to four encoder-decoder layers.
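A sketch of this wiring, assuming encoder_layers and decoder_layers are lists of four callables implementing the layers sketched above (the function name and signatures are illustrative, not taken from the patent):

```python
def run_stack(encoder_layers, decoder_layers, X, Y):
    # Cascade the encoder layers, keeping every stage's output.
    enc_outputs = []
    for enc in encoder_layers:
        X = enc(X)
        enc_outputs.append(X)
    # Each decoder layer consumes the output of the encoder layer with the
    # same sequence number, so no stage of the multi-stage features is lost.
    for dec, X_hat in zip(decoder_layers, enc_outputs):
        Y = dec(Y, X_hat)
    return Y
```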
An attention-based image description system, comprising:
an encoder module for executing the encoding step and processing the image features to obtain the encoded information;
a decoder module for executing the decoding step, acquiring the sequence vector information and decoding the encoded information to obtain the word probability distribution;
and a loop module for repeating the encoding and decoding steps until the preset number of times is reached and outputting the final image description.
The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.
An attention-based image description apparatus, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the attention-based image description method described above.
The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. An attention mechanism-based image description method, characterized by comprising the following steps:
acquiring image features X of an input image and performing a linear transformation on X to obtain the vector sets Q, K1 and V1;
inserting the semantic association vectors Sk and Sv into the vector sets K1 and V1 respectively to obtain the vector sets K2 and V2;
inputting the vector sets Q, K2 and V2 into the self-attention module S to obtain the feature representation S(X);
applying forward propagation and residual-connection regularization to the feature representation S(X) to obtain the encoded information X̂;
acquiring the sequence vector information Y of the previous time step and processing Y with the masked self-attention module to obtain the query vector Yq;
obtaining the vector sets K2 and V2 from the encoded information X̂ by linear transformation;
inputting the query vector Yq and the vector sets K2 and V2 into the cross-attention module to obtain the decoding result C, then updating C by residual connection and regularization;
passing the decoding result C through a Sigmoid operator and forward propagation to obtain the word probability distribution Ŷ;
taking the encoded information X̂ as the new image features and the word probability distribution Ŷ as the new sequence vector representation, returning to step S1 until the number of loops reaches four, and outputting the image description.
2. The attention mechanism-based image description method according to claim 1, wherein performing a linear transformation on the image features X to obtain the vector sets Q, K1 and V1 specifically comprises:
taking the dot product of preset weight matrices Wq, Wk and Wv with the image features X to obtain the vector sets Q, K1 and V1 of the corresponding features.
3. The attention mechanism-based image description method according to claim 2, wherein the self-attention module S consists of basic scaled matrix dot-product operations, and inputting the vector sets Q, K2 and V2 into the self-attention module S to obtain the feature representation S(X) is formulated as:
Attention(Q, K, V) = Softmax(QK^T / sqrt(d)) V
S(X) = Attention(WqX, [WkX, Sk], [WvX, Sv])
where Attention(·) denotes the self-attention operator, Softmax(·) compresses the elements of a matrix into the range (0, 1), and d denotes the dimension of the feature vectors.
4. The attention mechanism-based image description method according to claim 3, wherein applying forward propagation and residual-connection regularization to the feature representation S(X) to obtain the encoded information X̂ is formulated as:
Z = AddNorm(S(X))
F(Z)_i = Mσ(WZ_i + b) + c
X̂ = AddNorm(F(Z))
where AddNorm(·) consists of residual connection and layer normalization, Z_i denotes the i-th input vector, F(Z)_i denotes the i-th forward-propagation output vector, M and W denote learnable weight parameters, b and c denote learnable bias terms, and σ(·) denotes the activation function.
5. The attention mechanism-based image description method according to claim 4, wherein inputting the query vector Yq and the vector sets K2 and V2 into the cross-attention module to obtain the decoding result C is formulated as:
Attention_c(Yq, K2, V2) = Softmax(Yq K2^T / sqrt(d)) V2
C = AddNorm(Attention_c(Yq, K2, V2))
where Attention_c(·) denotes the cross-attention operator and Wk, Wv and Wq denote the learnable weight parameters that produce the query and the vector sets K2 and V2.
6. The attention mechanism-based image description method according to claim 5, wherein passing the decoding result C through the Sigmoid operator and forward propagation to obtain the word probability distribution Ŷ is expressed by four equations [rendered as images in the original publication] in which [·,·] denotes matrix concatenation, a_i denotes the gating coefficients produced by the Sigmoid operator, Z denotes the gated intermediate representation, and F(Z) denotes the result of the forward propagation.
7. An attention-based image description system, characterized by comprising:
an encoder module for executing the encoding step and processing the image features to obtain the encoded information;
a decoder module for executing the decoding step, acquiring the sequence vector information and decoding the encoded information to obtain the word probability distribution;
and a loop module for repeating the encoding and decoding steps until the preset number of times is reached and outputting the final image description.
8. An attention-based image description apparatus, characterized by comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the attention mechanism-based image description method according to any one of claims 1-6.
CN202110457256.5A 2021-04-27 2021-04-27 Image description method, system and device based on attention mechanism Active CN113095431B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110457256.5A CN113095431B (en) 2021-04-27 2021-04-27 Image description method, system and device based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110457256.5A CN113095431B (en) 2021-04-27 2021-04-27 Image description method, system and device based on attention mechanism

Publications (2)

Publication Number Publication Date
CN113095431A true CN113095431A (en) 2021-07-09
CN113095431B CN113095431B (en) 2023-08-18

Family

ID=76680498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110457256.5A Active CN113095431B (en) 2021-04-27 2021-04-27 Image description method, system and device based on attention mechanism

Country Status (1)

Country Link
CN (1) CN113095431B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180096224A1 (en) * 2016-10-05 2018-04-05 Ecole Polytechnique Federale De Lausanne (Epfl) Method, System, and Device for Learned Invariant Feature Transform for Computer Images
US20180341860A1 (en) * 2017-05-23 2018-11-29 Google Llc Attention-based sequence transduction neural networks
CN109543820A (en) * 2018-11-23 2019-03-29 中山大学 Iamge description generation method based on framework short sentence constrained vector and dual visual attention location mechanism
WO2020190112A1 (en) * 2019-03-21 2020-09-24 Samsung Electronics Co., Ltd. Method, apparatus, device and medium for generating captioning information of multimedia data
CN111723937A (en) * 2019-03-21 2020-09-29 北京三星通信技术研究有限公司 Method, device, equipment and medium for generating description information of multimedia data
CN110427605A (en) * 2019-05-09 2019-11-08 苏州大学 The Ellipsis recovering method understood towards short text
US20210027018A1 (en) * 2019-07-22 2021-01-28 Advanced New Technologies Co., Ltd. Generating recommendation information
CN110458282A (en) * 2019-08-06 2019-11-15 齐鲁工业大学 Multi-angle multi-mode fused image description generation method and system
US20210049236A1 (en) * 2019-08-15 2021-02-18 Salesforce.Com, Inc. Systems and methods for a transformer network with tree-based attention for natural language processing
CN112329794A (en) * 2020-11-06 2021-02-05 北京工业大学 Image description method based on double self-attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张家硕 (ZHANG Jiashuo): "基于双向注意力机制的图像描述生成" [Image description generation based on a bidirectional attention mechanism], 《中文信息学报》 [Journal of Chinese Information Processing] *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114186568A (en) * 2021-12-16 2022-03-15 北京邮电大学 Image paragraph description method based on relational coding and hierarchical attention mechanism
CN114399646A (en) * 2021-12-21 2022-04-26 北京中科明彦科技有限公司 Image description method and device based on Transformer structure
CN114581543A (en) * 2022-03-28 2022-06-03 济南博观智能科技有限公司 Image description method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113095431B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
US11657230B2 (en) Referring image segmentation
CN113095431B (en) Image description method, system and device based on attention mechanism
CN108846077B (en) Semantic matching method, device, medium and electronic equipment for question and answer text
CN113312500A (en) Method for constructing event map for safe operation of dam
CN111522936B (en) Intelligent customer service dialogue reply generation method and device containing emotion and electronic equipment
CN116662582B (en) Specific domain business knowledge retrieval method and retrieval device based on natural language
CN113837233B (en) Image description method of self-attention mechanism based on sample self-adaptive semantic guidance
CN113609326B (en) Image description generation method based on relationship between external knowledge and target
CN113392265A (en) Multimedia processing method, device and equipment
CN115203409A (en) Video emotion classification method based on gating fusion and multitask learning
CN115994317A (en) Incomplete multi-view multi-label classification method and system based on depth contrast learning
Hossain et al. Bi-SAN-CAP: Bi-directional self-attention for image captioning
CN115831105A (en) Speech recognition method and device based on improved Transformer model
Zhuang et al. Improving remote sensing image captioning by combining grid features and transformer
CN117746078B (en) Object detection method and system based on user-defined category
CN117635275B (en) Intelligent electronic commerce operation commodity management platform and method based on big data
CN114120166A (en) Video question and answer method and device, electronic equipment and storage medium
CN113869324A (en) Video common-sense knowledge reasoning implementation method based on multi-mode fusion
CN117315249A (en) Image segmentation model training and segmentation method, system, equipment and medium
CN117496388A (en) Cross-modal video description model based on dynamic memory network
CN116524407A (en) Short video event detection method and device based on multi-modal representation learning
CN115658856A (en) Intelligent question-answering system and method based on polymorphic document views
CN116310984B (en) Multi-mode video subtitle generating method based on Token sampling
CN111666395A (en) Interpretable question answering method and device oriented to software defects, computer equipment and storage medium
Ouenniche et al. Vision-text cross-modal fusion for accurate video captioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant