CN113095431A - Image description method, system and device based on attention mechanism - Google Patents
Image description method, system and device based on attention mechanism
- Publication number
- CN113095431A (application number CN202110457256.5A)
- Authority
- CN
- China
- Prior art keywords
- attention
- image
- vector
- information
- image description
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
- Image Processing (AREA)
Abstract
The invention discloses an image description method, system and device based on an attention mechanism. The method comprises the following steps: processing image features with an encoder module to obtain encoded information; obtaining sequence vector information with a decoder module and decoding the encoded information to obtain a word probability distribution; and repeating the encoding and decoding steps until a preset number of iterations is reached, then outputting the image description. The system comprises an encoder module, a decoder module and a loop module. The apparatus comprises a memory and a processor for performing the above attention-mechanism-based image description method. With the method, system and device, the hidden internal semantic relations and spatial position relations between objects in an image can be fully mined, and a comprehensive and accurate image description is generated. The attention-mechanism-based image description method, system and device can be widely applied to image description generation.
Description
Technical Field
The invention relates to the field of image description generation, in particular to an image description method, system and device based on an attention mechanism.
Background
Image description generation is a challenging task in the field of artificial intelligence and is receiving increasing attention. It opens new development and application prospects for computers to rapidly acquire information from images, and it is closely related to image semantic analysis, image annotation, high-level semantic extraction and the like. In image description generation, a computer automatically produces a complete and fluent descriptive sentence for an image. In the context of big data, the technology has wide commercial applications: quickly retrieving matching goods when a user enters keywords in shopping software; searching pictures in a search engine; identifying targets across multiple events in video; professional automatic semantic annotation of medical images; recognizing target objects in automatic driving; image retrieval; intelligent guidance for the blind; human-computer interaction; and so on. However, conventional image description generation methods mine the semantic information implicit in an image insufficiently and under-utilize image features, so the generated descriptions are neither accurate nor comprehensive.
Disclosure of Invention
In order to solve the above technical problems, the object of the invention is to provide an image description method, system and device based on an attention mechanism, which deeply mine the semantic relations among objects in an image and generate more flexible and accurate text descriptions.
The first technical scheme adopted by the invention is as follows: an attention mechanism-based image description method, comprising the steps of:
acquiring image features X of an input image and linearly transforming X to obtain the vector sets Q, K1 and V1;
inserting the semantic association vectors Sk and Sv into the vector sets K1 and V1, respectively, to obtain the vector sets K2 and V2;
inputting the vector sets Q, K2 and V2 into the self-attention module S to obtain the feature representation S(X);
applying forward propagation and residual-connection regularization to the feature representation S(X) to obtain the encoded information;
acquiring the sequence vector information Y of the previous time step and processing Y with the masked self-attention module to obtain the query vector Yq;
inputting the query vector Yq and the vector sets K2 and V2 into the cross-attention module to obtain the decoding result C, and further updating C with residual connection and regularization;
taking the encoded information as the new image features and the word probability distribution as the new sequence vector representation, returning to step S1 until the number of loops reaches four, and then outputting the image description.
Further, the step of linearly transforming the image feature X to obtain the vector sets Q, K1 and V1 specifically comprises:
taking the dot product of the preset weight matrices Wq, Wk and Wv with the image feature X to obtain the vector sets Q, K1 and V1 of the corresponding representation features.
Further, the self-attention module S consists of a basic scaled matrix dot-product operation. The step of inputting the vector sets Q, K2 and V2 into the self-attention module S to obtain the feature representation S(X) is given by:
S(X) = Attention(WqX, [WkX, Sk], [WvX, Sv])
Attention(Q, K, V) = Softmax(QK^T / √dk)V
where Attention(·) denotes the self-attention operator, Softmax(·) compresses the elements of a matrix to the range (0, 1), and dk denotes the dimension of the feature vector.
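A minimal NumPy sketch of this self-attention with appended semantic-association vectors follows; all shapes, the random initial values and the helper names are illustrative assumptions rather than the patent's actual implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_memory(X, Wq, Wk, Wv, Sk, Sv):
    """Scaled dot-product self-attention whose keys and values are
    extended with semantic-association vectors Sk, Sv (shapes assumed)."""
    Q = X @ Wq                                  # queries from image features
    K = np.concatenate([X @ Wk, Sk], axis=0)    # [WkX, Sk]
    V = np.concatenate([X @ Wv, Sv], axis=0)    # [WvX, Sv]
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))           # attention weights in (0, 1)
    return A @ V

# Toy dimensions: 5 image regions, feature dim 8, 3 memory slots.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
Sk = rng.standard_normal((3, 8))
Sv = rng.standard_normal((3, 8))
S_X = attention_with_memory(X, Wq, Wk, Wv, Sk, Sv)
print(S_X.shape)  # (5, 8)
```

Because Sk and Sv are appended to the keys and values only, the number of queries (one per image region) and hence the output shape are unchanged.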
Further, the characteristic representation S (X) is subjected to forward propagation and residual connection regularization to obtain coding informationThis step, the formula is:
Z=AddNorm(S(X))
F(Z)i=Mσ(WZi+b)+c
wherein AddForm (. cndot.) consists of residual concatenation and layer normalization, ZiThe ith vector representing the input, F (Z)iThe output vector representing the i-th forward propagation, M, W the learnable weight parameter, b, c the learnable bias term, σ (-) the activation function.
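The AddNorm and forward-propagation step can be sketched as follows; ReLU is assumed for the unspecified activation σ, and all shapes and initial values are illustrative:

```python
import numpy as np

def add_norm(x, sublayer_out, eps=1e-6):
    # Residual connection followed by layer normalization (AddNorm).
    z = x + sublayer_out
    mu = z.mean(axis=-1, keepdims=True)
    sd = z.std(axis=-1, keepdims=True)
    return (z - mu) / (sd + eps)

def feed_forward(Z, W, b, M, c):
    # Position-wise forward propagation F(Z)_i = M·sigma(W·Z_i + b) + c,
    # with ReLU standing in for the unspecified activation sigma.
    return np.maximum(Z @ W + b, 0.0) @ M + c

rng = np.random.default_rng(1)
X_in = rng.standard_normal((5, 8))   # input to the self-attention sublayer
S_X = rng.standard_normal((5, 8))    # output of the self-attention module
Z = add_norm(X_in, S_X)
W, M = rng.standard_normal((8, 16)), rng.standard_normal((16, 8))
b, c = np.zeros(16), np.zeros(8)
encoded = add_norm(Z, feed_forward(Z, W, b, M, c))
print(encoded.shape)  # (5, 8)
```

Layer normalization centers each row, so every row of `encoded` has (numerically) zero mean.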
Further, the step of inputting the query vector Yq and the vector sets K2 and V2 into the cross-attention module to obtain C is formulated as:
C = Attention_c(Yq, K2, V2)
where Attention_c(·) denotes the cross-attention operator and Wk, Wv and Wq denote learnable weight parameters.
Further, C is passed through the Sigmoid operator and forward propagation to obtain the word probability distribution, where [·,·] denotes matrix concatenation and F(Z) denotes the result of the forward propagation process.
The second technical scheme adopted by the invention is as follows: an attention-based image description system comprising:
the encoder module is used for executing the encoding step and processing the image characteristics to obtain encoding information;
the decoder module is used for executing the decoding step, acquiring the sequence vector information and decoding the coding information to obtain the word probability distribution;
and the loop module is used for executing the loop coding and decoding steps until the preset times are reached and outputting the final image description.
The third technical scheme adopted by the invention is as follows: an attention-based image description apparatus comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the attention-mechanism-based image description method described above.
The method, system and device of the invention have the following beneficial effects: a novel attention mechanism is adopted, and a multi-stage encoder processes the input image features in the encoding stage, so that the degree of association between objects in the image can be measured precisely and the semantic associations hidden between them can be deeply mined, which improves the description quality while reducing model complexity.
Drawings
FIG. 1 is a flow chart of steps of an embodiment of the present invention;
FIG. 2 is a diagram illustrating the insertion of an association vector into a vector set according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
Referring to fig. 1, the present invention provides an attention-based image description method, comprising the steps of:
acquiring image features X of an input image and linearly transforming X to obtain the vector sets Q, K1 and V1;
inserting the semantic association vectors Sk and Sv into the vector sets K1 and V1, respectively, to obtain the vector sets K2 and V2. Specifically, as shown in FIG. 2, the word-embedding representation R is matrix-multiplied with the image feature X to obtain the multi-modal semantic association information RX; weight matrices W1 and W2, whose initial values are set randomly, then linearly convert RX into the vector sets Sk and Sv, respectively, thereby capturing the semantic information implicit in the image feature X. It should be noted that this implicit semantic information is not contained in the Q, K and V obtained by transforming the image features alone;
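The construction of Sk and Sv from the word-embedding representation R could look like the following sketch; the shapes of R, W1 and W2, and the choice of one memory slot per word, are assumptions made only so the matrix products line up, not details given in the patent:

```python
import numpy as np

rng = np.random.default_rng(2)
n_words, n_regions, d = 6, 5, 8

R = rng.standard_normal((n_words, d))    # word-embedding representation R (assumed shape)
X = rng.standard_normal((n_regions, d))  # image features X

# Multi-modal semantic association information RX: word-to-region affinities.
RX = R @ X.T                             # (n_words, n_regions)

# Randomly initialised weight matrices W1, W2 linearly convert RX into the
# semantic-association vector sets Sk, Sv (one slot per word, dim d assumed).
W1 = rng.standard_normal((n_regions, d))
W2 = rng.standard_normal((n_regions, d))
Sk, Sv = RX @ W1, RX @ W2                # (n_words, d) each

# The slots are then appended to the projected keys/values: K2 = [WkX, Sk].
Wk = rng.standard_normal((d, d))
K2 = np.concatenate([X @ Wk, Sk], axis=0)
print(K2.shape)  # (11, 8)
```

The point of the construction is that Sk and Sv depend on both the vocabulary embeddings and the current image, so they carry cross-modal information that the purely image-derived Q, K and V do not.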
inputting the vector sets Q, K2 and V2 into the self-attention module S to obtain the feature representation S(X);
applying forward propagation and residual-connection regularization to the feature representation S(X) to obtain the encoded information;
acquiring the sequence vector information Y of the previous time step and processing Y with the masked self-attention module to obtain the query vector Yq;
inputting the query vector Yq and the vector sets K2 and V2 into the cross-attention module to obtain the decoding result C, and further updating C with residual connection and regularization;
Specifically, C is obtained by feeding the encoded information and the prediction result of the previous time step into cross attention. It is the preliminary decoding result of the current time step and is further processed to obtain the final decoding result of the current time step.
taking the encoded information as the new image features and the word probability distribution as the new sequence vector representation, returning to step S1 until the number of loops reaches four, and then outputting the image description.
Further, as a preferred embodiment of the method, the step of linearly transforming the image feature X to obtain the vector sets Q, K1 and V1 specifically comprises:
taking the dot product of the preset weight matrices Wq, Wk and Wv with the image feature X to obtain the vector sets Q, K1 and V1 of the corresponding representation features.
Specifically, Q = WqX, K1 = WkX, V1 = WvX.
Further, as a preferred embodiment of the method, the step of inserting the semantic association vectors Sk and Sv into the vector sets K1 and V1, respectively, to obtain the vector sets K2 and V2 is formulated as:
K2 = [WkX, Sk]
V2 = [WvX, Sv]
Further, as a preferred embodiment of the method, the self-attention module S consists of a basic scaled matrix dot-product operation, and the step of inputting the vector sets Q, K2 and V2 into the self-attention module S to obtain the feature representation S(X) is given by:
S(X) = Attention(WqX, [WkX, Sk], [WvX, Sv])
Attention(Q, K, V) = Softmax(QK^T / √dk)V
where Attention(·) denotes the self-attention operator, Softmax(·) compresses the elements of a matrix to the range (0, 1), and dk denotes the dimension of the feature vector.
Further, as a preferred embodiment of the method, in the step of applying forward propagation and residual-connection regularization to the feature representation S(X) to obtain the encoded information, the formulas are:
Z = AddNorm(S(X))
F(Z)_i = Mσ(WZ_i + b) + c
where AddNorm(·) consists of residual connection and layer normalization, Z_i denotes the i-th input vector, F(Z)_i denotes the output vector of the i-th forward propagation, M and W denote learnable weight parameters, b and c denote learnable bias terms, and σ(·) denotes the activation function.
As a preferred embodiment of the method, the sequence vector information Y of the previous time step is acquired and processed by the masked self-attention module to obtain the query vector Yq, formulated as:
Yq = AddNorm(Attention_m(Y))
where Attention_m(·) denotes the masked self-attention operator and M denotes the mask matrix.
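A masked self-attention over the sequence vectors Y can be sketched as follows, with the mask matrix M realised as an upper-triangular mask of -inf scores so that position i attends only to positions up to i; shapes and initial values are illustrative:

```python
import numpy as np

def masked_self_attention(Y, Wq, Wk, Wv):
    """Causal (masked) self-attention over the sequence vectors Y:
    position i may only attend to positions <= i."""
    Q, K, V = Y @ Wq, Y @ Wk, Y @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    M = np.triu(np.ones_like(scores), k=1)      # 1 above the diagonal
    scores = np.where(M == 1, -np.inf, scores)  # mask out future positions
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)       # masked softmax weights
    return A @ V

rng = np.random.default_rng(3)
Y = rng.standard_normal((4, 8))  # sequence vectors from previous time steps
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
Yq = masked_self_attention(Y, Wq, Wk, Wv)
print(Yq.shape)  # (4, 8)
```

A quick sanity check on the mask: the first output position can attend only to itself, so its output equals its own value projection.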
Further, as a preferred embodiment of the method, the step of inputting the query vector Yq and the vector sets K2 and V2 into the cross-attention module to obtain C is formulated as:
C = Attention_c(Yq, K2, V2)
where Attention_c(·) denotes the cross-attention operator and Wk, Wv and Wq denote learnable weight parameters.
Further, as a preferred embodiment of the method, C is passed through the Sigmoid operator and forward propagation to obtain the word probability distribution, where [·,·] denotes matrix concatenation and F(Z) denotes the result of the forward propagation process.
As shown in the upper part of FIG. 1, which depicts the connection of the encoder and decoder, the output of each encoder layer is the input of the next encoder layer, the output of each decoder layer is the input of the next decoder layer, and each encoder layer is additionally connected to the decoder layer with the same index. In this way every stage of the encoder's multi-level features is fully decoded in the decoder, avoiding the loss of the initial image features that would otherwise degrade the quality of the final image description. The sequential cascade of four encoder layers enables an accurate measurement of the degree of correlation between objects in the image and deep mining of the semantic associations hidden between them. Of course, the connection is not limited to four encoder-decoder layers.
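The layer cascade just described can be sketched with toy layers; here `toy_layer` is only a stand-in for a full attention-plus-FFN layer, and mean-pooling the encoder memory into the decoder is merely a placeholder for the real cross attention:

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    # Layer normalization over the feature dimension.
    return (z - z.mean(axis=-1, keepdims=True)) / (z.std(axis=-1, keepdims=True) + eps)

def toy_layer(x, W):
    # Stand-in for one encoder/decoder layer (attention and FFN elided).
    return layer_norm(x + np.maximum(x @ W, 0.0))

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 8))   # image features (5 regions, dim 8)
Y = rng.standard_normal((4, 8))   # sequence vectors (4 tokens, dim 8)
enc_W = [rng.standard_normal((8, 8)) for _ in range(4)]
dec_W = [rng.standard_normal((8, 8)) for _ in range(4)]

# Cascade the 4 encoder layers, keeping every intermediate output.
enc_outputs = []
h = X
for W in enc_W:
    h = toy_layer(h, W)
    enc_outputs.append(h)

# Decoder layer i consumes encoder layer i's output, so no stage of the
# multi-level features is lost (mean-pooled here as a cross-attention stand-in).
g = Y
for W, mem in zip(dec_W, enc_outputs):
    g = toy_layer(g + mem.mean(axis=0), W)
print(g.shape)  # (4, 8)
```

The key design point is that all four intermediate encoder outputs are retained and fed to the matching decoder layers, rather than only the last one.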
An attention-based image description system comprising:
the encoder module is used for executing the encoding step and processing the image characteristics to obtain encoding information;
the decoder module is used for executing the decoding step, acquiring the sequence vector information and decoding the coding information to obtain the word probability distribution;
and the loop module is used for executing the loop coding and decoding steps until the preset times are reached and outputting the final image description.
The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.
An attention-based image description apparatus:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the attention-mechanism-based image description method described above.
The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (8)
1. An attention mechanism-based image description method is characterized by comprising the following steps:
acquiring image features X of an input image and linearly transforming X to obtain vector sets Q, K1 and V1;
inserting semantic association vectors Sk and Sv into the vector sets K1 and V1, respectively, to obtain vector sets K2 and V2;
inputting the vector sets Q, K2 and V2 into a self-attention module S to obtain a feature representation S(X);
applying forward propagation and residual-connection regularization to the feature representation S(X) to obtain encoded information;
acquiring sequence vector information Y of a previous time step and processing Y with a masked self-attention module to obtain a query vector Yq;
inputting the query vector Yq and the vector sets K2 and V2 into a cross-attention module to obtain a decoding result C, and further updating C with residual connection and regularization;
passing the decoding result C through a Sigmoid operator and forward propagation to obtain a word probability distribution.
2. The attention-mechanism-based image description method as claimed in claim 1, wherein the step of linearly transforming the image feature X to obtain the vector sets Q, K1 and V1 specifically comprises:
taking the dot product of preset weight matrices Wq, Wk and Wv with the image feature X to obtain the vector sets Q, K1 and V1 of the corresponding representation features.
3. The attention-mechanism-based image description method as claimed in claim 2, wherein the self-attention module S consists of a basic scaled matrix dot-product operation, and the step of inputting the vector sets Q, K2 and V2 into the self-attention module S to obtain the feature representation S(X) is given by:
S(X) = Attention(WqX, [WkX, Sk], [WvX, Sv])
4. The attention-mechanism-based image description method as claimed in claim 3, wherein in the step of applying forward propagation and residual-connection regularization to the feature representation S(X) to obtain the encoded information, the formulas are:
Z = AddNorm(S(X))
F(Z)_i = Mσ(WZ_i + b) + c
where AddNorm(·) consists of residual connection and layer normalization, Z_i denotes the i-th input vector, F(Z)_i denotes the output vector of the i-th forward propagation, M and W denote learnable weight parameters, b and c denote learnable bias terms, and σ(·) denotes the activation function.
5. The attention-mechanism-based image description method as claimed in claim 4, wherein the step of inputting the query vector Yq and the vector sets K2 and V2 into the cross-attention module to obtain the decoding result C is formulated as:
C = Attention_c(Yq, K2, V2)
where Attention_c(·) denotes the cross-attention operator and Wk, Wv and Wq denote learnable weight parameters.
6. The attention-mechanism-based image description method as claimed in claim 5, wherein the decoding result C is processed by the Sigmoid operator and forward propagation to obtain the word probability distribution.
7. An attention-based image description system, comprising:
the encoder module is used for executing the encoding step and processing the image characteristics to obtain encoding information;
the decoder module is used for executing the decoding step, acquiring the sequence vector information and decoding the coding information to obtain the word probability distribution;
and the loop module is used for executing the loop coding and decoding steps until the preset times are reached and outputting the final image description.
8. An attention-based image description apparatus, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the attention-mechanism-based image description method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110457256.5A CN113095431B (en) | 2021-04-27 | 2021-04-27 | Image description method, system and device based on attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110457256.5A CN113095431B (en) | 2021-04-27 | 2021-04-27 | Image description method, system and device based on attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113095431A true CN113095431A (en) | 2021-07-09 |
CN113095431B CN113095431B (en) | 2023-08-18 |
Family
ID=76680498
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110457256.5A Active CN113095431B (en) | 2021-04-27 | 2021-04-27 | Image description method, system and device based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113095431B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114186568A (en) * | 2021-12-16 | 2022-03-15 | 北京邮电大学 | Image paragraph description method based on relational coding and hierarchical attention mechanism |
CN114399646A (en) * | 2021-12-21 | 2022-04-26 | 北京中科明彦科技有限公司 | Image description method and device based on Transformer structure |
CN114581543A (en) * | 2022-03-28 | 2022-06-03 | 济南博观智能科技有限公司 | Image description method, device, equipment and storage medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180096224A1 (en) * | 2016-10-05 | 2018-04-05 | Ecole Polytechnique Federale De Lausanne (Epfl) | Method, System, and Device for Learned Invariant Feature Transform for Computer Images |
US20180341860A1 (en) * | 2017-05-23 | 2018-11-29 | Google Llc | Attention-based sequence transduction neural networks |
CN109543820A (en) * | 2018-11-23 | 2019-03-29 | 中山大学 | Iamge description generation method based on framework short sentence constrained vector and dual visual attention location mechanism |
CN110427605A (en) * | 2019-05-09 | 2019-11-08 | 苏州大学 | The Ellipsis recovering method understood towards short text |
CN110458282A (en) * | 2019-08-06 | 2019-11-15 | 齐鲁工业大学 | Multi-angle multi-mode fused image description generation method and system |
WO2020190112A1 (en) * | 2019-03-21 | 2020-09-24 | Samsung Electronics Co., Ltd. | Method, apparatus, device and medium for generating captioning information of multimedia data |
CN111723937A (en) * | 2019-03-21 | 2020-09-29 | 北京三星通信技术研究有限公司 | Method, device, equipment and medium for generating description information of multimedia data |
US20210027018A1 (en) * | 2019-07-22 | 2021-01-28 | Advanced New Technologies Co., Ltd. | Generating recommendation information |
CN112329794A (en) * | 2020-11-06 | 2021-02-05 | 北京工业大学 | Image description method based on double self-attention mechanism |
US20210049236A1 (en) * | 2019-08-15 | 2021-02-18 | Salesforce.Com, Inc. | Systems and methods for a transformer network with tree-based attention for natural language processing |
- 2021-04-27: application CN202110457256.5A filed; patent CN113095431B granted, status Active
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180096224A1 (en) * | 2016-10-05 | 2018-04-05 | Ecole Polytechnique Federale De Lausanne (Epfl) | Method, System, and Device for Learned Invariant Feature Transform for Computer Images |
US20180341860A1 (en) * | 2017-05-23 | 2018-11-29 | Google Llc | Attention-based sequence transduction neural networks |
CN109543820A (en) * | 2018-11-23 | 2019-03-29 | 中山大学 | Iamge description generation method based on framework short sentence constrained vector and dual visual attention location mechanism |
WO2020190112A1 (en) * | 2019-03-21 | 2020-09-24 | Samsung Electronics Co., Ltd. | Method, apparatus, device and medium for generating captioning information of multimedia data |
CN111723937A (en) * | 2019-03-21 | 2020-09-29 | 北京三星通信技术研究有限公司 | Method, device, equipment and medium for generating description information of multimedia data |
CN110427605A (en) * | 2019-05-09 | 2019-11-08 | 苏州大学 | The Ellipsis recovering method understood towards short text |
US20210027018A1 (en) * | 2019-07-22 | 2021-01-28 | Advanced New Technologies Co., Ltd. | Generating recommendation information |
CN110458282A (en) * | 2019-08-06 | 2019-11-15 | 齐鲁工业大学 | Multi-angle multi-mode fused image description generation method and system |
US20210049236A1 (en) * | 2019-08-15 | 2021-02-18 | Salesforce.Com, Inc. | Systems and methods for a transformer network with tree-based attention for natural language processing |
CN112329794A (en) * | 2020-11-06 | 2021-02-05 | 北京工业大学 | Image description method based on double self-attention mechanism |
Non-Patent Citations (1)
Title |
---|
Zhang Jiashuo (张家硕): "Image Description Generation Based on Bidirectional Attention Mechanism", Journal of Chinese Information Processing *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114186568A (en) * | 2021-12-16 | 2022-03-15 | 北京邮电大学 | Image paragraph description method based on relational coding and hierarchical attention mechanism |
CN114399646A (en) * | 2021-12-21 | 2022-04-26 | 北京中科明彦科技有限公司 | Image description method and device based on Transformer structure |
CN114581543A (en) * | 2022-03-28 | 2022-06-03 | 济南博观智能科技有限公司 | Image description method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN113095431B (en) | 2023-08-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11657230B2 (en) | Referring image segmentation | |
CN113095431B (en) | Image description method, system and device based on attention mechanism | |
CN108846077B (en) | Semantic matching method, device, medium and electronic equipment for question and answer text | |
CN113312500A (en) | Method for constructing event map for safe operation of dam | |
CN111522936B (en) | Intelligent customer service dialogue reply generation method and device containing emotion and electronic equipment | |
CN116662582B (en) | Specific domain business knowledge retrieval method and retrieval device based on natural language | |
CN113837233B (en) | Image description method of self-attention mechanism based on sample self-adaptive semantic guidance | |
CN113609326B (en) | Image description generation method based on relationship between external knowledge and target | |
CN113392265A (en) | Multimedia processing method, device and equipment | |
CN115203409A (en) | Video emotion classification method based on gating fusion and multitask learning | |
CN115994317A (en) | Incomplete multi-view multi-label classification method and system based on depth contrast learning | |
Hossain et al. | Bi-SAN-CAP: Bi-directional self-attention for image captioning | |
CN115831105A (en) | Speech recognition method and device based on improved Transformer model | |
Zhuang et al. | Improving remote sensing image captioning by combining grid features and transformer | |
CN117746078B (en) | Object detection method and system based on user-defined category | |
CN117635275B (en) | Intelligent electronic commerce operation commodity management platform and method based on big data | |
CN114120166A (en) | Video question and answer method and device, electronic equipment and storage medium | |
CN113869324A (en) | Video common-sense knowledge reasoning implementation method based on multi-mode fusion | |
CN117315249A (en) | Image segmentation model training and segmentation method, system, equipment and medium | |
CN117496388A (en) | Cross-modal video description model based on dynamic memory network | |
CN116524407A (en) | Short video event detection method and device based on multi-modal representation learning | |
CN115658856A (en) | Intelligent question-answering system and method based on polymorphic document views | |
CN116310984B (en) | Multi-mode video subtitle generating method based on Token sampling | |
CN111666395A (en) | Interpretable question answering method and device oriented to software defects, computer equipment and storage medium | |
Ouenniche et al. | Vision-text cross-modal fusion for accurate video captioning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |